Forum: >>> Magnum BBS <<<

Re: Tonights Tradeoff - Background Execution Buffers

From MitchAlsup1@21:1/5 to Robert Finch on Tue Sep 24 20:38:52 2024

On Tue, 24 Sep 2024 20:03:29 +0000, Robert Finch wrote:

> Under construction: Q+ background execution buffers for the block memory
> operations. For instance, a block store operation can be executed in the
> background while other instructions are executing. Store operations are
> issued when the MEM unit is not busy. Background instructions continue
> to execute even when interrupts occur. The background operations may be
> useful for initializing blocks of memory that are not needed right away.
> When the operation is issued a handle for the buffer is returned in the
> destination register so that the status of the operation may be queried,
> or the operation cancelled.

This is how My 66000 performs:: LDM, STM, ENTER, EXIT, MM, and MS.
Addresses are AGENED and then a state machine over in the memory
unit performs the required steps. {{Not usefully different than the
divider performing the individual steps of division.}} While the
unit performs its duties, other units can be fed and complete
other instructions.

You just have to mark the affected registers to prevent hazards.
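
A rough software model may help make the handle-based scheme concrete. This is only a sketch under assumptions: the names `bg_issue_store`, `bg_mem_idle_cycle`, `bg_query`, and `bg_cancel`, the four-buffer limit, and the one-store-per-idle-cycle policy are all hypothetical and are not taken from Q+ or My 66000.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of background execution buffers for block stores.
   All names and sizes here are illustrative assumptions. */
enum bg_status { BG_IDLE, BG_RUNNING, BG_DONE, BG_CANCELLED };

struct bg_buffer {
    uint64_t *dst;        /* next word to write, captured at issue */
    uint64_t  value;      /* fill value, captured at issue */
    size_t    remaining;  /* words still to store */
    enum bg_status status;
};

#define BG_BUFFERS 4
static struct bg_buffer bg[BG_BUFFERS];   /* zero-initialised: all BG_IDLE */

/* Issue a block store. Returns a handle (buffer index), or -1 if every
   buffer is busy. A finished buffer is reused, so its old status is lost. */
int bg_issue_store(uint64_t *dst, uint64_t value, size_t words)
{
    assert(dst != NULL);
    for (int h = 0; h < BG_BUFFERS; h++) {
        if (bg[h].status != BG_RUNNING) {
            bg[h] = (struct bg_buffer){ dst, value, words,
                                        words ? BG_RUNNING : BG_DONE };
            return h;
        }
    }
    return -1;
}

/* Model one cycle in which the MEM unit is not busy: perform a single
   store on behalf of the first running buffer, if any. */
void bg_mem_idle_cycle(void)
{
    for (int h = 0; h < BG_BUFFERS; h++) {
        struct bg_buffer *b = &bg[h];
        if (b->status == BG_RUNNING) {
            *b->dst++ = b->value;
            if (--b->remaining == 0)
                b->status = BG_DONE;
            return;
        }
    }
}

/* Query the status of an operation through its handle. */
enum bg_status bg_query(int h)
{
    assert(h >= 0 && h < BG_BUFFERS);
    return bg[h].status;
}

/* Cancel an operation; words already stored stay stored. */
void bg_cancel(int h)
{
    assert(h >= 0 && h < BG_BUFFERS);
    if (bg[h].status == BG_RUNNING)
        bg[h].status = BG_CANCELLED;
}
```

Because the buffer copies the address, value, and count at issue, the model does not need to keep any architectural register reserved while the operation runs.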

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Robert Finch on Thu Sep 26 14:11:09 2024

On Thu, 26 Sep 2024 8:13:12 +0000, Robert Finch wrote:

> On 2024-09-24 4:38 p.m., MitchAlsup1 wrote:
>
>> On Tue, 24 Sep 2024 20:03:29 +0000, Robert Finch wrote:
>>
>>> Under construction: Q+ background execution buffers for the block memory
>>> operations. For instance, a block store operation can be executed in the
>>> background while other instructions are executing. Store operations are
>>> issued when the MEM unit is not busy. Background instructions continue
>>> to execute even when interrupts occur. The background operations may be
>>> useful for initializing blocks of memory that are not needed right away.
>>> When the operation is issued a handle for the buffer is returned in the
>>> destination register so that the status of the operation may be queried,
>>> or the operation cancelled.
>>
>> This is how My 66000 performs:: LDM, STM, ENTER, EXIT, MM, and MS.
>> Addresses are AGENED and then a state machine over in the memory
>> unit performs the required steps. {{Not usefully different than the
>> divider performing the individual steps of division.}} While the
>> unit performs its duties, other units can be fed and complete
>> other instructions.
>>
>> You just have to mark the affected registers to prevent hazards.
>
> Q+ releases the registers right away, so things can continue on.
> Q+ captures the register values at issue then does not modify the
> registers. Did not want an instruction with three updates happening. It
> keeps track of its own values. In theory anyway. Have not got to testing
> it yet. A status operation might be used to query the final operation
> results.
>
> Altering Q+ to use 64-bit instructions and 256 registers instead of
> supporting a vector instruction set. Two pipeline stages can be removed
> then and it is a simpler design. Code density will decrease <200%.
> Relying on software to assign registers for vectors.
>
> Also adding a predicate field to instructions. Branches are horrendously
> slow in this simple implementation. It may be faster to predicate a
> dozen instructions.

The depth of predication should be such that if FETCH will "get there"
by the time the branch "resolves" that number of instructions should
be predicated.


From Anton Ertl@21:1/5 to Robert Finch on Fri Oct 4 06:19:31 2024

Robert Finch <[email protected]> writes:

> Today I am wondering how many predicate registers are enough. Scanning
> webpages reveals a variety. The Itanium has 64 predicates, but they are
> used for modulo loops and rotated. Rotating registers are Itanium's method
> of register renaming, so it needs more visible registers. In a classic
> superscalar design with a RAT where registers are renamed, it seems like
> 64 would be far too many.

Would it? Zen5 has 192 flags registers <https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/09/hc2024_zen5_spec_uplift.png?ssl=1>,
and I assume that means it has 192 C, 192 V, and 192 NZP registers
(physical), for one architectural flags register.

> I cannot see the compiler making use of very many predicate registers
> simultaneously.

Maybe not, but what are the alternatives:

1) Have one flags register, like AMD64 and ARM A32, T32, and A64, or
the carry flag of Power and 88K, and the flags result of most Power
instructions. Then the compiler typically only knows that other
instructions will overwrite that register, and is forced to consume
the flag right away. This leads to bad code generation, as shown in
<[email protected]>:

|E.g., in
|<[email protected]> we see that gcc-5.3.0
|compiles
|
| cf = _addcarry_u64(cf, src1[1], src2[1], &dst[1]);
| cf = _addcarry_u64(cf, src1[2], src2[2], &dst[2]);
|
|into
|
| d: 48 8b 42 08 mov 0x8(%rdx),%rax
|11: 41 80 c1 ff add $0xff,%r9b
|15: 49 13 40 08 adc 0x8(%r8),%rax
|19: 41 0f 92 c1 setb %r9b
|1d: 48 89 41 08 mov %rax,0x8(%rcx)
|21: 48 8b 42 10 mov 0x10(%rdx),%rax
|25: 41 80 c1 ff add $0xff,%r9b
|29: 49 13 40 10 adc 0x10(%r8),%rax
|2d: 41 0f 92 c1 setb %r9b
|31: 48 89 41 10 mov %rax,0x10(%rcx)
|
|Here gcc reifies the carry bit in a GPR (r9b) with the instructions at
|19 and 2d, and also converts it from a GPR into a carry flag in 11 and
|25. This shows that the compiler does not trust itself to preserve
|the carry flag from one adc to the next.

2) Have multiple flags registers, like IA-64. The compiler will
certainly be able to deal with that, but extra instructions are needed
for generating the flags.

3) Use the GPRs for flags. This also often requires additional
instructions for generating the flags, as in MIPS, 88K, or RISC-V
(with quite a bit of difference between the MIPS/Alpha/RISC-V
approach and the 88K approach). This disadvantage is often mitigated
by having compare-and-branch instructions or instructions that branch
on certain properties of a register's content.

4) Keep the flags results along with GPRs: have carry and overflow as
bits 64 and 65, N is bit 63, and Z tells something about bits 0-63.
The advantage is that you do not have to track the flags separately
(and, in case of AMD64, track each of C, O, and NZP separately), but
instead can use the RAT that is already there for the GPRs. You can
find a preliminary paper on that on <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
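
To make option 4 more concrete, here is a minimal C model of a GPR that carries its own carry and overflow bits. It is a sketch under assumptions: the struct `gpr66` and the functions `gpr_add`, `gpr_negative`, `gpr_zero`, and `add128` are hypothetical names, and the paper linked above remains the reference for the actual proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a 66-bit GPR: 64 value bits plus carry ("bit 64")
   and overflow ("bit 65"). The struct layout is illustrative only. */
struct gpr66 {
    uint64_t value;
    bool carry;     /* unsigned carry out of the add that produced value */
    bool overflow;  /* signed overflow of the add that produced value */
};

/* Add two registers plus a carry in; the flags travel with the result. */
struct gpr66 gpr_add(struct gpr66 a, struct gpr66 b, bool carry_in)
{
    struct gpr66 r;
    uint64_t s = a.value + b.value;           /* wraps modulo 2^64 */
    r.value = s + carry_in;
    r.carry = (s < a.value) || (r.value < s);
    /* Signed overflow: operands share a sign that the result lacks. */
    r.overflow = ((~(a.value ^ b.value) & (a.value ^ r.value)) >> 63) != 0;
    return r;
}

/* N and Z need no extra storage: they are derived from bits 0-63. */
bool gpr_negative(struct gpr66 r) { return (r.value >> 63) != 0; }
bool gpr_zero(struct gpr66 r)     { return r.value == 0; }

/* Usage: a 128-bit add in which the high half consumes the carry that
   travels with the low result, so no separate flags register is tracked. */
void add128(const uint64_t a[2], const uint64_t b[2], uint64_t sum[2])
{
    assert(a && b && sum);
    struct gpr66 lo = gpr_add((struct gpr66){ a[0], false, false },
                              (struct gpr66){ b[0], false, false }, false);
    struct gpr66 hi = gpr_add((struct gpr66){ a[1], false, false },
                              (struct gpr66){ b[1], false, false }, lo.carry);
    sum[0] = lo.value;
    sum[1] = hi.value;
}
```

Since the flags are fields of the result itself, renaming the result register renames the flags with it, which is the point made about reusing the existing RAT.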

> Since they are not used simultaneously, and register
> renaming is in effect, there should not be a great need for predicate
> registers.

You need to preserve one instance for every recovery point, i.e.,
every instruction that branches or can trap and that has not yet
been committed. You also need to preserve one instance if there is
any consumer that has not yet proceeded through execution. The
simplest way to satisfy both requirements is to just preserve any
flags result until the generating instruction retires. And if most
instructions generate flags, that means a lot of instances of the
flags. There is a reason why Zen5 has 192.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Anton Ertl@21:1/5 to Robert Finch on Sat Oct 5 09:43:09 2024

Robert Finch <[email protected]> writes:

> On 2024-10-04 2:19 a.m., Anton Ertl wrote:
>
>> 4) Keep the flags results along with GPRs: have carry and overflow as
>> bits 64 and 65, N is bit 63, and Z tells something about bits 0-63.
>> The advantage is that you do not have to track the flags separately
>> (and, in case of AMD64, track each of C, O, and NZP separately), but
>> instead can use the RAT that is already there for the GPRs. You can
>> find a preliminary paper on that on
>> <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.

...

> One solution, not mentioned in your article, is to support arithmetic
> with two bits less than the number of bits a register can support, so
> that the carry and overflow can be stored. On a 64-bit machine have all
> operations use only 62 bits. It would solve the issue of how to load or
> store the carry and overflow bits associated with a register.

Yes, that's a solution, but the question is how well existing software
would react to having no int64_t (and equivalent types, such as long
long), but instead an int62_t (or maybe int63_t, if the 64th bit is
used for both signed and unsigned overflow, by having separate signed
and unsigned addition etc.). I expect that such an architecture would
have low acceptance. By contrast, in my paper I suggest an addition
to existing 64-bit architectures that shares fewer of the
disadvantages of the widely used condition-code-register approach,
though it still has a few of them.

> Sometimes
> arithmetic is performed with fewer bits, as for pointer representation.
> I wonder if pointer masking could somehow be involved. It may be useful
> to have a bit indicating the presence of a pointer. Also thinking of how
> to track a binary point position for fixed point arithmetic. Perhaps
> using the whole upper byte of a register for status/control bits would
> work.

There are some extensions for AMD64 in that direction.

> It may be possible with Q+ to support a second destination register
> which is in a subset of the GPRs. For example, one of eight registers
> could be specified to hold the carry/overflow status. That effectively
> ties up a second ALU though as an extra write port is needed for the
> instruction.

Needing only one write port is an advantage of my approach.

- anton

From MitchAlsup1@21:1/5 to Robert Finch on Sat Oct 5 23:02:28 2024

On Fri, 4 Oct 2024 4:04:20 +0000, Robert Finch wrote:

> Today I am wondering how many predicate registers are enough.

Realistically 3 as long as they are orthogonal to each other and to
code.

> Scanning webpages reveals a variety. The Itanium has 64 predicates, but
> they are used for modulo loops and rotated. Rotating registers are
> Itanium's method of register renaming, so it needs more visible
> registers. In a classic superscalar design with a RAT where registers
> are renamed, it seems like 64 would be far too many. Cray had eight
> vector mask registers.

In the ECL logic Cray used, an 8:1 multiplexer costs no more than a
2:1 multiplexer in power and gate count.

> I think the RISC-V Hwacha has 16 if I looked at the diagram correctly.
> I cannot see the compiler making use of very many predicate registers
> simultaneously. Since they are not used simultaneously, and register
> renaming is in effect, there should not be a great need for predicate
> registers.

With the orthogonality mentioned above, 3 covers a tree with 8 branches
or a 3-deep if()then{}else{}.
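
One way to read the "3 covers 8 branches" remark is sketched below in C: three independent predicate bits pick exactly one of 2^3 = 8 leaves, which is the same selection a 3-deep nested if/else makes. The function names and the leaf numbering are illustrative assumptions, not anything from My 66000 or Q+.

```c
#include <assert.h>
#include <stdbool.h>

/* Three independent predicate bits select one of 2^3 = 8 leaves.
   The leaf numbering (p0 is the most significant bit) is arbitrary. */
unsigned leaf_index(bool p0, bool p1, bool p2)
{
    return ((unsigned)p0 << 2) | ((unsigned)p1 << 1) | (unsigned)p2;
}

/* The same selection written as the 3-deep if/else tree that the
   predicates would replace. */
unsigned leaf_nested(bool p0, bool p1, bool p2)
{
    if (p0) {
        if (p1) { if (p2) return 7; else return 6; }
        else    { if (p2) return 5; else return 4; }
    } else {
        if (p1) { if (p2) return 3; else return 2; }
        else    { if (p2) return 1; else return 0; }
    }
}

/* Usage: both forms agree for all 8 predicate combinations. */
void check_all_leaves(void)
{
    for (unsigned m = 0; m < 8; m++) {
        bool p0 = m & 4, p1 = m & 2, p2 = m & 1;
        assert(leaf_index(p0, p1, p2) == m);
        assert(leaf_nested(p0, p1, p2) == m);
    }
}
```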

> Suppose one wants predicated logic in a loop with the predicate being
> set outside of the loop.

1 decision per predicate,

> It may be desirable to have several blocks of
> logic predicated by different predicates in the loop. It is likely
> desirable to have more than one predicate then.
>
> Reserved four bits in the instruction for predicates. Do not want to
> waste bits though. Using a 64-bit instruction.


From Scott Lurndal@21:1/5 to Robert Finch on Wed Oct 9 14:43:37 2024

Robert Finch <[email protected]> writes:

> On 2024-10-05 5:43 a.m., Anton Ertl wrote:
>
> Been thinking some about the carry and overflow and what to do about
> register spills and reloads during expression processing. My thought was
> that on the machine with 256 registers, simply allocate a ridiculous
> number of registers for expression processing, for example 25 or even
> 50. Then if the expression is too complex, have the compiler spit out an
> error message to the programmer to simplify the expression.

Completely unacceptable.

I spent several days back in 1990 fixing a temporary register
bug in PCC caused by a complex expression. The expression
was generated by cfront, so there was no way to "simplify"
the expression.


From MitchAlsup1@21:1/5 to Robert Finch on Wed Oct 9 16:19:41 2024

On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:

> Been thinking some about the carry and overflow and what to do about
> register spills and reloads during expression processing. My thought was
> that on the machine with 256 registers, simply allocate a ridiculous
> number of registers for expression processing, for example 25 or even
> 50. Then if the expression is too complex, have the compiler spit out an
> error message to the programmer to simplify the expression. Remnants of
> the ‘expression too complex’ error in BASIC.

Both completely unacceptable, and in your case completely unnecessary.
In 967 subroutines I read out of My 66000 LLVM compile, I only have
3 cases of spill-fill, and that is with only 32 registers with
universal constants.

Of the RISC-V code I read alongside with 32+32 registers, I counted 8.

With those statistics and 256 registers, if you can't get to essentially
0 spill-fill the problem is not with your architecture but with your
compiler.


From Anton Ertl@21:1/5 to All on Sun Oct 13 16:43:53 2024

Robert Finch <[email protected]> writes:

[Context: carry and overflow in GPRs
<https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>]

> Been thinking some about the carry and overflow and what to do about
> register spills and reloads during expression processing. My thought was
> that on the machine with 256 registers, simply allocate a ridiculous
> number of registers for expression processing, for example 25 or even
> 50. Then if the expression is too complex, have the compiler spit out an
> error message to the programmer to simplify the expression. Remnants of
> the ‘expression too complex’ error in BASIC. So, there are no spills or
> reloads during expression processing.

The first question is how carry and overflow are represented in the
programming language.

Currently there are programming languages with growable integers, and
overflow is needed short-term for that, so spilling the overflow bit
is probably not necessary for that (and indeed, the one overflow bit
of AMD64 or ARM A64 that is not preserved across calls is good enough
for that).

For dealing with multiple-precision integers (e.g., when the growable
integers actually grow to more than one word), typically library
routines are used, but sure, one could also have a programming
language that computes with multi-precision integers and then is
compiled into either loops over the individual words of these numbers,
or it unrolls these loops (if the length is known in advance). Yes,
if you run out of registers there, you may want to spill and refill a
register, including its carry bit. But that should be rare, so if
it's an expensive operation, we can live with it.
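
As an illustration of the "loops over the individual words" case, here is a minimal portable C sketch of a multi-word addition with carry propagation. The helper name `add_multiword` and the least-significant-word-first layout are assumptions for the example; a real library would typically use adc-style instructions or intrinsics instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* dst = src1 + src2 over n 64-bit words, least significant word first.
   Returns the carry out of the most significant word (0 or 1). */
unsigned add_multiword(uint64_t *dst, const uint64_t *src1,
                       const uint64_t *src2, size_t n)
{
    assert(dst && src1 && src2);
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = src1[i] + src2[i];  /* wraps modulo 2^64 */
        unsigned c1 = s < src1[i];        /* carry out of a + b */
        uint64_t t  = s + carry;
        unsigned c2 = t < s;              /* carry out of adding carry in */
        dst[i] = t;
        carry = c1 | c2;                  /* both cannot be set at once */
    }
    return carry;
}
```

The carry lives in an ordinary variable between iterations, which is the software analogue of the carry bit having to survive from one word's addition to the next.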

What we have now is things like the GNU C extension

bool __builtin_add_overflow (type1 a, type2 b, type3 *res);

This produces two different results, the return value, and res. With
the kind of architecture I have in mind, these two results could be
allocated into the same register. If at some point the register has
to be spilled, the two results can be stored into different memory
locations, and on refill they will land in different GPRs unless the
compiler writer really puts a lot more work in than is merited (I
don't expect many spills and refills).
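
For reference, a small usage sketch of that builtin as it exists in GCC and Clang. The wrapper name `checked_add_i64` is hypothetical; the builtin returns true when the mathematically exact sum does not fit the destination type, and stores the wrapped result either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Checked signed 64-bit add using the GNU C builtin.
   Returns true on overflow; *sum receives the wrapped result regardless.
   The two results (flag and sum) are what could share one register on an
   architecture that keeps the overflow bit alongside the GPR. */
bool checked_add_i64(int64_t a, int64_t b, int64_t *sum)
{
    assert(sum != NULL);
    return __builtin_add_overflow(a, b, sum);
}
```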

> I think the storeextra / loadextra
> registers used during context switching would work okay. But in Q+ there
> are 256 regs which require eight storeextra / loadextra registers. I
> think the storeextra / loadextra registers could be hidden in the
> context save and restore hardware. Not even requiring access via CSRs or
> whatever.

Yes. In my paper I wanted to spell out an implementation that does
not look like I am ignoring some hard problems and shoving them over to
the implementor. If a computer architect wants to pick my idea up,
they are welcome to implement context-switching in any way they deem
appropriate.

> I suppose context loads and stores could be done in blocks of
> 32 registers. An issue is that the load extra needs to be done before
> registers are loaded.

Maybe, with 256 GPRs, you would use 8 storeextra and 8 loadextra
registers, each one associated with 32 registers. This avoids having
to make the whole process a sequential operation working on 32-GPR
blocks. Just store all 256 GPRs, sync (to get the storeextra
registers up-to-date), then store the 8 storeextra registers. For
context load, load the 8 loadextra registers, sync (so the loads of
the loadextra registers are finished), then load the 256 GPRs.

Or alternatively just have 8 extra registers that are used for both
context stores and context loads. Then you cannot use the same sync
for both storing and loading, but you may prefer a little more
context-switch overhead to needing 16 extra registers.
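
The first variant (separate storeextra and loadextra registers) can be modelled in C roughly as follows. This is a sketch under assumptions: each extra register is taken to pack the two flag bits (carry, overflow) of its 32 GPRs into one 64-bit word, and the bit layout, the struct names, and the functions `context_save` and `context_load` are hypothetical. The "sync" steps are only comments, because a sequential software model has nothing to wait for.

```c
#include <assert.h>
#include <stdint.h>

#define NGPR   256
#define GROUP  32                  /* GPRs per extra register */
#define NEXTRA (NGPR / GROUP)      /* 8 extra registers */

/* A GPR with its two extra bits (bit 0: carry, bit 1: overflow). */
struct gpr { uint64_t value; unsigned flags; };

/* In-memory context image: 256 values followed by 8 packed flag words. */
struct context {
    uint64_t value[NGPR];
    uint64_t extra[NEXTRA];        /* 32 x 2 flag bits per word */
};

/* Context store: all 256 GPR values first, then the 8 extra words. */
void context_save(struct context *ctx, const struct gpr regs[NGPR])
{
    assert(ctx && regs);
    uint64_t storeextra[NEXTRA] = {0};
    for (int i = 0; i < NGPR; i++) {
        ctx->value[i] = regs[i].value;
        /* Each GPR store also deposits its flag bits in its group's word. */
        storeextra[i / GROUP] |=
            (uint64_t)(regs[i].flags & 3u) << (2 * (i % GROUP));
    }
    /* Hardware would sync here so the storeextra registers are up to date. */
    for (int g = 0; g < NEXTRA; g++)
        ctx->extra[g] = storeextra[g];
}

/* Context load: the 8 extra words first, then the 256 GPR values. */
void context_load(struct gpr regs[NGPR], const struct context *ctx)
{
    assert(ctx && regs);
    uint64_t loadextra[NEXTRA];
    for (int g = 0; g < NEXTRA; g++)
        loadextra[g] = ctx->extra[g];
    /* Hardware would sync here so the extra bits are ready for the loads. */
    for (int i = 0; i < NGPR; i++) {
        regs[i].value = ctx->value[i];
        regs[i].flags =
            (unsigned)(loadextra[i / GROUP] >> (2 * (i % GROUP))) & 3u;
    }
}
```

With one extra word per 32-register group, the 256 value transfers on either side of the sync have no ordering constraint among themselves, which is the benefit described above.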

> Another thought is to store additional info such as a CRC check of the
> register file on context save and restore.

Typically ECC memory and something similar in bus protocols achieve
what I guess you want to achieve with the CRC checks.

- anton
