Forum: >>> Magnum BBS <<<

Re: The Third Wish

From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 24 21:57:22 2025

On Tue, 24 Jun 2025 17:19:05 +0000, quadibloc wrote:

On Mon, 23 Jun 2025 12:43:05 +0000, quadibloc wrote:

Having noted that I was using up just about the very last dregs of the
available opcode space for block headers...

I decided to dig even deeper, and create the twelfth, thirteenth, and
fourteenth header types, which, together, make the possibilities of
expanding the instruction set truly limitless, by allowing up to 128
alternate instruction sets to be added to what is currently present.

Now there are fifteen header types. The new one was added as the sixth
header type, necessitating renumbering of those that came after.

The sixth header type is a 64-bit header that prefixes a four-bit prefix
to every remaining 16 bits in the instruction.

It added the possibility of having 35-bit and 53-bit instructions.

I saw what I could use the 35-bit instructions for right away:
memory-to-register operate instructions that could have all 32
registers, not just the first eight, as destinations.

But shortly afterwards, I saw a use for the 53-bit instructions: a
modified form of the string and packed decimal instructions that could
use conventional addresses with 16-bit displacements, not just the
alternate types of address with shorter displacements.

And then when I stepped back and saw what I had achieved...

it hit me what one more thing I needed to add to complete it.

So one unused code in the three-bit prefixes used in the Type III
header, carried over to the four-bit prefixes here, was now given a
purpose. It was used to indicate a "Special 16-bit instruction".

This was like the operate instructions in the 17-bit short instructions,
with a full 5-bit source register and destination register field. But
there were only six bits for the opcode. No other operations were
provided, as 17-bit short instructions already provide those.

What operations are added to the short instruction repertoire by the
special 16-bit instructions?

Mainly floating-point instructions which use the Compatible
floating-point format.

You know, the one with a sign bit, a seven bit power-of-16 exponent, and
a mantissa that can be thought of as being made up of hexadecimal
digits.
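
A minimal sketch of how a number in such a format decodes, assuming the 32-bit short layout of S/360 hexadecimal floating point (sign bit, excess-64 exponent, 24-bit fraction). The function name is illustrative, and whether the Compatible format matches this layout bit for bit is an assumption rather than something stated in the post.

```python
def decode_hex_float32(word: int) -> float:
    """Decode a 32-bit hexadecimal floating-point word.

    Assumed layout (S/360 short format): bit 31 is the sign, bits 30..24
    are a seven-bit power-of-16 exponent in excess-64 form, and bits 23..0
    are a fraction of six hexadecimal digits.
    """
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 24) & 0x7F          # seven-bit exponent field
    fraction = word & 0xFFFFFF              # six hexadecimal digits
    # The fraction is read as 0.hhhhhh in base 16, so divide by 16**6.
    return sign * (fraction / float(1 << 24)) * 16.0 ** (exponent - 64)


if __name__ == "__main__":
    for word in (0x41100000, 0xC1200000, 0x42640000):
        print(f"{word:#010x} -> {decode_hex_float32(word)}")
```

Because the exponent is a power of 16, normalization only guarantees a nonzero leading hexadecimal digit, so up to three leading fraction bits may be zero.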

One other instruction is also included, as I don't need the whole
six-bit opcode for these instructions: the Save Return Address
instruction. I already had a way of doing this, a jump to subroutine
instruction with relative addressing and a displacement of zero, but
that's 32 bits long.

So with the 25% overhead of the Type VI header... I now have achieved
almost complete isomorphism with a popular make of computer... that is,
the ability to perform the same function as any of its instructions with
an instruction that is, sort of, the same length. Thus simplifying
program conversion.
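
The 25% figure can be checked with a little arithmetic. The sketch below assumes a 256-bit instruction block (the thread only mentions 256-bit fetches); the function and field names are illustrative. It also shows how many of the header's 64 bits the four-bit prefixes would consume under that assumption.

```python
def header_budget(block_bits=256, header_bits=64, unit_bits=16, prefix_bits=4):
    """Account for the bits of a block that starts with a prefixing header.

    block_bits=256 is an assumption; the other defaults come from the post
    (64-bit header, one 4-bit prefix per remaining 16 bits).
    """
    units = (block_bits - header_bits) // unit_bits   # 16-bit units after the header
    prefix_total = units * prefix_bits                # header bits spent on prefixes
    return {
        "overhead": header_bits / block_bits,
        "units": units,
        "prefix_bits": prefix_total,
        "other_header_bits": header_bits - prefix_total,
    }


if __name__ == "__main__":
    print(header_budget())
```

Under these assumptions the header takes a quarter of the block, twelve 16-bit units remain, and their prefixes use 48 of the 64 header bits, leaving 16 for everything else the header has to say.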

That and $5 will buy you a cup of coffee--until you write a compiler
and start counting how many/few instructions you need to perform
applications.

Of course, I'm sure there are a few instructions not covered, but this handles at least the bunch of its ordinary non-privileged instructions.

For the record, My 66000 has a single instruction that has any notion
of privilege. How many do you think you will need ?!?

So from what seemed to be wretched excess that would risk making me dissatisfied with this iteration of the architecture, I have instead
achieved something that makes me very satisfied and reluctant to move
on.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Wed Jun 25 07:31:55 2025

On Tue, 24 Jun 2025 21:57:22 +0000, MitchAlsup1 wrote:

On Tue, 24 Jun 2025 17:19:05 +0000, quadibloc wrote:

So with the 25% overhead of the Type VI header... I now have achieved
almost complete isomorphism with a popular make of computer... that is,
the ability to perform the same function as any of its instructions
with an instruction that is, sort of, the same length. Thus simplifying
program conversion.

That and $5 will buy you a cup of coffee--until you write a compiler and start counting how many/few instructions you need to perform
applications.

Well, once I get the spec frozen, I'll be closer to being able to do such things.

Of course, I'm sure there are a few instructions not covered, but this
handles at least the bunch of its ordinary non-privileged instructions.

For the record, My 66000 has a single instruction that has any notion of privilege. How many do you think you will need ?!?

As I haven't tried to improve or simplify this part of computer
architecture, presumably as many as any typical modern computer
architecture, like z/Architecture or x86.

Incidentally, the sixth header type is now the fourth; this was in order
to give it extra opcode space, so that I could add one bit to the header
with an additional function. If this header type allows me to imitate the instruction set of S/360, I have a bit to imitate an aspect of its
register layout so as to facilitate communication with actual S/360 code running in emulation on the same chip.

John Savard


From John Savard@21:1/5 to All on Fri Jun 27 05:28:47 2025

An additional minor tweak has been added to the architecture.

Since there is a header type (currently type VIII) which allows
full VLIW functionality with variable-length instructions, and
that header type had several spare bits available within it,
one has now been purposed to indicate a modification to the
17-bit short instructions which are available when variable-length
instructions are used.

The architecture includes extended register banks with 128 registers
in each of them, in addition to the regular register banks with 32
registers. VLIW-style code is the kind of code likely to find using
the extended registers to be of benefit; so, now, selecting the
alternate form of the 17-bit instructions changes the normal 17-bit
instructions that work with the normal banks of 32 registers into ones
that work with the extended banks of 128 registers - thus allowing
more threads to be interleaved, thus keeping dependent instructions
further away from each other, just the thing for allowing more
instructions to execute in parallel.

Also added is a colorful diagram making it plain what I am up to
when I modify floating-point register usage with the Emulation bit
in headers of Type IV.

John Savard


From John Savard@21:1/5 to All on Wed Jul 2 05:16:13 2025

I have noted that the amount of opcode space available for headers
had been needlessly compromised because of a faulty ordering of
memory-reference instructions, so I re-ordered them to make the
available space less constricting.

As well, I have had complaints that my use of base-index addressing
as opposed to the plain addressing of actual RISC instruction sets
is not really helpful, since the value placed in the index register
often has to be calculated in any case, and so only one arithmetic
step is saved, not the need for calculation.

Therefore, I have placed a new header in the third position in my
list of headers which is 16 bits long, and which divides the remainder
of the block into five 48-bit areas. These may contain either two
24-bit instructions, which are operate instructions of the RISC type,
or three 15-bit instructions, of my own special short instruction type,
or one such 15-bit instruction, and a 30-bit memory-reference
instruction of the typical RISC type.

Thus not only is the classic RISC style of programming accommodated, if
perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed
in the space of eight.
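
The bit accounting for this header can be sketched as follows. The sizes are the ones given above; the dictionary keys and function names are illustrative, and what the three unused bits in an area are for is not stated in the post.

```python
HEADER_BITS = 16
AREA_BITS = 48
AREAS = 5

# The three ways one 48-bit area may be filled, per the description above.
AREA_LAYOUTS = {
    "two 24-bit operate": (24, 24),
    "three 15-bit short": (15, 15, 15),
    "15-bit short + 30-bit memory-reference": (15, 30),
}


def block_bits() -> int:
    """Total size of a block: the header plus the five areas."""
    return HEADER_BITS + AREAS * AREA_BITS


def spare_bits(layout) -> int:
    """Bits of a 48-bit area left unused by the given instruction lengths."""
    return AREA_BITS - sum(layout)


if __name__ == "__main__":
    print("block size:", block_bits())
    for name, layout in AREA_LAYOUTS.items():
        print(f"{name}: {spare_bits(layout)} spare bits")
    # Five areas of two 24-bit instructions versus plain 32-bit instructions.
    print("instructions per block:", AREAS * 2, "vs", block_bits() // 32)
```

The header and five areas add up to 256 bits, and filling every area with two 24-bit instructions gives the ten-in-the-space-of-eight figure.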

John Savard


From John Savard@21:1/5 to John Savard on Wed Jul 2 07:04:06 2025

On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

Thus not only is the classic RISC style of programming accommodated, if perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed in
the space of eight.

I have found a way to address the imperfections, so this new header now
allows porting RISC code to Concertina II with little conversion.

John Savard


From John Savard@21:1/5 to John Savard on Thu Jul 3 05:52:00 2025

On Wed, 02 Jul 2025 07:04:06 +0000, John Savard wrote:

On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

Thus not only is the classic RISC style of programming accommodated, if
perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed
in the space of eight.

I have found a way to address the imperfections, so this new header now allows porting RISC code to Concertina II with little conversion.

My bizarre version of RISC code, in which memory-reference instructions
are 30 bits long, while operate instructions are 24 bits long, has now
been accompanied by a different instruction format requiring another form
of the new Type III header.

This one involves dividing the block into three parts instead of five, and
each of those three parts may contain three instructions that are 27 bits
long or thereabouts.

I think it will look familiar, and will represent the absolute height of insanity, the temptation to add to Concertina II being too strong for me
to resist.

John Savard


From Thomas Koenig@21:1/5 to Stephen Fuld on Thu Jul 3 06:59:37 2025

Stephen Fuld <[email protected]d> schrieb:

On 7/2/2025 10:52 PM, John Savard wrote:

I think it will look familiar, and will represent the absolute height of
insanity, the temptation to add to Concertina II being too strong for me
to resist.

By my count, you have just under a "gazillion" instructions, instruction formats, etc. :-)

Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit your "insanity".

That is an excellent idea. John, if you write down the Boolean
equations or the truth tables for your instruction decoding, then
try to simplify them with espresso or a tool which does multi-level
logic optimization like Berkeley ABC, you will get a much better
idea of how complicated your design actually is. ABC also does some
delay calculations for you if you map your design to a library.
Highly instructive.
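
As a starting point for that exercise, here is a minimal sketch that writes a truth table in the Berkeley PLA text format that espresso reads. The four-bit prefix patterns and the four length classes are entirely hypothetical, chosen only to show the shape of the file; they are not taken from Concertina II.

```python
def make_pla(rows, n_inputs, n_outputs):
    """Render (input_pattern, output_pattern) rows as Berkeley PLA text.

    Input patterns use '0', '1' and '-' (don't care); output patterns use
    '0' and '1'. Pattern widths are checked against the declared counts.
    """
    lines = [f".i {n_inputs}", f".o {n_outputs}", f".p {len(rows)}"]
    for inputs, outputs in rows:
        if len(inputs) != n_inputs or len(outputs) != n_outputs:
            raise ValueError(f"bad row width: {inputs} {outputs}")
        lines.append(f"{inputs} {outputs}")
    lines.append(".e")
    return "\n".join(lines) + "\n"


# Hypothetical toy decoder: a 4-bit prefix selects one of four length classes
# (one-hot outputs). These patterns are made up for illustration.
TOY_ROWS = [
    ("0---", "1000"),
    ("10--", "0100"),
    ("110-", "0010"),
    ("111-", "0001"),
]

if __name__ == "__main__":
    print(make_pla(TOY_ROWS, 4, 4), end="")
```

Saving the output to a file and running it through espresso gives a minimized two-level cover; a real decoder would have far more inputs and rows, which is where the size of the result becomes informative.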

--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.


From Stephen Fuld@21:1/5 to John Savard on Wed Jul 2 22:57:19 2025

On 7/2/2025 10:52 PM, John Savard wrote:

On Wed, 02 Jul 2025 07:04:06 +0000, John Savard wrote:

On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

Thus not only is the classic RISC style of programming accommodated, if
perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed
in the space of eight.

I have found a way to address the imperfections, so this new header now
allows porting RISC code to Concertina II with little conversion.

My bizarre version of RISC code, in which memory-reference instructions
are 30 bits long, while operate instructions are 24 bits long, has now
been accompanied by a different instruction format requiring another form
of the new Type III header.

This one involves dividing the block into three parts instead of five, and each of those three parts may contain three instructions that are 27 bits long or thereabouts.

I think it will look familiar, and will represent the absolute height of insanity, the temptation to add to Concertina II being too strong for me
to resist.

By my count, you have just under a "gazillion" instructions, instruction formats, etc. :-)

Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit your "insanity".

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From John Savard@21:1/5 to All on Thu Jul 3 11:35:33 2025

Without an explicit indication of parallelism, the real
theoretical maximum is sixteen-way, with 14-bit instructions
in the case without headers, and there's nothing much I can
do to improve the ease of access to that.

John Savard


From John Savard@21:1/5 to Thomas Koenig on Thu Jul 3 09:43:55 2025

On Thu, 03 Jul 2025 06:59:37 +0000, Thomas Koenig wrote:

Stephen Fuld <[email protected]d> schrieb:

By my count, you have just under a "gazillion" instructions, instruction
formats, etc. :-)

Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit
your "insanity".

That is an excellent idea. John, if you write down the Boolean equations or
the truth tables for your instruction decoding, then try to simplify
them with espresso or a tool which does multi-level logic optimization
like Berkeley ABC, you will get a much better idea of how complicated
your design actually is. ABC also does some delay calculations for you
if you map your design to a library. Highly instructive.

In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can
I possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?

John Savard


From John Savard@21:1/5 to John Savard on Thu Jul 3 11:24:31 2025

On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can I possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?

Also, this exercise had a useful consequence. Adding a new instruction
format that made it easy to achieve code that can take advantage of
nine-way superscalar operation led me to review what the rest of the instruction set was doing.

A previous addition made ten-way superscalar operation possible, but
without any explicit indication of parallelism to promote it.

With 17-bit short instructions, fourteen-way superscalar operation can
be called upon without an explicit indication of parallelism; with one,
though, that drops to eleven-way.

But the maximum of 14-way could only be called upon with 14-bit
instructions in the case where the pairs of instructions, at least,
had an explicit indication of parallelism. I had enough opcode space
available so that I was able to improve this to also use 15-bit
instructions, to at least make it slightly more likely that the full superscalar power potentially available in this design could be used.
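
The widths quoted above can be sanity-checked against the block size. This sketch assumes a 256-bit block and pairs each width with an instruction length mentioned in the thread (fourteen 17-bit, ten 24-bit, nine 27-bit); that pairing is a reading of the posts, not something they state outright, and the function name is illustrative.

```python
BLOCK_BITS = 256  # assumed block size; the thread only mentions 256-bit fetches


def leftover_bits(count, instruction_bits, block_bits=BLOCK_BITS):
    """Bits left for headers or prefixes after packing `count` instructions."""
    used = count * instruction_bits
    if used > block_bits:
        raise ValueError(f"{count} x {instruction_bits} bits exceeds {block_bits}")
    return block_bits - used


# (issue width, instruction length) pairs, as read from the thread.
CASES = [(14, 17), (10, 24), (9, 27)]

if __name__ == "__main__":
    for count, bits in CASES:
        print(f"{count} x {bits}-bit: {leftover_bits(count, bits)} bits left over")
```

Under these assumptions, fourteen 17-bit instructions leave 18 bits, ten 24-bit instructions leave 16 bits (the size of the Type III header), and nine 27-bit instructions leave 13 bits.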

John Savard


From John Savard@21:1/5 to Stephen Fuld on Tue Jul 15 22:24:14 2025

On Wed, 02 Jul 2025 22:57:19 -0700, Stephen Fuld wrote:

By my count, you have just under a "gazillion" instructions, instruction formats, etc.

Well, I have now removed the pseudo-RISC and imitation Itanium
instruction formats. Instead, I've added 18-bit short instructions,
which I feel are a more appropriate and effective way of reaching
the goal of improving the ability of programs to make use of the
superscalar potential of implementations of the architecture.

John Savard


From John Savard@21:1/5 to John Savard on Wed Jul 16 00:00:47 2025

On Tue, 15 Jul 2025 22:24:14 +0000, John Savard wrote:

On Wed, 02 Jul 2025 22:57:19 -0700, Stephen Fuld wrote:

By my count, you have just under a "gazillion" instructions, instruction
formats, etc.

Well, I have now removed the pseudo-RISC and imitation Itanium
instruction formats. Instead, I've added 18-bit short instructions,
which I feel are a more appropriate and effective way of reaching the
goal of improving the ability of programs to make use of the superscalar potential of implementations of the architecture.

Also, it might be noted that I did go to some lengths to ensure that all
the instructions in a block could be decoded in parallel, to help deal
with the gate delays involved in instruction decoding. Of course, with
that step pipelined, the delay is only felt once per branch rather than
for every instruction.

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions - and
the 18-bit instructions fell short of my goal in one important respect.
So I think they will be added soon as well.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 01:11:24 2025

On Thu, 3 Jul 2025 11:35:33 +0000, John Savard wrote:

Without an explicit indication of parallelism, the real
theoretical maximum is sixteen-way, with 14-bit instructions
in the case without headers, and there's nothing much I can
do to improve the ease of access to that.

As noted above, I am doing 16-way decode on variable length
instructions with no marking bits.

John Savard


From John Savard@21:1/5 to John Savard on Wed Jul 16 00:22:39 2025

On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions - and
the 18-bit instructions fell short of my goal in one important respect.
So I think they will be added soon as well.

I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition that
I have managed to post it to the pages, specifically:

http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm

John Savard


From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 01:08:19 2025

On Thu, 3 Jul 2025 11:24:31 +0000, John Savard wrote:

On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can I
possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?

Also, this exercise had a useful consequence. Adding a new instruction
format that made it easy to achieve code that can take advantage of
nine-way superscalar operation led me to review what the rest of the instruction set was doing.

A previous addition made ten-way superscalar operation possible, but
without any explicit indication of parallelism to promote it.

With 17-bit short instructions, fourteen-way superscalar operation can
be called upon without an explicit indication of parallelism; with one, though, that drops to eleven-way.

We cannot believe that until you produce a compiler.

OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction
boundary--AND--I have compiler, linker, ...

Nor do I have a zillion instructions, I only have 63 patterns to
recognize.

But the maximum of 14-way could only be called upon with 14-bit
instructions in the case where the pairs of instructions, at least,
had an explicit indication of parallelism. I had enough opcode space available so that I was able to improve this to also use 15-bit
instructions, to at least make it slightly more likely that the full superscalar power potentially available in this design could be used.

John Savard


From John Savard@21:1/5 to John Savard on Wed Jul 16 01:49:50 2025

On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions -
and the 18-bit instructions fell short of my goal in one important
respect. So I think they will be added soon as well.

I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition
that I have managed to post it to the pages, specifically:

http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm

But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
definitely symptomatic of the issue you've identified.

John Savard


From John Savard@21:1/5 to John Savard on Wed Jul 16 12:26:29 2025

On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions - and
the 18-bit instructions fell short of my goal in one important respect.
So I think they will be added soon as well.

Not only have I done that, but now, by dint of a great effort, I've
squeezed out enough opcode space to get the header format I _really_
wanted that allows 14-way superscalar operation when only instructions
of the alternate 17-bit short instruction format are needed in a given
block.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 18:06:59 2025

On Wed, 16 Jul 2025 1:49:50 +0000, John Savard wrote:

On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions -
and the 18-bit instructions fell short of my goal in one important
respect. So I think they will be added soon as well.

I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition
that I have managed to post it to the pages, specifically:

http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm

But having 14, 15, 16, 17, 18 and 19 bit long short instructions is definitely symptomatic of the issue you've identified.

John Savard

Conversing with yourself or arguing with yourself ?? !!


From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jul 16 18:09:30 2025

On Thu, 3 Jul 2025 6:59:37 +0000, Thomas Koenig wrote:

Stephen Fuld <[email protected]d> schrieb:

On 7/2/2025 10:52 PM, John Savard wrote:

I think it will look familiar, and will represent the absolute height of
insanity, the temptation to add to Concertina II being too strong for me
to resist.

By my count, you have just under a "gazillion" instructions, instruction
formats, etc. :-)

Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit your
"insanity".

That is an excellent idea. John, if you write down the Boolean
equations or the truth tables for your instruction decoding, then

About 25 lines of Verilog for My 66000 (the logic not the data structure definitions)

try to simplify them with espresso or a tool which does multi-level
logic optimization like Berkeley ABC, you will get a much better
idea of how complicated your design actually is. ABC also does some
delay calculations for you if you map your design to a library.
Highly instructive.


From Thomas Koenig@21:1/5 to John Savard on Wed Jul 16 17:33:41 2025

John Savard <[email protected]d> schrieb:

On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions -
and the 18-bit instructions fell short of my goal in one important
respect. So I think they will be added soon as well.

I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition
that I have managed to post it to the pages, specifically:

http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm

But having 14, 15, 16, 17, 18 and 19 bit long short instructions is definitely symptomatic of the issue you've identified.

Who is "you"? You didn't quote anybody else but yourself in there.

I'm a bit mystified...
--
This USENET posting was made without artificial intelligence,
artificial impertinence, artificial arrogance, artificial stupidity,
artificial flavorings or artificial colorants.


From MitchAlsup1@21:1/5 to All on Wed Jul 16 18:22:10 2025

On Wed, 16 Jul 2025 1:08:19 +0000, MitchAlsup1 wrote:

On Thu, 3 Jul 2025 11:24:31 +0000, John Savard wrote:

On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can I
possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?

Also, this exercise had a useful consequence. Adding a new instruction
format that made it easy to achieve code that can take advantage of
nine-way superscalar operation led me to review what the rest of the
instruction set was doing.

A previous addition made ten-way superscalar operation possible, but
without any explicit indication of parallelism to promote it.

With 17-bit short instructions, fourteen-way superscalar operation can
be called upon without an explicit indication of parallelism; with one,
though, that drops to eleven-way.

We cannot believe that until you produce a compiler.

OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction boundary--AND--I have compiler, linker, ...

Nor do I have a zillion instructions, I only have 63 patterns to
recognize.

I should expand on this::
There are 4 groups of instructions where we use the top 2-bits of
the major OpCode::

00 OpCode extensions
01 Control transfer group
10 Memory reference with 16-bit displacement
11 Calculation with 16-bit immediate

Of these::

000 Predication and shifts of constant (saves imm16 space 12-bits imm)
001 is the only group that has variable length
010 is LOOP
011 is conventional branch
100 is LDs with disp16
101 is STs with disp16
110 is integer with imm16
111 is logical with imm16

All very RISC-like at this point. In the VLE group; Inst<15:13,11>
provide all the bits for operand routing and for VLE instruction
length--they are mashed up together to reduce entropy.
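
A minimal sketch of that first-level classification, written as a table lookup. It assumes a 32-bit instruction word with a 6-bit major OpCode in bits 31..26, which the post does not state; the group descriptions paraphrase the list above and the function name is illustrative.

```python
# Top three bits of the major OpCode -> group, following the list above.
GROUPS = {
    0b000: "predication and shifts of constant",
    0b001: "variable-length (VLE) group",
    0b010: "LOOP",
    0b011: "conventional branch",
    0b100: "loads with 16-bit displacement",
    0b101: "stores with 16-bit displacement",
    0b110: "integer with 16-bit immediate",
    0b111: "logical with 16-bit immediate",
}


def major_group(inst: int) -> str:
    """Classify a 32-bit instruction word by the top 3 bits of its major OpCode.

    Assumption: the major OpCode is the 6-bit field in bits 31..26.
    """
    major = (inst >> 26) & 0x3F
    return GROUPS[major >> 3]


if __name__ == "__main__":
    for top in range(8):
        word = top << 29  # place the 3-bit pattern in bits 31..29
        print(f"{top:03b}: {major_group(word)}")
```

Only the 001 group needs further inspection to find the instruction length, which is what keeps the decode narrow despite the variable-length encoding.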

But the maximum of 14-way could only be called upon with 14-bit
instructions in the case where the pairs of instructions, at least,
had an explicit indication of parallelism. I had enough opcode space
available so that I was able to improve this to also use 15-bit
instructions, to at least make it slightly more likely that the full
superscalar power potentially available in this design could be used.

John Savard


From John Savard@21:1/5 to Thomas Koenig on Wed Jul 16 23:24:52 2025

On Wed, 16 Jul 2025 17:33:41 +0000, Thomas Koenig wrote:

John Savard <[email protected]d> schrieb:

But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
definitely symptomatic of the issue you've identified.

Who is "you"? You didn't quote anybody else but yourself in there.

Whoever noted I had zillions of instructions and instruction formats.

John Savard


From John Savard@21:1/5 to Thomas Koenig on Wed Jul 16 23:26:53 2025

On Wed, 16 Jul 2025 17:33:41 +0000, Thomas Koenig wrote:

John Savard <[email protected]d> schrieb:

But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
definitely symptomatic of the issue you've identified.

Who is "you"? You didn't quote anybody else but yourself in there.

Stephen Fuld.

John Savard


From John Savard@21:1/5 to All on Wed Jul 16 23:36:11 2025

On Wed, 16 Jul 2025 01:08:19 +0000, MitchAlsup1 wrote:

OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction boundary--AND--I have compiler, linker, ...

Nor do I have a zillion instructions, I only have 63 patterns to
recognize.

I certainly acknowledge that I'm not as good as you at this sort
of thing.

Theoretically, because the architecture involves separate banks of floating-point and integer registers, and there are both regular banks
with 32 registers, and extended banks with 128 registers, and
instruction formats that divide these registers into eight-register
groups (sort of like a register window, but not quite)... if it weren't
for the fact that I envisage only fetching 256 bits from memory in any
given cycle (of course, within loops, one can get instructions from
internal cache) this theoretically allows for 40-way superscalar
operation.
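
One way to arrive at the figure of 40, sketched below: count the eight-register groups across both kinds of register bank. The post does not spell out the derivation, so treating each group as one independent lane is an assumption, and the function name is illustrative.

```python
def register_groups(regular=32, extended=128, group=8, kinds=2):
    """Count eight-register groups across all banks.

    `kinds` is 2 for the separate integer and floating-point banks; each
    kind has one regular bank and one extended bank.
    """
    per_kind = regular // group + extended // group   # 4 + 16 groups
    return kinds * per_kind


if __name__ == "__main__":
    print(register_groups())
```

With the bank sizes given above this yields 2 x (4 + 16) = 40 groups.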

In practice, I doubt that anyone would write code, even carefully by
hand, that would even manage 14-way superscalar operation for very
long, so I admit it's unlikely to be terribly useful to include this in
most implementations.

The ISA is designed, though, so that (except for its immense bloat) it
could be used in a special-purpose CPU without OoO that's designed for
some kind of embedded use in, say, image processing or something where
it could be given that kind of specialized code to run.

A CPU designed instead for use in a desktop workstation would presumably
have microarchitectural capabilities appropriate to that application.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 23:58:41 2025

On Wed, 16 Jul 2025 23:36:11 +0000, John Savard wrote:

On Wed, 16 Jul 2025 01:08:19 +0000, MitchAlsup1 wrote:

OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction
boundary--AND--I have compiler, linker, ...

Nor do I have a zillion instructions, I only have 63 patterns to
recognize.

I certainly acknowledge that I'm not as good as you at this sort
of thing.

Theoretically, because the architecture involves separate banks of
floating-point and integer registers, and there are both regular banks
with 32 registers, and extended banks with 128 registers, and

Point of order:: all register files that have the same width (64-bits)
should be a single file. This makes varargs easier and allows using
integer operations on FP operands (extract exponent, insert exponent,
copysign), which are mandated by the standards. Either you have an
integer set of registers and an FP set of registers and a nearly
complete set of integer operations on FP registers, or you can dispense
with the nonsense and have a single general-purpose register file.
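
To make the "integer operations on FP operands" point concrete, here is a minimal Python sketch that performs extract exponent, insert exponent, and copysign purely with shifts and masks on the 64-bit pattern of an IEEE 754 double. The helper names are invented for this illustration and do not come from any ISA discussed in the thread.

```python
import struct

SIGN_MASK = 1 << 63          # bit 63: sign
EXP_MASK = 0x7FF << 52       # bits 62..52: biased exponent

def f64_bits(x: float) -> int:
    """Reinterpret a double as a 64-bit unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_f64(b: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a double."""
    return struct.unpack("<d", struct.pack("<Q", b & 0xFFFFFFFFFFFFFFFF))[0]

def extract_exponent(x: float) -> int:
    """Biased 11-bit exponent field, using only mask and shift."""
    return (f64_bits(x) & EXP_MASK) >> 52

def insert_exponent(x: float, biased_exp: int) -> float:
    """Replace the exponent field of x, keeping sign and fraction."""
    cleared = f64_bits(x) & ~EXP_MASK
    return bits_f64(cleared | ((biased_exp & 0x7FF) << 52))

def copysign(magnitude: float, sign_source: float) -> float:
    """Magnitude of the first operand with the sign bit of the second."""
    return bits_f64((f64_bits(magnitude) & ~SIGN_MASK)
                    | (f64_bits(sign_source) & SIGN_MASK))
```

With a single register file these are ordinary integer instructions applied to a register that happens to hold a floating-point value; with split files each one needs either a dedicated FP-side instruction or a move between files.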

I have evidence (data) indicating My 66000 with only 32-registers
AND universal constants needs fewer registers than RISC-V with
32 integer and 32 FP registers on many applications, including
those you think need 32+32.

instruction formats that divide these registers into eight-register
groups (sort of like a register window, but not quite)... if it weren't
for the fact that I envisage only fetching 256 bits from memory in any
given cycle (of course, within loops, one can get instructions from
internal cache) this theoretically allows for 40-way superscalar
operation.

I have always been a cynic about this partition.

In practice, I doubt that anyone would write code, even carefully by
hand, that would even manage 14-way superscalar operation for very
long, so I admit it's unlikely to be terribly useful to include this in
most implementations.

That is why VLIW is failing or has failed.

The ISA is designed, though, so that (except for its immense bloat) it
could be used in a special-purpose CPU without OoO that's designed for
some kind of embedded use in, say, image processing or something where
it could be given that kind of specialized code to run.

LoL.

A CPU designed instead for use in a desktop workstation would presumably
have microarchitectural capabilities appropriate to that application.

Like fast context switches, which multiple register files PREVENT !!

John Savard

From John Savard@21:1/5 to All on Thu Jul 17 03:42:08 2025

On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

Point of order:: all register files that have the same width (64-bits)
should be a single file.

This relates to a point that occurred to me.

Many CISC microprocessors had register banks of eight registers.

RISC had register banks of 32 registers, which they thought would
avoid the need for OoO. Increasing performance demands, though,
made that no longer true.

Well, then, the extended register banks with 128 registers in them...

are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).

Like fast context switches, which multiple register files PREVENT !!

You could have an operating system that neglects to save certain register
files on interrupts, which means programs can't use them. (There's a
precedent: the Commodore 64 didn't save the status bit for decimal mode,
so user programs couldn't use that feature of the 6502.)

John Savard

From John Savard@21:1/5 to John Savard on Thu Jul 17 04:01:44 2025

On Thu, 17 Jul 2025 03:42:08 +0000, John Savard wrote:

On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

Point of order:: all register files that have the same width (64-bits)
should be a single file.

This relates to a point that occurred to me.

Many CISC microprocessors had register banks of eight registers.

RISC had register banks of 32 registers, which they thought would avoid
the need for OoO. Increasing performance demands, though,
made that no longer true.

Well, then, the extended register banks with 128 registers in them...

are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).

And there are other things related to this that have occurred to me.

You've used the term "GBOoO" - Great Big out-of-order - to describe
the current offerings of companies like Intel and AMD.

You hadn't formally defined the term, at least not in any post that
I've noticed. For purposes of discussion below, I'm going to provide
a definition which may not correspond to what you were intending.

This definition is:

In "normal" out-of-order, each register has three rename registers
associated with it, for a total of 4.

In "big" out-of-order, each register has fifteen rename registers
associated with it, for a total of 16.

In "great big" out of order, each register has sixty-three rename
registers associated with it, for a total of 64.

With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.

In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO. Since the CISC chips had register files of 8
registers, and the RISC chips had register files of 32 registers,
the two were equivalent in performance. (Given cache misses,
maybe the RISC chips still needed scoreboards.)

And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.

This is the stage we would be at when I say that in my design,
the 32-register normal register banks would have OoO, but the
128-register extended register banks wouldn't.

If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO, by the definitions I've
given above, then my design, to keep up, would have to provide
"big" OoO for the normal register files, but only "normal" OoO
for the extended register files.

John Savard

From John Savard@21:1/5 to John Savard on Thu Jul 17 04:10:18 2025

On Thu, 17 Jul 2025 04:01:44 +0000, John Savard wrote:

If things have progressed further, so that RISC chips have "big" OoO and
CISC chips have "great big" OoO, by the definitions I've given above,
then my design, to keep up, would have to provide "big" OoO for the
normal register files, but only "normal" OoO for the extended register
files.

I think that a clarification is in order here.

I don't know, but I strongly suspect, that what you term
"Great Big out of order" is what I've called just "big"
out-of-order, and that this is already well past the point
of diminishing returns.

So that what I've called "great big" out of order is
instead something so far past the point of sanity that
I don't have to worry about it happening in real life.

But I could be wrong.

John Savard

From John Savard@21:1/5 to John Savard on Thu Jul 17 04:24:18 2025

On Thu, 17 Jul 2025 04:10:18 +0000, John Savard wrote:

I think that a clarification is in order here.

And, come to think of it, _another_ clarification may be needed.

If a register file with "normal" out-of-order, three rename
registers for each register, and 32 registers to a register bank, is

"equivalent" in performance to a register file with 128 registers and
no OoO support,

then the latter provides no performance benefit, so what is it there
for?

Someone might ask that who wasn't following my discussion of the
Concertina II design. So I think I had better re-iterate the point:

What the 128-register extended register files are _for_ is to
provide better performance on implementations that don't have OoO
at all, for any of the registers. They're also present on
implementations with OoO for *compatibility* reasons.

This is what may not be clear to some.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 14:59:27 2025

On Thu, 17 Jul 2025 4:01:44 +0000, John Savard wrote:

On Thu, 17 Jul 2025 03:42:08 +0000, John Savard wrote:

On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

Point of order:: all register files that have the same width (64-bits)
should be a single file.

This relates to a point that occurred to me.

<snip>

And there are other things related to this that have occurred to me.

You've used the term "GBOoO" - Great Big out-of-order - to describe
the current offerings of companies like Intel and AMD.

You hadn't formally defined the term, at least not in any post that
I've noticed. For purposes of discussion below, I'm going to provide
a definition which may not correspond to what you were intending.

This definition is:

In "normal" out-of-order, each register has three rename registers
associated with it, for a total of 4.

The in-order Mc 88110 had 2 rename registers per register.

In "big" out-of-order, each register has fifteen rename registers
associated with it, for a total of 16.

In "great big" out of order, each register has sixty-three rename
registers associated with it, for a total of 64.

Mc 88120 has 96 rename registers and 32 architectural registers
in a pool of 128. Rk could be renamed 96 times--so, if you wrote
code using a single register, it could still fill the execution
window with work.

This is what I mean by GBOoO::
Instructions can be issued out of DECODE into instructions queues
until an instruction queue becomes full and another instruction
needs that queue;
AND
The register rename pool continues to have register renames available.

So DECODE does not stall until it runs out of register renames and
still has instruction queue entries to capture issued instructions.

This was the Mc 88120 design point by 1992.
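
The stall rule described above can be sketched as a single predicate. This is only an illustration under assumed names (`queue_free_slots`, `target_queue`, and `free_renames` are hypothetical, not taken from the Mc 88120 design): DECODE stops only when the queue the next instruction needs is full, or when the shared rename pool is empty.

```python
def decode_must_stall(queue_free_slots, target_queue, free_renames,
                      writes_register=True):
    """Return True when DECODE cannot issue the next instruction.

    queue_free_slots: dict mapping queue name -> free entries (hypothetical)
    target_queue:     the queue the next instruction needs
    free_renames:     rename registers still available in the shared pool
    writes_register:  False for instructions with no destination register
    """
    if queue_free_slots[target_queue] == 0:
        return True    # the queue this instruction needs is full
    if writes_register and free_renames == 0:
        return True    # no rename register left for the destination
    return False
```

Note that a full queue does not stall anything by itself in this sketch; it only matters once an instruction actually needs that queue.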

With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.

In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO. Since the CISC chips had register files of 8
registers, and the RISC chips had register files of 32 registers,
the two were equivalent in performance. (Given cache misses,
maybe the RISC chips still needed scoreboards.)

I think it is fairer to say that with big compiler programmable
register files, CISC needed GBOoO before RISC did.

Secondarily, CISC was supported by big-$$$ cash flow and could
afford the silicon costs and design team costs before RISCs
could.

And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.

As noted above, Mc 88120 was GBOoO in design in 1992 targeting
1995 for production. The PowerPC distraction took not just 88K
down, but Moto too.

This is the stage we would be at when I say that in my design,
the 32-register normal register banks would have OoO, but the
128-register extended register banks wouldn't.

If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO, by the definitions I've
given above, then my design, to keep up, would have to provide
"big" OoO for the normal register files, but only "normal" OoO
for the extended register files.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 15:02:15 2025

On Thu, 17 Jul 2025 4:10:18 +0000, John Savard wrote:

On Thu, 17 Jul 2025 04:01:44 +0000, John Savard wrote:

If things have progressed further, so that RISC chips have "big" OoO and
CISC chips have "great big" OoO, by the definitions I've given above,
then my design, to keep up, would have to provide "big" OoO for the
normal register files, but only "normal" OoO for the extended register
files.

I think that a clarification is in order here.

I don't know, but I strongly suspect, that what you term
"Great Big out of order" is what I've called just "big"
out-of-order, and that this is already well past the point
of diminishing returns.

So that what I've called "great big" out of order is
instead something so far past the point of sanity that
I don't have to worry about it happening in real life.

Another way to look at it is::

Once the instruction queueing mechanism gets big enough that
instruction scheduling is no longer needed, the compiler's
job is simply to produce the fewest number of instructions
that performs the semantic duties at hand.

But I could be wrong.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 14:49:16 2025

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

Point of order:: all register files that have the same width (64-bits)
should be a single file.

This relates to a point that occurred to me.

Many CISC microprocessors had register banks of eight registers.

RISC had register banks of 32 registers, which they thought would
avoid the need for OoO. Increasing performance demands, though,
made that no longer true.

Somewhat true, but basically untrue--from a 1983 perspective::

Some RISC marketeers stated that the big RF was like a programmable
cache--even though it was not.

We the designers and architects know OoO was waiting until enough
transistors could be had.

Well, then, the extended register banks with 128 registers in them...

AMD 29K ?!?

are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).

Somewhat right:: given a pool of 128 (or 256) rename registers,
one can make even an 8-register machine run fast and rather
well. The thing is that a 32-register ISA has a 15%-18% speed
advantage over an 8-register machine--whereas a 64 register
machine only has a 3% speed advantage over a 32 register
machine (MIPS circa 1982). At some point not having enough
registers hurts (that somewhere is in the mid-20s of registers)
and at some point the size of the RF limits read performance
(that somewhere is between 32 registers and 64 registers).

1st generation RISC machines would read RF in ½ cycle and
perform forwarding in that same decode (1) cycle. As machines
got faster and as renaming took hold reading RF became 1 cycle
and forwarding became another 1 cycle (around 2 GHz).

Like fast context switches, which multiple register files PREVENT !!

You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode, so user programs couldn't use that feature of the 6502.)

Any system that is not completely transparent to interrupts is
a pain in the a$$ for user applications.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 15:16:59 2025

On Thu, 17 Jul 2025 4:24:18 +0000, John Savard wrote:

On Thu, 17 Jul 2025 04:10:18 +0000, John Savard wrote:

I think that a clarification is in order here.

And, come to think of it, _another_ clarification may be needed.

If a register file with "normal" out-of-order, three rename
registers for each register, and 32 registers to a register bank, is

"equivalent" in performance to a register file with 128 registers and
no OoO support,

then the latter provides no performance benefit, so what is it there
for?

That is NOT HOW ONE RENAMES !!!!!!!

One has a pool of rename registers. DECODE has a demand for rename
registers (Rd), and retire has a supply of rename registers (Write RD
into RF), and ANY rename register can stand in for ANY architectural
register. As long as the pool is not depleted, everything runs smoothly.
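
A minimal sketch of the pool model just described, in Python. It assumes the organization in the paragraph above (a shared pool of rename registers, allocated at DECODE and returned when the result is written to the register file at retire); the class and method names are made up for this example.

```python
from collections import deque

class RenamePool:
    def __init__(self, num_arch: int, num_rename: int):
        # Every rename register starts out free; none is tied to a
        # particular architectural register.
        self.free = deque(range(num_rename))
        # Newest in-flight rename register per architectural register,
        # or None when the value should be read from the register file.
        self.latest = [None] * num_arch

    def decode(self, rd: int):
        """Demand side: allocate a rename register for destination rd.
        Returns the tag, or None if the pool is depleted (DECODE stalls)."""
        if not self.free:
            return None
        tag = self.free.popleft()
        self.latest[rd] = tag
        return tag

    def retire(self, rd: int, tag: int):
        """Supply side: the result has been written to the register file,
        so the rename register goes back to the pool."""
        if self.latest[rd] == tag:
            self.latest[rd] = None
        self.free.append(tag)
```

Usage, with the 32 architectural and 96 rename registers quoted for the Mc 88120 earlier in the thread: `RenamePool(32, 96)` lets `decode(5)` succeed 96 times in a row for the same register before returning `None`, and a tag returned by `retire` can then serve any other register.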

Someone might ask that who wasn't following my discussion of the
Concertina II design. So I think I had better re-iterate the point:

What the 128-register extended register files are _for_ is to
provide better performance on implementations that don't have OoO
at all, for any of the registers. They're also present on
implementations with OoO for *compatibility* reasons.

Registers   CPU time   Reg access
    8         1.30        1/4
   16         1.15        1/3
   32         1.00        2/4
   64         0.97        3/4
  128         0.96        4/4

32 is the knee of the curve. HW always wants to operate at the knee
of the curve (Bill Moyer 1982). If you make the Register file as
big as the cache it will take just as long as the cache to access
(Andy Glew circa 1995).
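
To see where the knee sits, the CPU-time column above can be turned into the fraction of time saved by each doubling of the register count. This is plain arithmetic on the figures quoted in the post; the function name is made up for this example.

```python
def time_saved_per_doubling(cpu_time):
    """Return (smaller, larger, fraction_of_time_saved) for each
    adjacent pair of register-file sizes."""
    sizes = sorted(cpu_time)
    return [(small, big, 1 - cpu_time[big] / cpu_time[small])
            for small, big in zip(sizes, sizes[1:])]

# Relative CPU times quoted in the post (32 registers = 1.00).
quoted = {8: 1.30, 16: 1.15, 32: 1.00, 64: 0.97, 128: 0.96}

for small, big, saved in time_saved_per_doubling(quoted):
    print(f"{small:3d} -> {big:3d} registers: {saved:.1%} less CPU time")
```

Going from 16 to 32 registers saves roughly 13% of CPU time by these figures, while going from 32 to 64 saves 3% and from 64 to 128 about 1%, before any cost from slower register access is counted.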

The only good arguments I have heard wrt big architectural register
files have to do with things like Register-Windows and/or optimizing
the CALL/RET interface.

BUT (the big but) adding cycles to the pipeline degrades performance
for ALL instructions, not just the ones that use registers 32..128 !!
{{like doubling the size of L2 and adding 1 cycle of added latency
ends up running slower 50% of the time--choose your L2 latency with
care (Przybylski).}}

This is what may not be clear to some.

Some == You

John Savard

From Anton Ertl@21:1/5 to John Savard on Thu Jul 17 16:28:56 2025

John Savard writes:

With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.

In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO.

Very mythical.

No VAX (*the* CISC at the time of RISC development) implementation had
OoO, ever.

No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
during the first 10 years of IA-32's existence.

HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
one day after IA-32.

MIPS (RISC) had an OoO implementation since January 1996, i.e., two
months after IA-32.

Since the CISC chips had register files of 8
registers,

VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

and the RISC chips had register files of 32 registers,

ARM A32 has 16 registers.

(Given cache misses,
maybe the RISC chips still needed scoreboards.)

Apart from MIPS R2000/R3000, every RISC has waited for results to
become ready. In the beginning stopping the pipeline was a way to do
this, but once more silicon became available and performance demands
increased, other instructions often continued as far as possible.
That was often called scoreboarding, but Mitch Alsup tells us that
scoreboarding on the 66000 was something more sophisticated.

And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.

It would be an interesting task to unearth some PA-8000 or R10000
machine, and use Henry Wong's (IIRC) methodology to determine the
reorder buffer size and the renaming capacity of these chips. For the
Coppermine Pentium III the reorder buffer size is 40 entries.

If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO

You can find my collection of sizes of OoO resources on

https://www.complang.tuwien.ac.at/anton/robsize/

If we look in the year 2020, we see the following number of physical
GPRs:

192 Zen3 (AMD64, CISC)
280 Sunny/Willow Cove (AMD64, CISC)
354 Apple M1 Firestorm (ARM A64, RISC)
192 Samsung M5 (ARM A64, RISC)

So there does not seem to be a systematic difference in the number of
physical GPRs between CISC and RISC in recent years.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Savard@21:1/5 to All on Thu Jul 17 18:03:25 2025

On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode,
so user programs couldn't use that feature of the 6502.)

Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.

In no way am I denying this.

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!

An ISA can be implemented in different ways, and it can be used in
different ways.

John Savard

From Anton Ertl@21:1/5 to John Savard on Thu Jul 17 17:06:50 2025

John Savard writes:

You could have an operating system that neglects to save certain register
files on interrupts, which means programs can't use them. (There's a
precedent: the Commodore 64 didn't save the status bit for decimal mode,
so user programs couldn't use that feature of the 6502.)

That's bullshit. The 6502/6510 saves the P register, which contains
the decimal bit on an interrupt, and RTI restores it. What is
apparently necessary is that the interrupt handler clears (or sets)
the decimal flag if it uses adc or sbc. But given that some people
have actually made use of decimal mode <https://www.lemon64.com/forum/viewtopic.php?p=63214&sid=4146454cf46e58e74dff37b0a5f5f6b4#p63214>,
the interrupt handlers by Commodore obviously do that correctly.
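
The save/restore behaviour described above can be modelled in a few lines of Python. This is a deliberately simplified sketch, not an emulator: the program counter is pushed as a single value rather than two bytes, the class name and the handler address are hypothetical, and it assumes the original NMOS behaviour in which interrupt entry leaves the decimal flag unchanged.

```python
FLAG_I = 0x04  # interrupt disable
FLAG_D = 0x08  # decimal mode
FLAG_B = 0x10  # break (only meaningful in the pushed copy of P)
FLAG_U = 0x20  # unused bit, always reads as 1

class Mini6502:
    def __init__(self):
        self.p = FLAG_U
        self.pc = 0x0000
        self.stack = []

    def sed(self):
        self.p |= FLAG_D           # SED: enter decimal mode

    def cld(self):
        self.p &= ~FLAG_D          # CLD: leave decimal mode

    def irq(self, handler_address):
        # Hardware interrupt entry: push PC and P, then mask interrupts.
        self.stack.append(self.pc)
        self.stack.append(self.p & ~FLAG_B)
        self.p |= FLAG_I
        self.pc = handler_address  # D is not touched here (NMOS assumption)

    def rti(self):
        # RTI pulls P first, then PC, restoring the interrupted state.
        self.p = (self.stack.pop() | FLAG_U) & ~FLAG_B
        self.pc = self.stack.pop()
```

A user program that has executed SED keeps its decimal flag across the interrupt: the handler may run CLD before using ADC or SBC, and RTI then brings back the saved P with D still set.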

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From John Savard@21:1/5 to John Savard on Thu Jul 17 18:27:26 2025

On Thu, 17 Jul 2025 18:03:25 +0000, John Savard wrote:

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files.

And speaking of ignoring features:

You've noted that it doesn't really make sense to include Cray-style
vector capabilities on a modern microprocessor, since they require
large memory bandwidth to be effective.

Since I'm designing the Concertina II ISA to support every need, I
do include such capabilities in my Long Vector instructions.

A vector register, in my architecture, consists of storage for
64 double-precision floating-point numbers. Those are certainly a
big pain to save and restore.

So what I envisage is, even in implementations of the ISA which don't
subset out the long vector feature, there will only be one set of
long vector registers in any core, and the operating system will
disable the feature by default. Programs will have to request use of
the long vector capability, and those programs will, basically, run
in batch mode (they can still communicate with the user, if they take
over the computer, e.g. as a video game) - just one such program runs
at a time, to reduce the need for saving and restoring all those
registers.

The ISA will include a number of features with the characteristic that
they are capable of improving performance when used appropriately - and
that they will massively degrade performance if used badly.

I consider that sort of thing the programmer's problem. I'm looking more
at designing an ISA that can be used to make supercomputers than an ISA
that is idiot proof. Particularly as an old saying claims that doing the
latter is a losing battle: they'll always design better idiots.

John Savard

From Stephen Fuld@21:1/5 to John Savard on Thu Jul 17 12:20:18 2025

On 7/17/2025 11:03 AM, John Savard wrote:

On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode,
so user programs couldn't use that feature of the 6502.)

Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.

In no way am I denying this.

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!

John,
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

An ISA can be implemented in different ways, and it can be used in
different ways.

Of course.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jul 17 19:21:48 2025

On Thu, 17 Jul 2025 16:28:56 +0000, Anton Ertl wrote:

John Savard writes:

With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.

In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO.

Very mythical.

No VAX (*the* CISC at the time of RISC development) implementation had
OoO, ever.

No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
during the first 10 years of IA-32's existence.

HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
one day after IA-32.

MIPS (RISC) had an OoO implementation since January 1996, i.e., two
months after IA-32.

Since the CISC chips had register files of 8
registers,

VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

and the RISC chips had register files of 32 registers,

ARM A32 has 16 registers.

(Given cache misses,
maybe the RISC chips still needed scoreboards.)

Apart from MIPS R2000/R3000, every RISC has waited for results to
become ready. In the beginning stopping the pipeline was a way to do
this, but once more silicon became available and performance demands
increased, other instructions often continued as far as possible.
That was often called scoreboarding, but Mitch Alsup tells us that
scoreboarding on the 66000 was something more sophisticated.

CDC 6600 (no third 0) had a scoreboard (SB) that scheduled the start
of an instruction (at register read) and then later scheduled the
completion of an instruction (register write) in a way that prevented
RAW, WAR, and WAW hazards.

Renaming gets rid of the xAW hazards in SBs, Reservation Stations,
Dispatch stacks, and others.
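
A small Python sketch of the three hazard classes, and of how renaming removes the two write-after hazards while leaving true dependences in place. Instructions are modelled as `(dest, sources)` tuples and the rename step assumes an unlimited supply of physical registers; both are simplifications chosen for this illustration.

```python
def hazards(first, second):
    """Hazards that force `second` to stay ordered after `first`.
    Each instruction is (dest_register, tuple_of_source_registers)."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")   # second reads what first writes
    if d2 in s1:
        found.add("WAR")   # second overwrites what first still reads
    if d1 == d2:
        found.add("WAW")   # both write the same register
    return found

def rename(program, num_arch):
    """Give every destination a fresh physical register number."""
    mapping = {r: r for r in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings
        mapping[dest] = next_phys                   # then remap the dest
        renamed.append((next_phys, new_srcs))
        next_phys += 1
    return renamed

program = [
    (1, (2, 3)),   # i0: r1 = r2 op r3
    (2, (4, 5)),   # i1: r2 = r4 op r5   (WAR against i0)
    (1, (6, 7)),   # i2: r1 = r6 op r7   (WAW against i0)
    (8, (1,)),     # i3: r8 = f(r1)      (RAW on i2)
]
renamed = rename(program, num_arch=32)
```

After renaming, i1 and i2 no longer conflict with i0 at all, while i3 still depends on i2 through a RAW hazard, which is the only kind that reflects real data flow.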

And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.

It would be an interesting task to unearth some PA-8000 or R10000
machine, and use Henry Wong's (IIRC) methodology to determine the
reorder buffer size and the renaming capacity of these chips. For the
Coppermine Pentium III the reorder buffer size is 40 entries.

If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO

You can find my collection of sizes of OoO resources on

https://www.complang.tuwien.ac.at/anton/robsize/

If we look in the year 2020, we see the following number of physical
GPRs:

192 Zen3 (AMD64, CISC)
280 Sunny/Willow Cove (AMD64, CISC)
354 Apple M1 Firestorm (ARM A64, RISC)
192 Samsung M5 (ARM A64, RISC)

So there does not seem to be a systematic difference in the number of
physical GPRs between CISC and RISC in recent years.

Agreed, you add physical registers and execution window space until
you run out of a) area, or b) power, or c) (less likely) pipeline
stages.

- anton

From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jul 17 19:52:05 2025

On Thu, 17 Jul 2025 19:20:18 +0000, Stephen Fuld wrote:

On 7/17/2025 11:03 AM, John Savard wrote:

On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode,
so user programs couldn't use that feature of the 6502.)

Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.

In no way am I denying this.

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!

John,
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

While waiting for John's response::

Combined registers WITH universal constants is better* than separate
register files WITHOUT universal constants.

(*) maybe "not worse than on average" is more accurate than "better
than on average". One can find cases that go either way.

History:: Circa 1999-2004 K9 project at AMD, we discovered that the
x86-64 16 register file, the x87 Register File, and the (then) MMX
(soon to be) AVX register file could all be serviced by a single
rename register pool. Which is what we did. THEN to deal with "all
those FUs" we designed the data path as a short path {integer, logical,
AGEN, Shift} and a long path {IMUL, IDIV, all multicycle FP, all SIMD}.
The long path was 1 cycle farther down the pipeline (wire delay) so
we had those FUs send the result tags 1-cycle earlier, and had 2-stages
of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
same reason you can put any kind of data in the data cache !!! The
unified renamer ran out of registers a LOT less often than a partitioned
renamer.
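
The unified-versus-partitioned point can be illustrated with a toy check in Python. The 160 and 4 x 40 figures come from the post; the in-flight demand mixes are invented for the example, and the function names are hypothetical.

```python
def partitioned_runs_out(demand, per_class_limit):
    """A partitioned renamer stalls as soon as ANY class exceeds its share."""
    return any(count > per_class_limit for count in demand.values())

def unified_runs_out(demand, pool_size):
    """A unified renamer stalls only when the TOTAL demand exceeds the pool."""
    return sum(demand.values()) > pool_size

# Hypothetical in-flight destination counts for two kinds of code.
integer_heavy = {"integer": 70, "fp": 10, "simd": 5, "memory": 30}
balanced = {"integer": 35, "fp": 35, "simd": 35, "memory": 35}

for name, demand in (("integer-heavy", integer_heavy), ("balanced", balanced)):
    print(name,
          "| partitioned stalls:", partitioned_runs_out(demand, 40),
          "| unified stalls:", unified_runs_out(demand, 160))
```

Any mix that fits within four partitions of 40 also fits a pool of 160, but not the other way round: the integer-heavy mix totals 115 registers, which the unified pool absorbs, while the partitioned design stalls on its 40-entry integer share.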

The long data path added less than 1% overhead to instruction execution
latency (as measured on 5,000-odd traces of 1B instructions each from
Opteron).

An ISA can be implemented in different ways, and it can be used in
different ways.

Of course.

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 20:04:28 2025

On Thu, 17 Jul 2025 18:27:26 +0000, John Savard wrote:

On Thu, 17 Jul 2025 18:03:25 +0000, John Savard wrote:

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files.

And speaking of ignoring features:

You've noted that it doesn't really make sense to include Cray-style
vector capabilities on a modern microprocessor, since they require
large memory bandwidth to be effective.

Since I'm designing the Concertina II ISA to support every need, I
do include such capabilities in my Long Vector instructions.

A vector register, in my architecture, consists of storage for
64 double-precision floating-point numbers. Those are certainly a
big pain to save and restore.

The problem with CRAY-like vector architectures is that you have
to fundamentally build the memory units to send 2 new fetches
and 1 write beyond L1 cache every cycle. That is the L2 has to
service 2 LD misses and 1 ST miss every cycle continuously for-
ever !!! and so does the Bus interface !!! and so does DRAM !!!

If you can do the above, the overhead of LDing and STing the file
becomes inconsequential.
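
To put a rough number on that requirement, here is a small Python sketch of the arithmetic. It only multiplies out "2 loads plus 1 store per cycle"; the function name `sustained_bandwidth`, the 8-byte element size, and the 1 GHz example clock are assumptions made for illustration and do not describe any particular machine.

```python
def sustained_bandwidth(clock_hz, lanes=1, loads=2, stores=1, elem_bytes=8):
    """Bytes per second the memory system must sustain past L1.

    Assumes every lane issues `loads` loads and `stores` stores of
    `elem_bytes` bytes each on every cycle (hypothetical parameters).
    """
    return (loads + stores) * elem_bytes * lanes * clock_hz


# Assumed example: one lane at 1 GHz with 8-byte elements.
one_lane = sustained_bandwidth(1_000_000_000)
# Assumed example: four lanes at the same clock.
four_lanes = sustained_bandwidth(1_000_000_000, lanes=4)
print(one_lane / 1e9, "GB/s for one lane")
print(four_lanes / 1e9, "GB/s for four lanes")
```

The point of the sketch is that this figure has to be met at every level (L2, bus interface, DRAM) continuously, not just as a peak rate.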

Few can afford the costs to build a CRAY-like vector machine
more due to pin counts, interconnect, and DRAM memory; than CPU
internals.

So what I envisage is, even in implementations of the ISA which don't
subset out the long vector feature, there will only be one set of
long vector registers in any core, and the operating system will
disable the feature by default. Programs will have to request use of
the long vector capability, and those programs will, basically, run
in batch mode (they can still communicate with the user, if they take
over the computer, i.e. as a video game) - just one such program runs
at a time, to reduce the need for saving and restoring all those
registers.

What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)

The ISA will include a number of features with the characteristic that
they are capable of improving performance when used appropriately - and
that they will massively degrade performance if used badly.

I consider that sort of thing the programmer's problem.

Like putting a driver who has never experienced anything but a road
car in a rally car where each circuit is on its own switch and relay
so the (crashed) car can still finish the race.

I'm looking more at designing an ISA that can be used to make
supercomputers than an ISA that is idiot proof.

You cannot make anything that is Turing complete that is also idiot
proof--idiots are more clever than that.

Particularly as an old saying claims that doing the latter is a losing
battle: they'll always design better idiots.

Idiots will stumble upon ways you never thought of.

John Savard

From John Savard@21:1/5 to Stephen Fuld on Thu Jul 17 20:32:44 2025

On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

- integers and floating-point numbers are different in format, so
it is not useful to perform operations meant for one type of number
on the other type of number;

- the opcode indicates whether an operation is an integer operation
or a floating-point operation, so having these two sets of registers
lets you have twice as many registers without having to add a bit
to the register fields in the instruction.

The third point, of course, is the only _real_ advantage.

*And* my architecture is specified as performing a transformation on
floating-point numbers during load and store operations to make
register-to-register arithmetic faster.

So I'm taking advantage of the need for separate floating-point
load and store instructions to derive a performance gain from
them. (The problem is that the hidden first bit and denormals,
if properly handled, only involve a small number of gate delays,
so this is unlikely to produce a genuine advantage.)

John Savard

From John Savard@21:1/5 to All on Thu Jul 17 21:00:36 2025

On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

Few can afford the costs to build a CRAY-like vector machine more due to
pin counts, interconnect, and DRAM memory; than CPU internals.

I do remember seeing a photo of the CPU from the NEC SX6;
it was connected to a memory bus that was sixteen times as
wide as a normal PC memory bus.

Using dual-channel or quad-channel memory on high-performance
PCs is quite conventional these days.

So, while I agree that doing what the SX6 did is not
likely to be feasible except in extreme cases, splitting
the difference and using an eight-channel interface is
something I suspect would be doable.

Thus, I figure that *half* the performance of a NEC Aurora
TSUBASA would already be good enough to be a big improvement
over conventional microprocessors.

Initially, before having your input, the reason I started putting
in a CRAY-like vector feature in the original Concertina design,
aside from simply wanting to explain how it worked, and how it
differed from modern SIMD vector designs, was that...

microprocessors seemed to have evolved from 8-bit designs to
minicomputer-like designs to mainframe-like designs. The
Pentium Pro and Pentium II took this to the limit of mainframe-like
designs, by attaining a performance-oriented design strongly
resembling the IBM System/360 Model 195.

What else did computers, prior to the microcomputer era do to be
more powerful? What else remained for further progress? Well, there
was _one_ old computer that went beyond the 195... the CRAY-I and
those which followed it.

So it seemed like there was still one step to take in making individual
cores more powerful before going to the expedient of putting a
parallel sysplex system on a chip (IBM-speak for multicore).

That was naive reasoning, of course, so I can't really mount a
strong counterargument against your claims that this is not workable.

After all, although NEC _has_ managed to continue to make a
vector supercomputer design for today, it's in a video card style
form factor instead of being a CPU that sits directly on top of
a motherboard.

What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)

Obviously, there is _nothing_ that can be done with programs that
require a feature that isn't implemented which is fully
upwards and downwards compatible. For full compatibility, every
feature must be implemented.

The Concertina II ISA, however, is, as has been noted, rather
bloated. So I don't see it as at all unreasonable to divide the
architecture into a basic architecture, which all programs can
expect to have available, and specialized features which are
only present on special-purpose implementations.

Just as you wouldn't write code for a Cray, or a TMS320C6000,
and expect it to run on an IBM 360, the special-purpose
implementations of Concertina II are different enough in
function that they should be regarded as different machines -
even if they are also able to run standard programs for the
architecture.

The option, though, is also present to implement all features
for compatibility, but implement some features badly. So
some register files are too enormous to put on the chip? Fine;
put them in RAM!

John Savard

From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 21:29:57 2025

On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:

On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

Few can afford the costs to build a CRAY-like vector machine more due to
pin counts, interconnect, and DRAM memory; than CPU internals.

I do remember seeing a photo of the CPU from the NEC SX6;
it was connected to a memory bus that was sixteen times as
wide as a normal PC memory bus.

SX6 had 256-way interleaved memory with 2-stages of 16-ways each
both outgoing and incoming.

IIRC it was 4-lanes wide, which means the CPU ports (2LD and 1ST)
were equivalent to 8 LDs per cycle and 4 STs per cycle, and it
did not have a data cache.

Using dual-channel or quad-channel memory on high-performance
PCs is quite conventional these days.

Fully 2 decimal orders of magnitude too small.

So, while I agree that doing what the SX6 did is not
likely to be feasible except in extreme cases, splitting
the difference and using an eight-channel interface is
something I suspect would be doable.

Thus, I figure that *half* the performance of a NEC Aurora
TSUBASA would already be good enough to be a big improvement
over conventional microprocessors.

Initially, before having your input, the reason I started putting
in a CRAY-like vector feature in the original Concertina design,
aside from simply wanting to explain how it worked, and how it
differed from modern SIMD vector designs, was that...

microprocessors seemed to have evolved from 8-bit designs to
minicomputer-like designs to mainframe-like designs. The
Pentium Pro and Pentium II took this to the limit of mainframe-like
designs, by attaining a performance-oriented design strongly
resembling the IBM System/360 Model 195.

Leaving "mainframes" to support features micros did not::
mainly things like RAS, hot-plug, and uptime measured in
years to decades without a crash.

What else did computers, prior to the microcomputer era do to be
more powerful? What else remained for further progress? Well, there
was _one_ old computer that went beyond the 195... the CRAY-I and
those which followed it.

CRAY-like is the end of the In-Order evolution path*. Instructions
are in-order, calculations are in-order/FU, memory is in-order/bank;
and the rest is your problem.

(*) discounting Mill.

CDC 6600 was the beginning of the OoO path (discounting Stretch)
but the /91 introduced the technology to follow forward (Tomasulo).
Patterson (et al.) developed the Reorder Buffer and we have the modern
design paradigm.

The rest was widening the various paths (FETCH, DECODE, Execution Lanes)
and making caches as big as cycle times could afford.

So it seemed like there was still one step to take in making individual
cores more powerful before going to the expedient of putting a
parallel sysplex system on a chip (IBM-speak for multicore).

That was naive reasoning, of course, so I can't really mount a
strong counterargument against your claims that this is not workable.

After all, although NEC _has_ managed to continue to make a
vector supercomputer design for today, it's in a video card style
form factor instead of being a CPU that sits directly on top of
a motherboard.

Nor does it need 25 tons of cooling.

What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)

Obviously, there is _nothing_ that can be done with programs that
require a feature that isn't implemented which is fully
upwards and downwards compatible. For full compatibility, every
feature must be implemented.

The Concertina II ISA, however, is, as has been noted, rather
bloated. So I don't see it as at all unreasonable to divide the
architecture into a basic architecture, which all programs can
expect to have available, and specialized features which are
only present on special-purpose implementations.

Just as you wouldn't write code for a Cray, or a TMS320C6000,
and expect it to run on an IBM 360, the special-purpose
implementations of Concertina II are different enough in
function that they should be regarded as different machines -
even if they are also able to run standard programs for the
architecture.

The option, though, is also present to implement all features
for compatibility, but implement some features badly. So
some register files are too enormous to put on the chip? Fine;
put them in RAM!

John Savard

From John Savard@21:1/5 to John Savard on Thu Jul 17 21:26:52 2025

On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:

On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

Few can afford the costs to build a CRAY-like vector machine more due
to pin counts, interconnect, and DRAM memory; than CPU internals.

Initially, before having your input, the reason I started putting in a
CRAY-like vector feature in the original Concertina design, aside from
simply wanting to explain how it worked, and how it differed from modern
SIMD vector designs, was that...

microprocessors seemed to have evolved from 8-bit designs to
minicomputer-like designs to mainframe-like designs. The Pentium Pro and
Pentium II took this to the limit of mainframe-like designs, by
attaining a performance-oriented design strongly resembling the IBM
System/360 Model 195.

What else did computers, prior to the microcomputer era do to be more
powerful? What else remained for further progress? Well, there was _one_
old computer that went beyond the 195... the CRAY-I and those which
followed it.

And I should mention another reason why I seem to be resistant to
heeding your undoubtedly good advice.

The very reason the CRAY-I succeeded where the STAR-100 failed was that
the CRAY-I was built around vector registers, and it did its
arithmetic between those vector registers, only loading data into
them from, and writing data out from them to, the main memory.

In true RISC fashion.

So the CRAY-I was a design *specifically intended* to make wise and
sparing use of limited memory bandwidth.

So when you come along and tell me that a CRAY-I style design can't
possibly work without ginormous memory bandwidth, something in my
gut just rebels at the very thought.

Of course, the above does _not_ mean that you're wrong. The
CRAY-I and the STAR 100 were both machines from the old days,
before today's wide disparity between CPU speeds and memory
speeds emerged. So if the CRAY-I managed to just sneak under the
bar of what was needed to make memory *then* adequate to feed
CPUs *then*, while the STAR 100 failed at that... then it _is_
entirely reasonable to conclude that memory *now* is vastly
inadequate to feed even a CRAY-I style CPU *now*, even if a
STAR 100 style machine would be even worse.

And the fact that a modern microprocessor can have a cache which
is as big as the entire main memory of the CRAY-I is... not a
defense when you consider that a smartphone today has more FLOPS
than a CRAY-I did. In order for a CRAY-I like design to actually
improve on the performance of a desktop CPU, it actually does
have to have a balanced memory design.

John Savard

From John Savard@21:1/5 to All on Thu Jul 17 21:45:54 2025

On Thu, 17 Jul 2025 21:29:57 +0000, MitchAlsup1 wrote:

On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:

Using dual-channel or quad-channel memory on high-performance PCs is
quite conventional these days.

Fully 2 decimal orders of magnitude too small.

So, while I agree that doing what the SX6 did is not likely to be
feasible except in extreme cases, splitting the difference and using an
eight-channel interface is something I suspect would be doable.

Obviously, "twice as fast" isn't enough if 100 times as fast is what
is needed.

I was coming back to mention, hey, we've also got HBM, but
presumably even that will fall short. Although HBM is presumably
what the NEC SX-Aurora TSUBASA is doing.

Their cards have either 24 or 48 gigabytes of internal memory;
24 is for development systems, 48 for real work.

What I would see as "doable" is 8 or 16 gigabytes of HBM on a
chip module with an eight-channel interface to main memory.

That would not be cheap. That would not match the performance
of a NEC SX-Aurora TSUBASA. But it would outdo an ordinary
desktop CPU.

Also, in reply to another comment you made which I didn't
quote:

Obviously, there is no such thing as a _rename_ vector register,
so, yes, the vector engine would be in-order. One of the big
strengths of the CRAY-I, as opposed to the STAR 100 and its
other competition at the time was that Seymour Cray understood
Amdahl's Law well enough to know that he couldn't neglect the
performance of the scalar part of his vector machine.

So the fact that an in-order CRAY-I style vector unit would be
bolted on to a GBOoO scalar CPU is... accepted as inevitable,
rather than seen as a contradiction, at least by this unworthy
one.

John Savard

From John Savard@21:1/5 to John Savard on Thu Jul 17 21:58:23 2025

On Thu, 17 Jul 2025 21:45:54 +0000, John Savard wrote:

So the fact that an in-order CRAY-I style vector unit would be bolted on
to a GBOoO scalar CPU is... accepted as inevitable, rather than seen as
a contradiction, at least by this unworthy one.

And *if*, as I very strongly doubt, there was some reason why an
out-of-order scalar unit would not mesh well with an in-order
vector unit, with pipeline delays or something keeping the vector
unit waiting too long for scalar results...

have you forgotten that the Concertina II ISA _also_ includes those
register banks with 128 registers in them, and those VLIW block formats
with break bits and instruction predication, so as to attempt to
approach OoO levels of performance within in-order designs?

I am not really attempting to compete with Mill. I don't consider
myself at all qualified to do so, and I am in awe at its amazing
originality. What Concertina II does instead is approach in-order
performance through well-known conventional means, without any
originality.

So if CRAY-I style vectors need high-performance *in-order* scalar
arithmetic to complement them for some obscure reason of which I
am not aware... Concertina II is ready!

John Savard

From Stephen Fuld@21:1/5 to John Savard on Thu Jul 17 21:35:31 2025

On 7/17/2025 1:32 PM, John Savard wrote:

On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.

Sorry, you are right. So you now have four sets of registers (integer,
float, SIMD, additional)?

That's a lot of chip area and wiring.

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

So if everyone else jumped off a roof????

- integers and floating-point numbers are different in format, so
it is not useful to perform operations meant for one type of number
on the other type of number;

Not necessarily. There are things that one wants to do to floating
point numbers that are "integer register like", such as extract the
exponent. I think in a related post, someone (Mitch?) gave a more
complete list. So you either have to provide extra instructions to do
these things on the FP registers, or suffer the cost of instructions to
move the value from the FP registers to the integer registers.

- the opcode indicates whether an operation is an integer operation
or a floating-point operation, so having these two sets of registers
lets you have twice as many registers without having to add a bit
to the register fields in the instruction.

True. The question is, how valuable are those extra registers? If you
already have 32 integer registers, isn't that enough for almost every
purpose?

The third point, of course, is the only _real_ advantage.

*And* my architecture is specified as performing a transformation on
floating-point numbers during load and store operations to make
register-to-register arithmetic faster.

Yes, I had forgotten about that.

So I'm taking advantage of the need for separate floating-point
load and store instructions to derive a performance gain from
them. (The problem is that the hidden first bit and denormals,
if properly handled, only involve a small number of gate delays,
so this is unlikely to produce a genuine advantage.)

OK, so scratch that advantage. :-(

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 11:18:32 2025

On Thu, 17 Jul 2025 21:35:31 -0700, Stephen Fuld wrote:

On 7/17/2025 1:32 PM, John Savard wrote:

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

So if everyone else jumped off a roof????

Then it would be more obvious they were making a mistake. Usually, if
not always, what most people do, they do for a reason. And usually, if
not always, that reason is actually good.

John Savard

From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 11:28:56 2025

On Thu, 17 Jul 2025 21:35:31 -0700, Stephen Fuld wrote:

That's a lot of chip area and wiring.

Yes, but then so is out-of-order execution. Having sets of 128 registers
as register banks gives designers the option of dropping OoO and still
having performance.

I mean, that was the idea behind the Itanium!

Oh, sorry, that may not be a recommendation. :)

OK, so scratch that advantage. :-(

Not so fast. A small number of gate delays may not seem like much.
But if one is desperate to make floating-point operations as fast
as possible in any way one can (I also plan to drop the demand of the
IEEE 754 standard that division always produce the best rounded
result, in order to speed up floating-point division by methods such
as Goldschmidt and Newton-Raphson) then, depending on how many gate
delays per cycle the rest of one's design leads to, those few gate
delays just might shave off a cycle somewhere.
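
For readers unfamiliar with the method, here is a minimal Python sketch of division by Newton-Raphson reciprocal refinement. It is an illustration under assumptions, not the Concertina II algorithm: the function name `nr_divide`, the linear seed, and the iteration count are all chosen for this example. The quotient can differ from the correctly rounded one in the last bit, which is the relaxation of IEEE 754 described above.

```python
import math


def nr_divide(a, b, iterations=4):
    """Approximate a / b for finite, nonzero b (illustrative sketch only)."""
    if b == 0.0 or math.isinf(b) or math.isnan(b):
        raise ValueError("sketch handles finite, nonzero divisors only")
    # Split |b| into m * 2**e with 0.5 <= m < 1.
    m, e = math.frexp(abs(b))
    # Linear seed for 1/m on [0.5, 1); its relative error is at most 1/17.
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    # Each step roughly squares the relative error of x as an estimate of 1/m.
    for _ in range(iterations):
        x = x * (2.0 - m * x)
    # Multiply by the reciprocal, then undo the exponent scaling.
    q = math.ldexp(a * x, -e)
    return -q if b < 0.0 else q


print(nr_divide(1.0, 3.0))
print(nr_divide(10.0, 4.0))
```

In hardware the seed would normally come from a small lookup table, and the multiplies would run on the existing multiplier, which is why the method is attractive when latency matters more than the last bit.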

"desperate to make floating-point operations as fast as possible"?

Yes, I'm channelling the ghost of Seymour Cray in more than one way.

John Savard

From John Savard@21:1/5 to John Savard on Fri Jul 18 11:33:07 2025

On Fri, 18 Jul 2025 11:28:56 +0000, John Savard wrote:

(I also plan to drop the demand of the IEEE
754 standard that division always produce the best rounded result, in
order to speed up floating-point division by methods such as Goldschmidt
and Newton-Raphson)

So I don't _always_ do what everyone else does.

John Savard

From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Jul 18 15:12:27 2025

On Fri, 18 Jul 2025 4:35:31 +0000, Stephen Fuld wrote:

On 7/17/2025 1:32 PM, John Savard wrote:

On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.

Sorry, you are right. So you now have four sets of registers (integer,
float, SIMD, additional)?

That's a lot of chip area and wiring.

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

So if everyone else jumped off a roof????

Where is Jim Jones when we need him ?!?

- integers and floating-point numbers are different in format, so
it is not useful to perform operations meant for one type of number
on the other type of number;

Not necessarily. There are things that one wants to do to floating
point numbers that are "integer register like", such as extract the
exponent. I think in a related post, someone (Mitch?) gave a more

Yes, it was me.

complete list. So you either have to provide extra instructions to do
these things on the FP registers, or suffer the cost of instructions to
move the value from the FP registers to the integer registers.

Multiply by power of 2 is integer add to exponent
Divide by power of 2 is integer subtract from exponent
Copysign is essentially a MOV where 1 bit comes from S1 and the rest
come from S2
Exponent is shift down by fraction size, mask off sign, and debias
Fraction is mask off sign and exponent and create hidden bit (expon!=0)
Split is mask off 1/2 the fraction bits and FSUB

You see there is a lot of logical operation and shifting going on here,
and a bit of integer arithmetic.
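
As a hedged illustration of the list above, here is a minimal Python sketch that performs several of those operations on the raw bits of an IEEE 754 binary64 value using only shifts, masks, and integer adds. The helper names (`bits`, `from_bits`, `exponent`, `fraction`, `scale_pow2`, `copysign_bits`) are made up for this example, and `scale_pow2` deliberately ignores zero, denormals, infinities, NaNs, and exponent overflow, all of which real hardware has to handle.

```python
import struct

FRAC_BITS = 52                 # binary64 fraction width
EXP_MASK = 0x7FF               # 11-bit exponent field
BIAS = 1023                    # binary64 exponent bias
SIGN_BIT = 1 << 63


def bits(x):
    """Reinterpret a float as its 64-bit integer pattern."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b):
    """Reinterpret a 64-bit integer pattern as a float."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def exponent(x):
    # Shift down by fraction size, mask off sign, and debias.
    return ((bits(x) >> FRAC_BITS) & EXP_MASK) - BIAS


def fraction(x):
    # Mask off sign and exponent; create the hidden bit when exponent != 0.
    b = bits(x)
    hidden = (1 << FRAC_BITS) if (b >> FRAC_BITS) & EXP_MASK else 0
    return (b & ((1 << FRAC_BITS) - 1)) | hidden


def scale_pow2(x, k):
    # Multiply by 2**k (divide when k < 0): integer add to the exponent field.
    return from_bits(bits(x) + (k << FRAC_BITS))


def copysign_bits(s1, s2):
    # One bit (the sign) comes from s1, the rest come from s2.
    return from_bits((bits(s1) & SIGN_BIT) | (bits(s2) & (SIGN_BIT - 1)))


print(exponent(8.0), scale_pow2(3.0, 4), copysign_bits(-1.0, 2.5))
```

Everything here is ordinary integer and logical work, which is the observation the list is making.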

So if you have all the logical, integer + and - and shifts--you
basically have a complete integer FU. So, then, you are in the position
to use the FP file as an extended integer file. And at that point, what
did having a separate file BUY ?!? What you really have is a unified FP
file and a degenerate integer file.

AND THEN there is IMUL and IDIV--these are fairly easy to perform
over in the FMAC, so now you need a path from IRF to FMAC, and
you end up doing almost everything in the FPU !!! So, why do this
to yourself ??

AND THEN there is context switching, ...
AND THEN there is the OS wanting to use SIMD for page-sized MOVes...

- the opcode indicates whether an operation is an integer operation
or a floating-point operation, so having these two sets of registers
lets you have twice as many registers without having to add a bit
to the register fields in the instruction.

True. The question is, how valuable are those extra registers? If you
already have 32 integer registers, isn't that enough for almost every
purpose?

The third point, of course, is the only _real_ advantage.

*And* my architecture is specified as performing a transformation on
floating-point numbers during load and store operations to make
register-to-register arithmetic faster.

K6 and K7 (Athlon) did those kinds of things--we took that out in
Opteron due to lots of boundary conditions (especially MMX and SSE
stuff).

Yes, I had forgotten about that.

So I'm taking advantage of the need for separate floating-point
load and store instructions to derive a performance gain from
them. (The problem is that the hidden first bit and denormals,
if properly handled, only involve a small number of gate delays,
so this is unlikely to produce a genuine advantage.)

You will be surprised during verification.

OK, so scratch that advantage. :-(

From Waldek Hebisch@21:1/5 to John Savard on Fri Jul 18 15:39:58 2025

John Savard wrote:

On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

Few can afford the costs to build a CRAY-like vector machine more due to
pin counts, interconnect, and DRAM memory; than CPU internals.

I do remember seeing a photo of the CPU from the NEC SX6;
it was connected to a memory bus that was sixteen times as
wide as a normal PC memory bus.

Using dual-channel or quad-channel memory on high-performance
PCs is quite conventional these days.

So, while I agree that doing what the SX6 did is not
likely to be feasible except in extreme cases, splitting
the difference and using an eight-channel interface is
something I suspect would be doable.

Thus, I figure that *half* the performance of a NEC Aurora
TSUBASA would already be good enough to be a big improvement
over conventional microprocessors.

IIUC on a modern machine AVX instructions can saturate the L1 cache.
And when you do multiply-and-add they use all available power.
So this looks very well balanced: to get any improvement in
performance you need both more power-efficient execution units
(which probably means lowering the clock, increasing latency and
compensating by having more execution units) and more L1 bandwidth.
More L1 bandwidth probably also means more latency. One can
probably get some extra performance when the computation fits within
the register set. But the current AVX register set is probably the
limit of what is possible: more registers mean more latency on
access, and bigger registers need more wires and more area, which
means longer wires, which also leads to slower access.

Once your computation needs data from a deeper level of the cache
hierarchy you have plenty of compute power but are limited by
data access (both bandwidth and latency).

So there is simply no way long vector registers could help.

You could try some GPU tricks, but usefully blending GPU and
CPU looks tricky: a GPU wants to run at a relatively low clock
frequency, and slowing down the CPU clock would severely limit
single-thread performance. High-end GPUs use special memory which
offers higher bandwidth, but has lower capacity than normal
memory.

What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)

Obviously, there is _nothing_ that can be done with programs that
require a feature that isn't implemented which is fully
upwards and downwards compatible. For full compatibility, every
feature must be implemented.

The Concertina II ISA, however, is, as has been noted, rather
bloated. So I don't see it as at all unreasonable to divide the
architecture into a basic architecture, which all programs can
expect to have available, and specialized features which are
only present on special-purpose implementations.

Just as you wouldn't write code for a Cray, or a TMS320C6000,
and expect it to run on an IBM 360, the special-purpose
implementations of Concertina II are different enough in
function that they should be regarded as different machines -
even if they are also able to run standard programs for the
architecture.

Well, you spend a lot of effort putting all those instructions
into a single instruction set. But then you split it into
mutually incompatible subsets. It looks as if designing
your instruction subsets as semi-independent would allow a better
design for each individual subset. What is the gain from a
single unified instruction set if you do not want to
implement all of it, but only subsets?

--
Waldek Hebisch

From Stephen Fuld@21:1/5 to John Savard on Fri Jul 18 08:46:43 2025

On 7/17/2025 1:32 PM, John Savard wrote:

On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

Although I gave a flip, and snarky, response to this before, I got to
thinking about why it is true. I came up with the following, though it
is a quick and dirty answer and I welcome others' comments and corrections.

First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers;
Univac, CDC and both Burroughs architectures did not.

I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.

I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies. But in the early
microprocessor era, it was clear that due to chip limitations, FP had to
be on a separate chip, e.g. the 8087, and the cost of crossing a chip
boundary several times for each FP instruction, which would have been
needed if there were no separate on-FPU registers, would not have been
practical. Backward compatibility led to this decision being
propagated to future generations. I guess Intel could have changed
when they added the non-8087 FP instructions, but by then the mind was
set. Also, the X86 had fewer "normal" registers, so the argument about
more registers had some weight. With new, clean sheet designs this is
much less of an issue. And even for clean sheet designs, if it is
desired to have FP as an optional feature, that feature would not be a
separate chip, thus eliminating that motivation for separate registers.

So I believe that the arguments for separate FP registers, while once
valid, are no longer so.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 16:10:29 2025

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

Well, then, the extended register banks with 128 registers in them...

AMD 29K ?!?

are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).

Somewhat right:: given a pool of 128 (or 256) rename registers,
one can make even an 8-register machine run fast and rather
well. The thing is that a 32-register ISA has a 15%-18% speed
advantage over an 8-register machine--whereas a 64 register
machine only has a 3% speed advantage over a 32 register
machine (MIPS circa 1982). At some point not having enough
registers hurts (that somewhere is in the mid 20-s of registers)
and at some point the size of the RF limits read performance
(that somewhere is between 32 registers and 64 registers).

Hmm, IIUC for read performance what matters is physical register
file size. So if a machine has 128 physical registers and 64 architectural
registers, read access time should be essentially the same as on a
machine which has 128 physical registers and 8 architectural
registers. So with renaming, the cost of a large architectural
register set is in a different place than access to the register
file. I guess that one cost of a larger register set is the
space taken by register specifiers in instructions.
There is a cost in the renamer (but it is not clear to me how
significant this is). With a large architectural register
set there may be more pressure on physical registers, because
an architectural register may keep a dead value longer without
the CPU knowing this. There is a cost for context switches and
possibly for function calls (more potentially live values
which need saving).
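
The physical-register pressure point can be put in rough numbers. This is only a sketch, assuming a merged physical register file in which every architectural register always owns one physical register for its latest committed value, whether that value is live or dead. The function name `rename_headroom` is made up for illustration.

```python
def rename_headroom(physical_regs: int, architectural_regs: int) -> int:
    """Physical registers left over for in-flight (not yet retired) results.

    Assumption: merged-PRF renaming, where each architectural register
    pins exactly one physical register holding its committed value.
    """
    if architectural_regs > physical_regs:
        raise ValueError("need at least one physical register per architectural register")
    return physical_regs - architectural_regs


# Same 128-entry physical file, different architectural register counts.
for arch in (8, 32, 64):
    print(arch, rename_headroom(128, arch))
```

Under that assumption the read-port timing is the same in all three cases, but the 64-register ISA leaves only 64 physical registers for speculative results, against 120 for the 8-register ISA.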

--
Waldek Hebisch

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Fri Jul 18 12:29:18 2025

MitchAlsup1 wrote:

On Thu, 17 Jul 2025 16:28:56 +0000, Anton Ertl wrote:

John Savard <[email protected]d> writes:

With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.

In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO.

Very mythical.

No VAX (*the* CISC at the time of RISC development) implementation had
OoO, ever.

No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
during the first 10 years of IA-32's existence.

HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
one day after IA-32.

MIPS (RISC) had an OoO implementation since January 1996, i.e., two
months after IA-32.

Since the CISC chips had register files of 8
registers,

VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

and the RISC chips had register files of 32 registers,

ARM A32 has 16 registers.

(Given cache misses,
maybe the RISC chips still needed scoreboards.)

Apart from MIPS R2000/R3000, every RISC has waited for results to
become ready. In the beginning stopping the pipeline was a way to do
this, but once more silicon became available and performance demands
increased, other instructions often continued as far as possible.
That was often called scoreboarding, but Mitch Alsup tells us that
scoreboarding on the 66000 was something more sophisticated.

CDC 6600 (no third 0) had a SB that scheduled the start of an
instruction (at register read) and then later scheduled the completion
of an instruction (register write) in a way that prevented RAW, WAR,
and WAW hazards.

CDC6600 scoreboard gets rid of them by serializing conflicting register accesses to the same register. SB ordering respects dataflow dependencies
not program ordering and so does not inherently enforce precise exceptions.
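
For readers who want the three hazard classes spelled out, here is a minimal sketch of how a pair of instructions can be classified by register conflicts. It is not a model of the 6600 scoreboard itself; the `Instr` type and the register names are assumptions made for the example.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def hazards(earlier: Instr, later: Instr) -> Set[str]:
    """Register hazards that constrain `later` relative to `earlier`."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")    # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        found.add("WAR")    # later overwrites a register earlier still has to read
    if later.dest == earlier.dest:
        found.add("WAW")    # both write the same register
    return found


i1 = Instr("X1", ("X2", "X3"))
i2 = Instr("X2", ("X1", "X4"))
i3 = Instr("X1", ("X5", "X6"))
print(sorted(hazards(i1, i2)))  # ['RAW', 'WAR']
print(sorted(hazards(i1, i3)))  # ['WAW']
```

A scoreboard that serializes on any of these conflicts delays `i2` and `i3` behind `i1`, even though only the RAW case carries data.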

6600 has no OoO load/store queue - its "stunt box" orders memory ops to
the same location but it only works after addresses have been written
into the 8 A registers. Since loading A registers would be data flow
dependent (as decided by the scoreboard) and not program order dependent
(as decided by an LSQ) I suspect one could break its memory ordering model
by simply causing loads of the same address into different A registers
to occur out of order with different address calculation delays.

Renaming gets rid of the xAW hazards in SBs, Reservation Stations,
Dispatch stacks, and others.

Rename with a separate Retire allows accesses to be performed concurrently
in any order while respecting precise exceptions.
LSQ ensures memory ops appear in program order.
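
A minimal sketch of why renaming removes the xAW hazards, assuming a simple map table plus free list. Retire and free-list recycling are left out on purpose, so the sketch just raises `IndexError` once the free list is empty. The function name `rename` and the tuple encoding are hypothetical.

```python
from collections import deque


def rename(program, num_arch, num_phys):
    """Rename (dest, src1, src2, ...) tuples of architectural register numbers.

    Assumptions: architectural register i starts out in physical register i,
    and physical registers are never returned to the free list (no retire).
    """
    map_table = list(range(num_arch))             # arch reg -> current phys reg
    free_list = deque(range(num_arch, num_phys))  # unused physical registers
    renamed = []
    for dest, *srcs in program:
        phys_srcs = tuple(map_table[s] for s in srcs)  # sources use the old mapping
        phys_dest = free_list.popleft()                # every write gets a fresh register
        map_table[dest] = phys_dest
        renamed.append((phys_dest, *phys_srcs))
    return renamed


# R1 = R2 op R3 ; R2 = R1 op R4 (RAW on R1, WAR on R2) ; R1 = R5 op R6 (WAW on R1)
program = [(1, 2, 3), (2, 1, 4), (1, 5, 6)]
print(rename(program, num_arch=8, num_phys=16))
# [(8, 2, 3), (9, 8, 4), (10, 5, 6)]
```

After renaming, every destination is distinct and no instruction writes a register that an earlier one reads, so only the RAW dependence on physical register 8 remains.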

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 16:37:54 2025

MitchAlsup1 <[email protected]> wrote:

Point of order:: all register files that have the same width (64-bits)
should be a single file. This makes varargs easier, allows using integer operations on FP operands (extract exponent, insert exponent, copysign)
which are mandated by the standards. Either you have an integer set of registers and a FP set of registers and a nearly complete set of integer operations on FP registers, or you can dispense with the nonsense and
have a single general purpose register file.

I have evidence (data) indicating My 66000 with only 32-registers
AND universal constants needs fewer registers than RISC-V with
32 integer and 32 FP registers on many applications, including
those you think need 32+32.

Does this matter much for a machine doing register renaming? IIUC
a machine may have separate architectural register sets but the
renamer can allocate registers from a single set. And the opposite:
a machine may have a unified architectural register set but the renamer
may allocate from separate sets, depending on the instruction.

--
Waldek Hebisch

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Stephen Fuld on Fri Jul 18 16:25:37 2025

Stephen Fuld <[email protected]d> writes:

On 7/17/2025 1:32 PM, John Savard wrote:

On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.

Having separate integer and floating-point registers?

- that is what nearly everyone else does;

Although I gave a flip, and snarky, response to this before, I got to
thinking about why it is true. I came up with the following, though it
is a quick and dirty answer and I welcome others' comments and corrections.

First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.

Burroughs medium systems had an accumulator register for floating
point[*]. non-floating point arithmetic was done memory-to-memory
without registers.

[*] Starting with the B4700. The B3500 supported memory-to-memory
floating point operations with up to 100 digits of mantissa (2 digit exponent).
The B4700 accumulator supported 20 digit mantissa with 2-digit exponent,
which was present in every subsequent generation through the V560.

Burroughs large systems held all operands on the stack (although
there were internal "registers" for top few stack elements).

There were other Burroughs processors (B1900, 220, B300, B800, B900, et alia), but I'm not familiar with the details thereof, although I believe the
B900 series supported floating point. The B1900 had writable control store, which could be swapped at context switch, so each supported language had
its own instruction set.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to Waldek Hebisch on Fri Jul 18 17:08:57 2025

On Fri, 18 Jul 2025 15:39:58 +0000, Waldek Hebisch wrote:

What
is a gain from single unified instruction set if you do not want to
implement all of it, but only subsets?

This is a valid point, but then I need to clarify one important thing.

I don't want to implement _only_ subsets. Implementing a subset is simply
an option. A very useful option for many applications. But implementing
the whole instruction set for a desktop PC chip is appropriate.

Also, if only subsets were implemented, programs written in the basic instruction set common to all the subsets would still run on all the
chips, so that means that there is a gain even in the "only subsets" case.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 17:14:08 2025

On Fri, 18 Jul 2025 08:46:43 -0700, Stephen Fuld wrote:

I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.

And assuming that to be the case, there's a smoking gun in the original System/360 architecture.

The original System/360 had only four floating-point registers. These
registers weren't numbered 0, 1, 2, and 3... and they weren't numbered
0, 4, 8 and 12 either.

They were numbered 0, 2, 4 and 6.

The floating-point registers on the System/360 were 64 bits long, while
the general registers were 32 bits long.

This could suggest that the decision to make the floating-point unit
an option, with its own set of registers, instead of using pairs of
general registers for floating-point numbers, came along late in the
design process in that case.

I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies.

Well, since minicomputers are smaller and cheaper, nearly all of them
only had floating-point as an optional feature, except perhaps for
some machines classed as superminis.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Stephen Fuld on Fri Jul 18 16:24:34 2025

Stephen Fuld <[email protected]d> writes:

On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.

In the CDC 6600, the A and B registers correspond to GPRs (they
support addresses), while the X registers correspond to FP registers;
they may also support integer operations, but not addresses.

I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.

I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies. But in the early
microprocessor era, it was clear that due to chip limitations, FP had to
be on a separate chip e.g. 8087, and the cost of crossing a chip
boundary several times for each FP instruction, which would have been
needed if there were no separate registers on the FPU, would not have been
practical.

The 8087 was so slow that the cost of moving stuff over would have
been only a small fraction of the total time. However, the 8086 has 8
16-bit registers, not enough to hold even two 80-bit numbers for the
8087.

Backward compatibility led to this decision being
promulgated to future generations. I guess Intel could have changed
when they added the non 8087 FP instructions, but by then the mind was
set.

You mean SSE and SSE2? Note that the XMM registers of SSEx are 128
bits in size, while the GPRs of IA-32 are 32 bits in size. And they
also support integer operations, but not addresses.

But yes, they could have expanded the GPRs to 128 bits, and let the
SSE and SSE2 instructions operate on these registers.

I think there are several reasons for having separate XMM registers:

1) Less register pressure in code that uses SSEx instructions. And
IA-32 does not have that many register names.

2) The XMM registers and the FPUs can be located separately.

3) Fewer register ports needed on each register file on superscalar implementations (i.e., all of them).

Yes, it has its costs in having to partially duplicate some integer
FUs, but they obviously thought that the benefits are worth it.

Interestingly, XMM (128 bit), YMM (256 bit), and ZMM (512 bit)
registers are not separated from each other.

Let's look at some other cases:

PA-RISC: FPU is not optional AFAIK, and integer multiplication at
least in early implementations uses the FP multiplier (like on
Willamette). PA-RISC started with a separate FP register set and soon
extended it to 58 registers or so.

88000: Started out with a unified register set (with FP doubles
represented by two 32-bit registers), and acquired a separate FP
register set with 80-bit registers in its second implementation 88110.

Power: Separate registers from the start. This may also have to do
with the first implementation being in several chips, with the FPU in
one chip, and the FXU (integer and load/store) in another chip.

Alpha: FP never was optional. Separate registers from the start, even
though its predecessor VAX has unified registers.

With new, clean sheet designs this is
much less of an issue. And even for clean sheet designs, if it is
desired to have FP an optional feature, that feature would not be a
separate chip, thus eliminating that motivation for separate registers

And yet Alpha and RISC-V were designed with separate FP registers.

One thing is interesting about IA-32/AMD64 FP, and likewise on ARM
A32/T32 and ARM A64: The FP instruction sets of each are present in
the respective 32-bit and 64-bit instruction sets, which in case of
ARM differs a lot from the 32-bit instruction set.

So I believe that the arguments for separate FP registers, while once
valid, are no longer so.

I think that

1) the register pressure issue (for a given number of encoding bits)
is still valid.

2) Not sure if distance on the chip has become more or less of a
problem in the last years.

3) Register ports supposedly are still at a premium.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to Stephen Fuld on Fri Jul 18 14:07:49 2025

Stephen Fuld wrote:

On 7/17/2025 11:03 AM, John Savard wrote:

On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:

On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode,
so user programs couldn't use that feature of the 6502.)

Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.

In no way am I denying this.

As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!

John,
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?

Note that the unified or separate architecture ISA registers and
unified or separate physical registers can be turned into each other
at the uArch level.

It can have two rename banks for float and int but allocate physical
registers from a single PRF, or a unified rename and allocate float
instruction destination registers from a float PRF,
and integer instructions from int PRF,
and record in the unified rename which PRF each Rn resides in.

One difference is if FP128 data items or larger are supported.
The unified PRF would use 128 or more bits to store 64 bit ints
so a large fraction of the PRF could be wasted.
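
To make the second arrangement concrete, here is a rough sketch of a unified rename table that records which PRF each Rn currently lives in, plus the FP128 waste arithmetic. The class and method names are invented for illustration, the initial placement of every register in the int PRF is an arbitrary assumption, and freeing of old mappings at retire is not modeled.

```python
class UnifiedRename:
    """Unified architectural registers backed by separate int and fp PRFs."""

    def __init__(self, num_arch: int, int_phys: int, fp_phys: int):
        if int_phys < num_arch:
            raise ValueError("int PRF must hold the initial architectural state")
        # Each entry is (prf_name, index): the rename table records the PRF too.
        self.table = [("int", i) for i in range(num_arch)]
        self.free = {
            "int": list(range(num_arch, int_phys)),
            "fp": list(range(fp_phys)),
        }

    def lookup(self, arch_reg: int):
        """Where a source operand Rn must be read from."""
        return self.table[arch_reg]

    def rename_dest(self, arch_reg: int, is_fp: bool):
        """Allocate the destination from the PRF matching the instruction type.

        Returns (new_mapping, old_mapping); a real design would free the old
        mapping when the instruction retires.
        """
        prf = "fp" if is_fp else "int"
        old = self.table[arch_reg]
        new = (prf, self.free[prf].pop(0))
        self.table[arch_reg] = new
        return new, old


def wasted_fraction(entry_bits: int, value_bits: int) -> float:
    """Fraction of a PRF entry left unused by a narrower value."""
    return 1 - value_bits / entry_bits


r = UnifiedRename(num_arch=32, int_phys=64, fp_phys=64)
print(r.rename_dest(5, is_fp=True))   # R5 moves to the fp PRF
print(r.lookup(5))
print(wasted_fraction(128, 64))       # 64-bit int in a 128-bit unified entry
```

In this sketch a 64-bit integer sitting in a 128-bit unified entry wastes half of it, which is the cost a split PRF avoids.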

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stephen Fuld@21:1/5 to Anton Ertl on Fri Jul 18 11:40:30 2025

On 7/18/2025 9:24 AM, Anton Ertl wrote:

Stephen Fuld <[email protected]d> writes:

On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.

In the CDC 6600, the A and B registers correspond to GPRs (they
support addresses), while the X registers correspond to FP registers;
they may also support integer operations, but not addresses.

Well, a different split of functions. That is essentially like the
Univac 1108, which had arithmetic registers that supported integer and
FP operations, and index registers for memory addressing. Isn't that
sort of like the Mot 68000? I do believe that CDC's X registers support
a full complement of integer operations.

big snip of good points about Intel's various SIMD register implementations.

Let's look at some other cases:

PA-RISC: FPU is not optional AFAIK, and integer multiplication at
least in early implementations uses the FP multiplier (like on
Willamette). PA-RISC started with a separate FP register set and soon extended it to 58 registers or so.

88000: Started out with a unified register set (with FP doubles
represented by two 32-bit registers), and acquired a separate FP
register set with 80-bit registers in its second implementation 88110.

Power: Separate registers from the start. This may also have to do
with the first implementation being in several chips, with the FPU in
one chip, and the FXU (integer and load/store) in another chip.

Alpha: FP never was optional. Separate registers from the start, even
though its predecessor VAX has unified registers.

Lots of good examples presenting counter evidence to my proposal. I
think it would be interesting to understand the motivations for some of
these.

So I believe that the arguments for separate FP registers, while once
valid, are no longer so.

I think that

1) the register pressure issue (for a given number of encoding bits)
is still valid.

Clearly this is a function of how many registers (and consequently
encoding bits) you have in the base design. If you only have 8, clearly
so. If you have 64, almost certainly not. The evidence Mitch presented
showed that 32 is about the knee of the curve.

2) Not sure if distance on the chip has become more or less of a
problem in the last years.

3) Register ports supposedly are still at a premium.

Good question, and beyond my abilities to answer.

Thanks, Anton. It seems I need more information, particularly about your
counter examples.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Fri Jul 18 19:39:43 2025

On Fri, 18 Jul 2025 17:14:08 +0000, John Savard wrote:

On Fri, 18 Jul 2025 08:46:43 -0700, Stephen Fuld wrote:

I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.

And assuming that to be the case, there's a smoking gun in the original System/360 architecture.

The original System/360 had only four floating-point registers. These registers weren't numbered 0, 1, 2, and 3... and they weren't numbered
0, 4, 8 and 12 either.

They were numbered 0, 2, 4 and 6.

The floating-point registers on the System/360 were 64 bits long, while
the general registers were 32 bits long.

This could suggest that the decision to make the floating-point unit
an option, with its own set of registers, instead of using pairs of
general registers for floating-point numbers, came along late in the
design process in that case.

I suggest emulation.

I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies.

Well, since minicomputers are smaller and cheaper, nearly all of them
only had floating-point as an optional feature, except perhaps for
some machines classed as superminis.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Fri Jul 18 19:38:03 2025

On Fri, 18 Jul 2025 17:08:57 +0000, John Savard wrote:

On Fri, 18 Jul 2025 15:39:58 +0000, Waldek Hebisch wrote:

What
is a gain from single unified instruction set if you do not want to
implement all of it, but only subsets?

Having a complete ISA and then allowing a few (say 3) subsets prevents
the various subsets from preventing all instructions from being encoded
(or decoded) simultaneously--something it seems RISC-V is trying to
clean up now.

This is a valid point, but then I need to clarify one important thing.

I don't want to implement _only_ subsets. Implementing a subset is simply
an option. A very useful option for many applications. But implementing
the whole instruction set for a desktop PC chip is appropriate.

Implementing integer-only with no IMUL/IDIV makes for a tiny controller
CPU that can do pointer chasing but not array indexing. Whether you want
something like this (or not) is the implementer's choice. HLL code written
in such a way that it does not need/use IMUL/IDIV or FP, ... will compile
and run just fine (preserving the software hierarchy).

Also, if only subsets were implemented, programs written in the basic
instruction set common to all the subsets would still run on all the
chips, so that means that there is a gain even in the "only subsets" case.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 18 19:45:19 2025

On Fri, 18 Jul 2025 16:24:34 +0000, Anton Ertl wrote:

Stephen Fuld <[email protected]d> writes:

On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.

In the CDC 6600, the A and B registers correspond to GPRs (they
support addresses), while the X registers correspond to FP registers;
they may also support integer operations, but not addresses.

I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.

I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies. But in the early
microprocessor era, it was clear that due to chip limitations, FP had to
be on a separate chip e.g. 8087, and the cost of crossing a chip
boundary several times for each FP instruction, which would have been
needed if there were no separate registers on the FPU, would not have been
practical.

The 8087 was so slow that the cost of moving stuff over would have
been only a small fraction of the total time. However, the 8086 has 8
16-bit registers, not enough to hold even two 80-bit numbers for the
8087.

Backward compatibility led to this decision being
promulgated to future generations. I guess Intel could have changed
when they added the non 8087 FP instructions, but by then the mind was
set.

You mean SSE and SSE2? Note that the XMM registers of SSEx are 128
bits in size, while the GPRs of IA-32 are 32 bits in size. And they
also support integer operations, but not addresses.

But yes, they could have expanded the GPRs to 128 bits, and let the
SSE and SSE2 instructions on these registers.

I think there are several reasons for having separate XMM registers:

1) Less register pressure in code that uses SSEx instructions. And
IA-32 does not have that many register names.

2) The XMM registers and the FPUs can be located separately.

3) Fewer register ports needed on each register file on superscalar implementations (i.e., all of them).

Yes, it has its costs in having to partially duplicate some integer
FUs, but they obviously thought that the benefits are worth it.

Interestingly, XMM (128 bit), YMM (256 bit), and ZMM (512 bit)
registers are not separated from each other.

Let's look at some other cases:

PA-RISC: FPU is not optional AFAIK, and integer multiplication at
least in early implementations uses the FP multiplier (like on
Willamette). PA-RISC started with a separate FP register set and soon extended it to 58 registers or so.

88000: Started out with a unified register set (with FP doubles
represented by two 32-bit registers), and acquired a separate FP
register set with 80-bit registers in its second implementation 88110.

This was MANDATED by Apple who then 1 month later jumped ship to PPC.

Power: Separate registers from the start. This may also have to do
with the first implementation being in several chips, with the FPU in
one chip, and the FXU (integer and load/store) in another chip.

Alpha: FP never was optional. Separate registers from the start, even
though its predecessor VAX has unified registers.

With new, clean sheet designs this is
much less of an issue. And even for clean sheet designs, if it is
desired to have FP an optional feature, that feature would not be a
separate chip, thus eliminating that motivation for separate registers

And yet Alpha and RISC-V were designed with separate FP registers.

One thing is interesting about IA-32/AMD64 FP, and likewise on ARM
A32/T32 and ARM A64: The FP instruction sets of each are present in
the respective 32-bit and 64-bit instruction sets, which in case of
ARM differs a lot from the 32-bit instruction set.

So I believe that the arguments for separate FP registers, while once
valid, are no longer so.

I think that

1) the register pressure issue (for a given number of encoding bits)
is still valid.

Then so is reserving registers for the dynamic linker to use,
reserving R0 as 0x0, placing IP in a GPR, ... mandating IMUL
produce a double wide result ...

2) Not sure if distance on the chip has become more or less of a
problem in the last years.

It is the product of need and frequency. If need is high and frequency
is high, it's a big problem. If either is moderate, the problem basically
vanishes.

3) Register ports supposedly are still at a premium.

A GBOoO machine gets 75%+ of its operands from the forwarding path...

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Savard on Fri Jul 18 19:47:49 2025

John Savard <[email protected]d> schrieb:

The very reason the CRAY-I succeeded where the STAR-100 failed was that
the CRAY-I was built around vector registers, and it did its
arithmetic between those vector registers, only loading data into
them from, and writing data out from them to, the main memory.

"The Seymour Cray Era of Supercomputers" attributes this
in large part to the fast scalar units of the Cray-1.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to Thomas Koenig on Sat Jul 19 02:51:08 2025

On Fri, 18 Jul 2025 19:47:49 +0000, Thomas Koenig wrote:

John Savard <[email protected]d> schrieb:

The very reason the CRAY-I succeeded where the STAR-100 failed was that
the CRAY-I was built around vector registers, and it did its arithmetic
between those vector registers, only loading data into them from, and
writing data out from them to, the main memory.

"The Seymour Cray Era of Supercomputers" attributes this in large parts
to the fast scalar units of the Cray-1.

Yes, the fact that Cray knew Amdahl's Law, and realized that real-world programs weren't 100% vector, and so his machine would have to have fast
scalar units is another reason his machines were very successful. In fact,
I noted this in another post in this very thread, but it's such a gigantic thread that I can't blame you for not having read every post in it.

Here, though, I focused on the vector registers, because they affected
memory bandwidth.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
