On Mon, 23 Jun 2025 12:43:05 +0000, quadibloc wrote:
Having noted that I was using up just about the very last dregs of the
available opcode space for block headers...
I decided to dig even deeper, and create the twelfth, thirteenth, and
fourteenth header types, which, together, make the possibilities of
expanding the instruction set truly limitless, by allowing up to 128
alternate instruction sets to be added to what is currently present.
Now there are fifteen header types. The new one was added as the sixth
header type, necessitating renumbering of those that came after.
The sixth header type is a 64-bit header that attaches a four-bit prefix
to every remaining 16 bits in the block.
It added the possibility of having 35-bit and 53-bit instructions.
I saw what I could use the 35-bit instructions for right away:
memory-to-register operate instructions that could have all 32
registers, not just the first eight, as destinations.
But shortly afterwards, I saw a use for the 53-bit instructions: a
modified form of the string and packed decimal instructions that could
use conventional addresses with 16-bit displacements, not just the
alternate types of address with shorter displacements.
And then when I stepped back and saw what I had achieved...
it hit me what one more thing I needed to add to complete it.
So one unused code in the three-bit prefixes used in the Type III
header, carried over to the four-bit prefixes here, was now given a
purpose. It was used to indicate a "Special 16-bit instruction".
This was like the operate instructions in the 17-bit short instructions,
with a full 5-bit source register and destination register field. But
there were only six bits for the opcode. No other operations were
provided, as 17-bit short instructions already provide those.
What operations are added to the short instruction repertoire by the
special 16-bit instructions?
Mainly floating-point instructions which use the Compatible
floating-point format.
You know, the one with a sign bit, a seven bit power-of-16 exponent, and
a mantissa that can be thought of as being made up of hexadecimal
digits.
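For readers who have not met a format of this kind, here is a minimal sketch of decoding one such word. It assumes the 32-bit System/360-style layout (excess-64 exponent, 24-bit fraction); the post gives only the sign bit, the seven-bit power-of-16 exponent and the hexadecimal mantissa, so the bias, the fraction width and the function name are assumptions made for illustration.

```python
def decode_hex_float32(word: int) -> float:
    """Decode a 32-bit hexadecimal floating-point word.

    Assumed layout (System/360 style, not stated in the post):
    bit 31 is the sign, bits 30-24 are an excess-64 power-of-16
    exponent, bits 23-0 are a fraction of six hexadecimal digits.
    """
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = ((word >> 24) & 0x7F) - 64          # remove the assumed bias
    fraction = (word & 0xFFFFFF) / float(1 << 24)  # 0 <= fraction < 1
    return sign * fraction * 16.0 ** exponent


print(decode_hex_float32(0x41100000))  # 1.0
print(decode_hex_float32(0xC276A000))  # -118.625
```

Because the exponent scales by 16 rather than 2, a normalized fraction can carry up to three leading zero bits, which is why the mantissa is most naturally read as hexadecimal digits.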
One other instruction is also included, as I don't need the whole
six-bit opcode for these instructions; the Save Return Address
instruction. I already had a way of doing this, a jump to subroutine
instruction with relative addressing and a displacement of zero, but
that's 32 bits long.
So with the 25% overhead of the Type VI header... I now have achieved
almost complete isomorphism with a popular make of computer... that is,
the ability to perform the same function as any of its instructions with
an instruction that is, sort of, the same length. Thus simplifying
program conversion.
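The 25% figure can be checked with a little arithmetic. The sketch below assumes a 256-bit block, which is inferred from a 64-bit header being 25% of it and is not stated in this post; the function and key names are hypothetical.

```python
BLOCK_BITS = 256   # assumption: inferred from 64 bits being 25% of a block
HEADER_BITS = 64   # stated in the post
UNIT_BITS = 16     # each remaining 16-bit unit receives a prefix
PREFIX_BITS = 4    # stated in the post


def header_budget(block_bits=BLOCK_BITS, header_bits=HEADER_BITS):
    """Return the overhead and how the header bits divide up."""
    payload_units = (block_bits - header_bits) // UNIT_BITS
    prefix_bits = payload_units * PREFIX_BITS
    return {
        "overhead": header_bits / block_bits,
        "payload_units": payload_units,
        "prefix_bits": prefix_bits,
        "header_bits_left": header_bits - prefix_bits,
    }


print(header_budget())
```

Under these assumptions the header covers twelve 16-bit units with 48 bits of prefixes, leaving 16 header bits for anything else.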
Of course, I'm sure there are a few instructions not covered, but this
handles at least the bunch of its ordinary non-privileged instructions.
So from what seemed to be wretched excess that would risk making me
dissatisfied with this iteration of the architecture, I have instead
achieved something that makes me very satisfied and reluctant to move
on.
John Savard
On Tue, 24 Jun 2025 17:19:05 +0000, quadibloc wrote:
So with the 25% overhead of the Type VI header... I now have achieved
almost complete isomorphism with a popular make of computer... that is,
the ability to perform the same function as any of its instructions
with an instruction that is, sort of, the same length. Thus simplifying
program conversion.
That and $5 will buy you a cup of coffee--until you write a compiler and
start counting how many/few instructions you need to perform
applications.
Of course, I'm sure there are a few instructions not covered, but this
handles at least the bunch of its ordinary non-privileged instructions.
For the record, My 66000 has a single instruction that has any notion of
privilege. How many do you think you will need ?!?
On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:
Thus not only is the classic RISC style of programming accommodated, if
perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed
in the space of eight.
I have found a way to address the imperfections, so this new header now
allows porting RISC code to Concertina II with little conversion.
On 7/2/2025 10:52 PM, John Savard wrote:
I think it will look familiar, and will represent the absolute height of
insanity, the temptation to add to Concertina II being too strong for me
to resist.
By my count, you have just under a "gazillion" instructions, instruction
formats, etc. :-)
Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit your
"insanity".
On Wed, 02 Jul 2025 07:04:06 +0000, John Savard wrote:
On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:
Thus not only is the classic RISC style of programming accommodated, if
perhaps still imperfectly, but another avenue is provided to make
programs at least slightly more compact, with ten instructions placed
in the space of eight.
I have found a way to address the imperfections, so this new header now
allows porting RISC code to Concertina II with little conversion.
My bizarre version of RISC code, in which memory-reference instructions
are 30 bits long, while operate instructions are 24 bits long, has now
been accompanied by a different instruction format requiring another form
of the new Type III header.
This one involves dividing the block into three parts instead of five,
and each of those three parts may contain three instructions that are
27 bits long or thereabouts.
I think it will look familiar, and will represent the absolute height of
insanity, the temptation to add to Concertina II being too strong for me
to resist.
Stephen Fuld <[email protected]d> wrote:
By my count, you have just under a "gazillion" instructions, instruction
formats, etc. :-)
Have you figured out how much combinatorial logic and how many gate
delays it will take to decode all of them? That might help to limit
your "insanity".
That is an excellent idea. John, if you write down the Boolean equations or
the truth tables for your instruction decoding, then try to simplify
them with espresso or a tool which does multi-level logic optimization
like Berkeley ABC, you will get a much better idea of how complicated
your design actually is. ABC also does some delay calculations for you
if you map your design to a library. Highly instructive.
On Wed, 02 Jul 2025 22:57:19 -0700, Stephen Fuld wrote:
By my count, you have just under a "gazillion" instructions, instruction
formats, etc.
Well, I have now removed the pseudo-RISC and imitation Itanium
instruction formats. Instead, I've added 18-bit short instructions,
which I feel are a more appropriate and effective way of reaching the
goal of improving the ability of programs to make use of the
superscalar potential of implementations of the architecture.
Without an explicit indication of parallelism, the real
theoretical maximum is sixteen-way, with 14-bit instructions
in the case without headers, and there's nothing much I can
do to improve the ease of access to that.
John Savard
On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:
In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can I
possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?
Also, this exercise had a useful consequence. Adding a new instruction
format that made it easy to achieve code that can take advantage of
nine-way superscalar operation led me to review what the rest of the
instruction set was doing.
A previous addition made ten-way superscalar operation possible, but
without any explicit indication of parallelism to promote it.
With 17-bit short instructions, fourteen-way superscalar operation can
be called upon without an explicit indication of parallelism; with one,
though, that drops to eleven-way.
But the maximum of 14-way could only be called upon with 14-bit
instructions in the case where the pairs of instructions, at least,
had an explicit indication of parallelism. I had enough opcode space
available so that I was able to improve this to also use 15-bit
instructions, to at least make it slightly more likely that the full
superscalar power potentially available in this design could be used.
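The widths quoted in this thread can be sanity-checked by counting bits. This is only a sketch: it assumes a 256-bit block (the fetch width mentioned elsewhere in the thread), and it says nothing about how the leftover bits are actually spent, since the posts do not give the header layouts. The function name is hypothetical.

```python
BLOCK_BITS = 256  # assumption: one block equals one 256-bit fetch


def leftover_bits(count, width, block_bits=BLOCK_BITS):
    """Bits left in a block after `count` instructions of `width` bits."""
    used = count * width
    if used > block_bits:
        raise ValueError(f"{count} x {width} bits does not fit in {block_bits}")
    return block_bits - used


# (description, instruction count, instruction width in bits)
cases = [
    ("nine 27-bit instructions", 9, 27),
    ("fourteen 17-bit instructions", 14, 17),
    ("fifteen 17-bit instructions", 15, 17),
]
for name, count, width in cases:
    print(f"{name}: {leftover_bits(count, width)} bits left over")
```

Fifteen 17-bit instructions would leave a single bit, so a fourteen-way limit is at least consistent with some of the block being needed for a header or prefixes.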
John Savard
On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:
Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions -
and the 18-bit instructions fell short of my goal in one important
respect. So I think they will be added soon as well.
I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition
that I have managed to post it to the pages, specifically:
http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm
On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:
On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:
Also, I've noted that while I had to somewhat re-organize the block
headers for 18-bit short instructions, I could have just used other
option values for an existing format for 19-bit short instructions -
and the 18-bit instructions fell short of my goal in one important
respect. So I think they will be added soon as well.
I thought I was going to have to postpone it until I got back from doing
an errand for a friend, but this was so simple and quick an addition
that I have managed to post it to the pages, specifically:
http://www.quadibloc.com/arch/ct23int.htm
http://www.quadibloc.com/arch/cad0101.htm
But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
definitely symptomatic of the issue you've identified.
John Savard
On Thu, 3 Jul 2025 11:24:31 +0000, John Savard wrote:
On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:
In any case, until sanity overtakes me, and I remove this new feature
from the Concertina II design, I have modified it to add instructions
which make use of the extended register banks. I mean, really: how can I
possibly omit the most important attribute required to give this
instruction format its rightful Itanium nature?
Also, this exercise had a useful consequence. Adding a new instruction
format that made it easy to achieve code that can take advantage of
nine-way superscalar operation led me to review what the rest of the
instruction set was doing.
A previous addition made ten-way superscalar operation possible, but
without any explicit indication of parallelism to promote it.
With 17-bit short instructions, fourteen-way superscalar operation can
be called upon without an explicit indication of parallelism; with one,
though, that drops to eleven-way.
We cannot believe that until you produce a compiler.
OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction
boundary--AND--I have compiler, linker, ...
Nor do I have a zillion instructions, I only have 63 patterns to
recognize.
But the maximum of 14-way could only be called upon with 14-bit
instructions in the case where the pairs of instructions, at least,
had an explicit indication of parallelism. I had enough opcode space
available so that I was able to improve this to also use 15-bit
instructions, to at least make it slightly more likely that the full
superscalar power potentially available in this design could be used.
John Savard
John Savard <[email protected]d> wrote:
But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
definitely symptomatic of the issue you've identified.
Who is "you"? You didn't quote anybody else but yourself in there.
On Wed, 16 Jul 2025 01:08:19 +0000, MitchAlsup1 wrote:
OH and btw, I can achieve 16-way decode parallelism with a variable
length encoding and nothing that marks any kind of instruction
boundary--AND--I have compiler, linker, ...
Nor do I have a zillion instructions, I only have 63 patterns to
recognize.
I certainly acknowledge that I'm not as good as you at this sort
of thing.
Theoretically, because the architecture involves separate banks of
floating-point and integer registers, and there are both regular banks
with 32 registers, and extended banks with 128 registers, and
instruction formats that divide these registers into eight-register
groups (sort of like a register window, but not quite)... if it weren't
for the fact that I envisage only fetching 256 bits from memory in any
given cycle (of course, within loops, one can get instructions from
internal cache) this theoretically allows for 40-way superscalar
operation.
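The post does not spell out where the figure of 40 comes from. One reading, sketched below, is that it counts the eight-register groups across all four banks; that interpretation and the names used are assumptions.

```python
GROUP_SIZE = 8  # registers per group, as described above

# Bank sizes as described above: regular and extended, integer and float.
BANKS = {
    "integer regular": 32,
    "integer extended": 128,
    "float regular": 32,
    "float extended": 128,
}


def group_counts(banks=BANKS, group_size=GROUP_SIZE):
    """Number of eight-register groups in each bank."""
    return {name: size // group_size for name, size in banks.items()}


counts = group_counts()
print(counts)
print(sum(counts.values()))  # 40
```

If each group could feed one independent instruction stream, the count of groups would match the quoted 40-way ceiling.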
In practice, I doubt that anyone would write code, even carefully by
hand, that would even manage 14-way superscalar operation for very
long, so I admit it's unlikely to be terribly useful to include this in
most implementations.
The ISA is designed, though, so that (except for its immense bloat) it
could be used in a special-purpose CPU without OoO that's designed for
some kind of embedded use in, say, image processing or something where
it could be given that kind of specialized code to run.
A CPU designed instead for use in a desktop workstation would presumably
have microarchitectural capabilities appropriate to that application.
John Savard
On Thu, 17 Jul 2025 03:42:08 +0000, John Savard wrote:<snip>
On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:
Point of order:: all register files that have the same width (64-bits)
should be a single file.
This relates to a point that occurred to me.
And there are other things related to this that have occurred to me.
You've used the term "GBOoO" - Great Big out-of-order - to describe
the current offerings of companies like Intel and AMD.
You hadn't formally defined the term, at least not in any post that
I've noticed. For purposes of discussion below, I'm going to provide
a definition which may not correspond to what you were intending.
This definition is:
In "normal" out-of-order, each register has three rename registers
associated with it, for a total of 4.
In "big" out-of-order, each register has fifteen rename registers
associated with it, for a total of 16.
In "great big" out of order, each register has sixty-three rename
registers associated with it, for a total of 64.
With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.
In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO. Since the CISC chips had register files of 8
registers, and the RISC chips had register files of 32 registers,
the two were equivalent in performance. (Given cache misses,
maybe the RISC chips still needed scoreboards.)
And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.
This is the stage we would be at when I say that in my design,
the 32-register normal register banks would have OoO, but the
128-register extended register banks wouldn't.
If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO, by the definitions I've
given above, then my design, to keep up, would have to provide
"big" OoO for the normal register files, but only "normal" OoO
for the extended register files.
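The equivalences claimed in this "mythical history" follow from multiplying each register count by the total per register in its size class. A minimal sketch using only the numbers defined above (the class label "none" and the function name are illustrative):

```python
# Rename registers per architectural register, per the definitions above.
RENAMES = {"none": 0, "normal": 3, "big": 15, "great big": 63}


def physical_registers(architectural, ooo_class):
    """Architectural registers plus their rename registers."""
    return architectural * (1 + RENAMES[ooo_class])


stages = [
    ("CISC, 8 regs, normal OoO", 8, "normal"),
    ("RISC, 32 regs, no OoO", 32, "none"),
    ("CISC, 8 regs, big OoO", 8, "big"),
    ("RISC, 32 regs, normal OoO", 32, "normal"),
    ("extended bank, 128 regs, no OoO", 128, "none"),
    ("CISC, 8 regs, great big OoO", 8, "great big"),
    ("RISC, 32 regs, big OoO", 32, "big"),
    ("extended bank, 128 regs, normal OoO", 128, "normal"),
]
for name, regs, cls in stages:
    print(f"{name}: {physical_registers(regs, cls)}")
```

The three stages come out at 32, 128 and 512 physical registers respectively, which is the sense in which the designs at each stage are treated as equivalent.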
John Savard
On Thu, 17 Jul 2025 04:01:44 +0000, John Savard wrote:
If things have progressed further, so that RISC chips have "big" OoO and
CISC chips have "great big" OoO, by the definitions I've given above,
then my design, to keep up, would have to provide "big" OoO for the
normal register files, but only "normal" OoO for the extended register
files.
I think that a clarification is in order here.
I don't know, but I strongly suspect, that what you term
"Great Big out of order" is what I've called just "big"
out-of-order, and that this is already well past the point
of diminishing returns.
So that what I've called "great big" out of order is
instead something so far past the point of sanity that
I don't have to worry about it happening in real life.
But I could be wrong.
John Savard
On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:
Point of order:: all register files that have the same width (64-bits)
should be a single file.
This relates to a point that occurred to me.
Many CISC microprocessors had register banks of eight registers.
RISC had register banks of 32 registers, which they thought would
avoid the need for OoO. Increasing performance demands, though,
made that no longer true.
Well, then, the extended register banks with 128 registers in them...
are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).
Like fast context switches, which multiple register files PREVENTS !!
You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode, so user programs couldn't use that feature of the 6502.)
John Savard
On Thu, 17 Jul 2025 04:10:18 +0000, John Savard wrote:
I think that a clarification is in order here.
And, come to think of it, _another_ clarification may be needed.
If a register file with "normal" out-of-order, three rename
registers for each register, and 32 registers to a register bank, is
"equivalent" in performance to a register file with 128 registers and
no OoO support,
then the latter provides no performance benefit, so what is it there
for?
Someone might ask that who wasn't following my discussion of the
Concertina II design. So I think I had better re-iterate the point:
What the 128-register extended register files are _for_ is to
provide better performance on implementations that don't have OoO
at all, for any of the registers. They're also present on
implementations with OoO for *compatibility* reasons.
This is what may not be clear to some.
John Savard
On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:
You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode, so user programs couldn't use that feature of the 6502.)
Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.
On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:
You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode, so user programs couldn't use that feature of the 6502.)
Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.
In no way am I denying this.
As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!
An ISA can be implemented in different ways, and it can be used in
different ways.
John Savard <[email protected]d> writes:
With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.
In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO.
Very mythical.
No VAX (*the* CISC at the time of RISC development) implementation had
OoO, ever.
No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
during the first 10 years of IA-32's existence.
HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
one day after IA-32.
MIPS (RISC) had an OoO implementation since January 1996, i.e., two
months after IA-32.
Since the CISC chips had register files of 8
registers,
VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.
and the RISC chips had register files of 32 registers,
ARM A32 has 16 registers.
(Given cache misses,
maybe the RISC chips still needed scoreboards.)
Apart from MIPS R2000/R3000, every RISC has waited for results to
become ready. In the beginning stopping the pipeline was a way to do
this, but once more silicon became available and performance demands
increased, other instructions often continued as far as possible.
That was often called scoreboarding, but Mitch Alsup tells us that
scoreboarding on the 66000 was something more sophisticated.
And then the RISC chips got "normal" OoO, and to keep up, the
CISC chips got "big" OoO.
It would be an interesting task to unearth some PA-8000 or R10000
machine, and use Henry Wong's (IIRC) methodology to determine the
reorder buffer size and the renaming capacity of these chips. For the
Coppermine Pentium III the reorder buffer size is 40 entries.
If things have progressed further, so that RISC chips have "big"
OoO and CISC chips have "great big" OoO
You can find my collection of sizes of OoO resources on
https://www.complang.tuwien.ac.at/anton/robsize/
If we look in the year 2020, we see the following number of physical
GPRs:
192 Zen3 (AMD64, CISC)
280 Sunny/Willow Cove (AMD64, CISC)
354 Apple M1 Firestorm (ARM A64, RISC)
192 Samsung M5 (ARM A64, RISC)
So there does not seem to be a systematic difference in the number of
physical GPRs between CISC and RISC in recent years.
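To set these counts against the "total registers per architectural register" classes proposed earlier in the thread (4, 16, 64), one can divide by the architectural GPR count. This is a rough sketch: the physical counts are the ones listed above, while the architectural counts (16 for AMD64, 31 for ARM A64) are not from this post and are added here for illustration.

```python
# Physical GPR counts for 2020 cores, as listed above: (ISA, count).
PHYSICAL_GPRS = {
    "Zen3": ("AMD64", 192),
    "Sunny/Willow Cove": ("AMD64", 280),
    "Apple M1 Firestorm": ("ARM A64", 354),
    "Samsung M5": ("ARM A64", 192),
}

# Architectural GPR counts: an addition for illustration, not from the post.
ARCHITECTURAL_GPRS = {"AMD64": 16, "ARM A64": 31}


def physical_per_architectural(core):
    """Physical GPRs divided by the ISA's architectural GPR count."""
    isa, physical = PHYSICAL_GPRS[core]
    return physical / ARCHITECTURAL_GPRS[isa]


for core in PHYSICAL_GPRS:
    print(f"{core:20s} {physical_per_architectural(core):5.1f}")
```

The ratios land between roughly 6 and 18, so none of these cores comes near the 64 per register of the hypothetical "great big" class.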
- anton
On 7/17/2025 11:03 AM, John Savard wrote:
On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:
You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode, so user programs couldn't use that feature of the 6502.)
Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.
In no way am I denying this.
As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!
John,
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
An ISA can be implemented in different ways, and it can be used in
different ways.
Of course.
On Thu, 17 Jul 2025 18:03:25 +0000, John Savard wrote:
As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files.
And speaking of ignoring features:
You've noted that it doesn't really make sense to include Cray-style
vector capabilities on a modern microprocessor, since they require
large memory bandwidth to be effective.
Since I'm designing the Concertina II ISA to support every need, I
do include such capabilities in my Long Vector instructions.
A vector register, in my architecture, consists of storage for
64 double-precision floating-point numbers. Those are certainly a
big pain to save and restore.
So what I envisage is, even in implementations of the ISA which don't
subset out the long vector feature, there will only be one set of
long vector registers in any core, and the operating system will
disable the feature by default. Programs will have to request use of
the long vector capability, and those programs will, basically, run
in batch mode (they can still communicate with the user, if they take
over the computer, e.g. as a video game) - just one such program runs
at a time, to reduce the need for saving and restoring all those
registers.
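To put rough numbers on the save/restore cost, here is a sketch of the state sizes involved. It assumes 64-bit scalar registers and uses a hypothetical count of eight vector registers, since the post gives only the size of one vector register; the function names are illustrative.

```python
REGISTER_BYTES = 8  # assumption: 64-bit scalar registers


def bank_bytes(registers, banks=2):
    """Bytes held by `banks` register banks (integer plus floating-point)."""
    return registers * banks * REGISTER_BYTES


def vector_bytes(vector_registers, elements=64, element_bytes=8):
    """Bytes held by vector registers of 64 double-precision elements."""
    return vector_registers * elements * element_bytes


print(bank_bytes(32))    # regular banks: 512 bytes
print(bank_bytes(128))   # extended banks: 2048 bytes
print(vector_bytes(1))   # one vector register: 512 bytes
print(vector_bytes(8))   # hypothetical eight vector registers: 4096 bytes
```

A single vector register holds as much state as both regular banks together, which is the motivation for keeping one set per core and enabling it only on request.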
The ISA will include a number of features with the characteristic that
they are capable of improving performance when used appropriately - and
that they will massively degrade performance if used badly.
I consider that sort of thing the programmer's problem. I'm looking more
at designing an ISA that can be used to make supercomputers than an ISA
that is idiot proof.
Particularly as an old saying claims that doing the latter is a losing
battle: they'll always design better idiots.
John Savard
On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:
Few can afford the costs to build a CRAY-like vector machine, more because
of pin counts, interconnect, and DRAM memory than CPU internals.
I do remember seeing a photo of the CPU from the NEC SX6;
it was connected to a memory bus that was sixteen times as
wide as a normal PC memory bus.
Using dual-channel or quad-channel memory on high-performance
PCs is quite conventional these days.
So, while I agree that doing what the SX6 did is not
likely to be feasible except in extreme cases, splitting
the difference and using an eight-channel interface is
something I suspect would be doable.
Thus, I figure that *half* the performance of a NEC Aurora
TSUBASA would already be good enough to be a big improvement
over conventional microprocessors.
Initially, before having your input, the reason I started putting
in a CRAY-like vector feature in the original Concertina design,
aside from simply wanting to explain how it worked, and how it
differed from modern SIMD vector designs, was that...
microprocessors seemed to have evolved from 8-bit designs to minicomputer-like designs to mainframe-like designs. The
Pentium Pro and Pentium II took this to the limit of mainframe-like
designs, by attaining a performance-oriented design strongly
resembling the IBM System/360 Model 195.
What else did computers, prior to the microcomputer era do to be
more powerful? What else remained for further progress? Well, there
was _one_ old computer that went beyond the 195... the CRAY-I and
those which followed it.
So it seemed like there was still one step to take in making individual
cores more powerful before going to the expedient of putting a
parallel sysplex system on a chip (IBM-speak for multicore).
That was naive reasoning, of course, so I can't really mount a
strong counterargument against your claims that this is not workable.
After all, although NEC _has_ managed to continue to make a
vector supercomputer design for today, it's in a video card style
form factor instead of being a CPU that sits directly on top of
a motherboard.
What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)
Obviously, there is _nothing_ that can be done with programs that
require a feature that isn't implemented which is fully
upwards and downwards compatible. For full compatibility, every
feature must be implemented.
The Concertina II ISA, however, is, as has been noted, rather
bloated. So I don't see it as at all unreasonable to divide the
architecture into a basic architecture, which all programs can
expect to have available, and specialized features which are
only present on special-purpose implementations.
Just as you wouldn't write code for a Cray, or a TMS320C6000,
and expect it to run on an IBM 360, the special-purpose
implementations of Concertina II are different enough in
function that they should be regarded as different machines -
even if they are also able to run standard programs for the
architecture.
The option, though, is also present to implement all features
for compatibility, but implement some features badly. So
some register files are too enormous to put on the chip? Fine;
put them in RAM!
John Savard
On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:
Few can afford the costs to build a CRAY-like vector machine more due
to pin counts, interconnect, and DRAM memory; than CPU internals.
Initially, before having your input, the reason I started putting in a CRAY-like vector feature in the original Concertina design,
aside from simply wanting to explain how it worked, and how it differed
from modern SIMD vector designs, was that...
microprocessors seemed to have evolved from 8-bit designs to minicomputer-like designs to mainframe-like designs. The Pentium Pro and Pentium II took this to the limit of mainframe-like designs, by
attaining a performance-oriented design strongly resembling the IBM System/360 Model 195.
What else did computers, prior to the microcomputer era do to be more powerful? What else remained for further progress? Well, there was _one_
old computer that went beyond the 195... the CRAY-I and those which
followed it.
On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:
Using dual-channel or quad-channel memory on high-performance PCs is
quite conventional these days.
Fully 2 decimal orders of magnitude too small.
So, while I agree that doing what the SX6 did is not likely to be
feasible except in extreme cases, splitting the difference and using an
eight-channel interface is something I suspect would be doable.
So the fact that an in-order CRAY-I style vector unit would be bolted on
to a GBOoO scalar CPU is... accepted as inevitable, rather than seen as
a contradiction, at least by this unworthy one.
On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.
Having separate integer and floating-point registers?
- that is what nearly everyone else does;
- integers and floating-point numbers are different in format, so
it is not useful to perform operations meant for one type of number
on the other type of number;
- the opcode indicates whether an operation is an integer operation
or a floating-point operation, so having these two sets of registers
lets you have twice as many registers without having to add a bit
to the register fields in the instruction.
The third point, of course, is the only _real_ advantage.
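As a rough illustration of that third point, here is a minimal sketch in Python. The 16-bit instruction layout is invented purely for this example (it is not a Concertina II format), and the names decode and addressable_registers are hypothetical. It only shows that when the opcode already says "integer" or "floating-point", the same 5-bit field can name a register in either bank, so 64 registers are reachable with 5-bit fields.

```python
# Hypothetical 16-bit operate format, invented for illustration only:
#   bit 15      : 1 = floating-point operation, 0 = integer operation
#   bits 14..10 : destination register number (5 bits)
#   bits 9..5   : first source register number (5 bits)
#   bits 4..0   : second source register number (5 bits)

def decode(word: int) -> tuple:
    """Return (dest, src1, src2) register names for one instruction word."""
    # The opcode bit, not the register field, selects the register bank.
    bank = "F" if (word >> 15) & 1 else "R"
    dest = (word >> 10) & 0x1F
    src1 = (word >> 5) & 0x1F
    src2 = word & 0x1F
    return tuple(f"{bank}{n}" for n in (dest, src1, src2))


def addressable_registers(field_bits: int, banks: int) -> int:
    """Number of distinct registers nameable with the given field width."""
    return banks * (1 << field_bits)


if __name__ == "__main__":
    print(decode(0b0_00111_00001_00010))   # integer bank
    print(decode(0b1_00111_00001_00010))   # floating-point bank, same fields
    print(addressable_registers(5, 2))
```

The bit that selects the bank is not free, of course: it is part of the opcode, which has to distinguish integer from floating-point operations anyway.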
*And* my architecture is specified as performing a transformation on floating-point numbers during load and store operations to make register-to-register arithmetic faster.
So I'm taking advantage of the need for separate floating-point
load and store instructions to derive a performance gain from
them. (The problem is that the hidden first bit and denormals,
if properly handled, only involve a small number of gate delays,
so this is unlikely to produce a genuine advantage.)
On 7/17/2025 1:32 PM, John Savard wrote:
Having separate integer and floating-point registers?
- that is what nearly everyone else does;
So if everyone else jumped off a roof????
That's a lot of chip area and wiring.
OK, so scratch that advantage. :-(
(I also plan to drop the demand of the IEEE
754 standard that division always produce the best rounded result, in
order to speed up floating-point division by methods such as Goldschmidt
and Newton-Raphson)
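For readers unfamiliar with the trade-off in that parenthetical, here is a minimal sketch of Newton-Raphson division in Python, assuming ordinary IEEE 754 double precision as provided by Python floats (not the Concertina II formats). The function name nr_divide and the iteration count are my own choices for illustration. The reciprocal of the divisor is refined by x <- x * (2 - m * x), which roughly doubles the number of correct bits per step; the final multiplication is rounded on its own, so the quotient may differ from the correctly rounded result in the last bit unless an extra correction step is added.

```python
import math


def nr_divide(n: float, d: float, iterations: int = 6) -> float:
    """Approximate n / d using Newton-Raphson iteration on 1/d.

    Sketch only: no handling of infinities, NaNs, or subnormal results.
    """
    if d == 0.0:
        raise ZeroDivisionError("division by zero")
    # Scale the divisor so that 0.5 <= m < 1, where |d| = m * 2**e.
    m, e = math.frexp(d)
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    # Standard linear first guess for 1/m on [0.5, 1).
    x = 48.0 / 17.0 - (32.0 / 17.0) * m
    for _ in range(iterations):
        x = x * (2.0 - m * x)          # refine the reciprocal estimate
    # n / d = sign * (n * (1/m)) * 2**(-e); this product is rounded separately.
    return sign * math.ldexp(n * x, -e)


if __name__ == "__main__":
    q = nr_divide(355.0, 113.0)
    print(q, 355.0 / 113.0, q == 355.0 / 113.0)
```

Goldschmidt division uses the same kind of multiply-based refinement, but arranges the multiplications so that they can proceed in parallel.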
On 7/17/2025 1:32 PM, John Savard wrote:
On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.
Sorry, you are right. So you now have four sets of registers (integer, float, SIMD, additional)?
That's a lot of chip area and wiring.
Having separate integer and floating-point registers?
- that is what nearly everyone else does;
So if everyone else jumped off a roof????
- integers and floating-point numbers are different in format, so
it is not useful to perform operations meant for one type of number
on the other type of number;
Not necessarily. There are things that one wants to do to floating
point numbers that are "integer register like", such as extract the
exponent. I think in a related post, someone (Mitch?) gave a more
complete list. So you either have to provide extra instructions to do
these things on the FP registers, or suffer the cost of instructions to
move the value from the FP registers to the integer registers.
- the opcode indicates whether an operation is an integer operation
or a floating-point operation, so having these two sets of registers
lets you have twice as many registers without having to add a bit
to the register fields in the instruction.
True. The question is, how valuable those extra registers are? If you already have 32 integer registers, isn't that enough for almost every purpose?
The third point, of course, is the only _real_ advantage.
*And* my architecture is specified as performing a transformation on
floating-point numbers during load and store operations to make
register-to-register arithmetic faster.
Yes, I had forgotten about that.
So I'm taking advantage of the need for separate floating-point
load and store instructions to derive a performance gain from
them. (The problem is that the hidden first bit and denormals,
if properly handled, only involve a small number of gate delays,
so this is unlikely to produce a genuine advantage.)
OK, so scratch that advantage. :-(
On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:
Few can afford the costs to build a CRAY-like vector machine more due to
pin counts, interconnect, and DRAM memory; than CPU internals.
I do remember seeing a photo of the CPU from the NEC SX6;
it was connected to a memory bus that was sixteen times as
wide as a normal PC memory bus.
Using dual-channel or quad-channel memory on high-performance
PCs is quite conventional these days.
So, while I agree that doing what the SX6 did is not
likely to be feasible except in extreme cases, splitting
the difference and using an eight-channel interface is
something I suspect would be doable.
Thus, I figure that *half* the performance of a NEC Aurora
TSUBASA would already be good enough to be a big improvement
over conventional microprocessors.
What do you do with a program that requests said feature and it is NOT
PRESENT !! (That is compatible up and down)
Obviously, there is _nothing_ that can be done with programs that
require a feature that isn't implemented which is fully
upwards and downwards compatible. For full compatibility, every
feature must be implemented.
The Concertina II ISA, however, is, as has been noted, rather
bloated. So I don't see it as at all unreasonable to divide the
architecture into a basic architecture, which all programs can
expect to have available, and specialized features which are
only present on special-purpose implementations.
Just as you wouldn't write code for a Cray, or a TMS320C6000,
and expect it to run on an IBM 360, the special-purpose
implementations of Concertina II are different enough in
function that they should be regarded as different machines -
even if they are also able to run standard programs for the
architecture.
On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.
Having separate integer and floating-point registers?
- that is what nearly everyone else does;
On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:
Well, then, the extended register banks with 128 registers in them...
AMD 29K ?!?
are there to be used by programs intended to run on chips that don't
have OoO. If an implementation does have OoO, nothing is to be gained
by bothering with those registers (which still won't have rename
registers associated with them, even in an OoO implementation; so
OoO won't work on the parts of the program that use them).
Somewhat right:: given a pool of 128 (or 256) rename registers,
one can make even an 8-register machine run fast and rather
well. The thing is that a 32-register ISA has a 15%-18% speed
advantage over an 8-register machine--whereas a 64 register
machine only has a 3% speed advantage over a 32 register
machine (MIPS circa 1982). At some point not having enough
registers hurts (that somewhere is in the mid 20-s of registers)
and at some point the size of the RF limits read performance
(that somewhere is between 32 registers and 64 registers).
On Thu, 17 Jul 2025 16:28:56 +0000, Anton Ertl wrote:
John Savard writes:
With this definition of the typical implementation in each size
class of OoO, one can construct a mythical history of sorts.
In "the beginning", CISC chips had normal OoO, and RISC chips
did not have OoO.
Very mythical.
No VAX (*the* CISC at the time of RISC development) implementation had
OoO, ever.
No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
during the first 10 years of IA-32's existence.
HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
one day after IA-32.
MIPS (RISC) had an OoO implementation since January 1996, i.e., two
months after IA-32.
Since the CISC chips had register files of 8
registers,
VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.
and the RISC chips had register files of 32 registers,
ARM A32 has 16 registers.
(Given cache misses,
maybe the RISC chips still needed scoreboards.)
Apart from MIPS R2000/R3000, every RISC has waited for results to
become ready. In the beginning stopping the pipeline was a way to do
this, but once more silicon became available and performance demands
increased, other instructions often continued as far as possible.
That was often called scoreboarding, but Mitch Alsup tells us that
scoreboarding on the 66000 was something more sophisticated.
CDC 6600 (no third 0) had a SB that scheduled the start of an
instruction (at register read) and then later scheduled the completion
of an instruction (register write) in a way that prevented RAW, WAR,
and WAW hazards.
Renaming gets rid of the WAR and WAW hazards in SBs, Reservation Stations,
Dispatch stacks, and others.
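To make the three hazard names concrete, here is a minimal sketch in Python that classifies the register hazards between an earlier and a later instruction. The Instr type and the hazards function are hypothetical names made up for this example; this is not a model of the CDC 6600 scoreboard itself, only of the conditions it has to check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str      # register written
    srcs: tuple    # registers read


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the hazards 'later' has with respect to 'earlier'."""
    found = set()
    if earlier.dest in later.srcs:
        found.add("RAW")   # later reads what earlier writes (true dependence)
    if later.dest in earlier.srcs:
        found.add("WAR")   # later overwrites a register earlier still reads
    if later.dest == earlier.dest:
        found.add("WAW")   # both write the same register
    return found


if __name__ == "__main__":
    a = Instr("r1", ("r2", "r3"))
    print(hazards(a, Instr("r4", ("r1", "r5"))))
    print(hazards(a, Instr("r2", ("r6",))))
    print(hazards(a, Instr("r1", ("r7",))))
```

Only RAW reflects a real flow of data. WAR and WAW arise from reusing a register name, which is why giving each write a fresh physical register makes them disappear.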
Point of order:: all register files that have the same width (64-bits)
should be a single file. This makes varargs easier, allows using integer operations on FP operands (extract exponent, insert exponent, copysign)
which are mandated by the standards. Either you have an integer set of registers and a FP set of registers and a nearly complete set of integer operations on FP registers, or you can dispense with the nonsense and
have a single general purpose register file.
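The "integer operations on FP operands" mentioned here can be sketched in Python by treating a double as a 64-bit integer pattern. This assumes the IEEE 754 binary64 layout (1 sign bit, 11 exponent bits with bias 1023, 52 fraction bits) and only handles normal numbers; the function names are my own and do not correspond to any particular ISA's instructions.

```python
import struct

SIGN_MASK = 0x8000_0000_0000_0000
EXP_MASK = 0x7FF0_0000_0000_0000
EXP_BIAS = 1023


def f64_bits(x: float) -> int:
    """Reinterpret a double as an unsigned 64-bit integer."""
    return struct.unpack("=Q", struct.pack("=d", x))[0]


def bits_f64(b: int) -> float:
    """Reinterpret an unsigned 64-bit integer as a double."""
    return struct.unpack("=d", struct.pack("=Q", b))[0]


def extract_exponent(x: float) -> int:
    """Unbiased exponent of a normal number, via mask and shift."""
    return ((f64_bits(x) & EXP_MASK) >> 52) - EXP_BIAS


def insert_exponent(x: float, e: int) -> float:
    """Replace the exponent field of x with e (must stay in normal range)."""
    biased = e + EXP_BIAS
    if not 0 < biased < 0x7FF:
        raise ValueError("exponent outside the normal range")
    return bits_f64((f64_bits(x) & ~EXP_MASK) | (biased << 52))


def copysign_bits(magnitude: float, sign_source: float) -> float:
    """copysign implemented with AND/OR on the bit patterns."""
    return bits_f64((f64_bits(magnitude) & ~SIGN_MASK)
                    | (f64_bits(sign_source) & SIGN_MASK))


if __name__ == "__main__":
    print(extract_exponent(8.0), insert_exponent(1.5, 4),
          copysign_bits(2.5, -0.0))
```

With separate register files, each of these either needs its own FP-side instruction or a move to the integer registers and back; with a single file they are plain AND, OR and shift.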
I have evidence (data) indicating My 66000 with only 32-registers
AND universal constants needs fewer registers than RISC-V with
32 integer and 32 FP registers on many applications, including
those you think need 32+32.
On 7/17/2025 1:32 PM, John Savard wrote:
On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
I had thought the _current_ conversation was about how context switching
was made more painful by my additional sets of 128 registers.
Having separate integer and floating-point registers?
- that is what nearly everyone else does;
Although I gave a flip, and snarky, response to this before, I got to
thinking about why it is true. I came up with the following, though it
is a quick and dirty answer and I welcome others' comments and corrections.
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.
I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.
I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies.
On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.
I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.
I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies. But in the early
microprocessor era, it was clear that due to chip limitations, FP had to
be on a separate chip e.g. 8087, and the cost of crossing a chip
boundary several times for each FP instruction, which would have been
needed if there were no separate registers on the FPU, would not have
been practical.
Backward compatibility led to this decision being
promulgated to future generations. I guess Intel could have changed
when they added the non 8087 FP instructions, but by then the mind was
set.
With new, clean sheet designs this is
much less of an issue. And even for clean sheet designs, if it is
desired to have FP an optional feature, that feature would not be a
separate chip, thus eliminating that motivation for separate registers
So I believe that the arguments for separate FP registers, while once
valid, are no longer so.
On 7/17/2025 11:03 AM, John Savard wrote:
On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:
You could have an operating system that neglects to save certain
register files on interrupts, which means programs can't use them.
(There's a precedent: the Commodore 64 didn't save the status bit for
decimal mode,
so user programs couldn't use that feature of the 6502.)
Any system that is not completely transparent to interrupts is a pain in
the a$$ for user applications.
In no way am I denying this.
As you noted that having multiple register files makes context switching
slower, though, I was simply noting that... one can always just ignore
the extra register files. Making them useless, by not saving them
and restoring them in the interrupt handler, forces user programs to
avoid using those registers. And if they can't use them, they don't have
to save and restore them, and so context switching is speeded up!
John,
Mitch has made the case for a combined integer/floating point register
set. What is your case for having them separate?
Stephen Fuld writes:
On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.
In the CDC 6600, the A and B registers correspond to GPRs (they
support addresses), while the X registers correspond to FP registers;
they may also support integer operations, but not addresses.
Let's look at some other cases:
PA-RISC: FPU is not optional AFAIK, and integer multiplication at
least in early implementations uses the FP multiplier (like on
Willamette). PA-RISC started with a separate FP register set and soon extended it to 58 registers or so.
88000: Started out with a unified register set (with FP doubles
represented by two 32-bit registers), and acquired a separate FP
register set with 80-bit registers in its second implementation 88110.
Power: Separate registers from the start. This may also have to do
with the first implementation being in several chips, with the FPU in
one chip, and the FXU (integer and load/store) in another chip.
Alpha: FP never was optional. Separate registers from the start, even
though its predecessor VAX has unified registers.
So I believe that the arguments for separate FP registers, while once
valid, are no longer so.
I think that
1) the register pressure issue (for a given number of encoding bits)
is still valid.
2) Not sure if distance on the chip has become more or less of a
problem in the last years.
3) Register ports supposedly are still at a premium.
On Fri, 18 Jul 2025 08:46:43 -0700, Stephen Fuld wrote:
I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.
And assuming that to be the case, there's a smoking gun in the original System/360 architecture.
The original System/360 had only four floating-point registers. These registers weren't numbered 0, 1, 2, and 3... and they weren't numbered
0, 4, 8 and 12 either.
They were numbered 0, 2, 4 and 6.
The floating-point registers on the System/360 were 64 bits long, while
the general registers were 32 bits long.
This could suggest that the decision to make the floating-point unit
an option, with its own set of registers, instead of using pairs of
general registers for floating-point numbers, came along late in the
design process in that case.
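To spell out the inference, here is a minimal sketch in Python of what the even-only numbering would have meant had a 64-bit floating-point operand lived in a pair of 32-bit general registers. This mapping is purely hypothetical (the real System/360 has separate floating-point registers), and the function names are invented for the example.

```python
FP_REGISTER_NUMBERS = (0, 2, 4, 6)   # the only valid FP register numbers


def gpr_pair_for_fp(fp_reg: int) -> tuple:
    """Hypothetical: the even/odd general register pair an FP number names."""
    if fp_reg not in FP_REGISTER_NUMBERS:
        raise ValueError("not a valid floating-point register number")
    return (fp_reg, fp_reg + 1)


def pack_pair(high_word: int, low_word: int) -> int:
    """Combine two 32-bit register contents into one 64-bit operand."""
    return ((high_word & 0xFFFF_FFFF) << 32) | (low_word & 0xFFFF_FFFF)


if __name__ == "__main__":
    for n in FP_REGISTER_NUMBERS:
        print(n, gpr_pair_for_fp(n))
```

Under that reading, registers 0, 2, 4 and 6 are exactly the even halves of the first four even/odd pairs, which is what one would expect if the numbering was inherited from a register-pair scheme.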
I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies.
Well, since minicomputers are smaller and cheaper, nearly all of them
only had floating-point as an optional feature, except perhaps for
some machines classed as superminis.
John Savard
On Fri, 18 Jul 2025 15:39:58 +0000, Waldek Hebisch wrote:
What is the gain from a single unified instruction set if you do not
want to implement all of it, but only subsets?
This is a valid point, but then I need to clarify one important thing.
I don't want to implement _only_ subsets. Implementing a subset is
simply an option. A very useful option for many applications. But
implementing the whole instruction set for a desktop PC chip is
appropriate.
Also, if only subsets were implemented, programs written in the basic instruction set common to all the subsets would still run on all the
chips, so that means that there is a gain even in the "only subsets"
case.
John Savard
Stephen Fuld writes:
On 7/17/2025 1:32 PM, John Savard wrote:
First of all, it isn't true. Going back to the mainframe era, AFAIK, of
the major manufacturers, only IBM (S/360) had separate FP registers,
Univac, CDC and both Burroughs architectures did not.
In the CDC 6600, the A and B registers correspond to GPRs (they
support addresses), while the X registers correspond to FP registers;
they may also support integer operations, but not addresses.
I posit that the driving factor in the decision to have separate FP
registers was the decision to make FP instructions an optional feature,
i.e. an optional feature in S/360, part of the basic architecture in the
others. Apparently, adding FP operations as a separate feature made
using the existing registers just too hard to implement.
I can't comment on the mini-computer era, as I don't know enough about
the various architectures and marketing strategies. But in the early
microprocessor era, it was clear that due to chip limitations, FP had to
be on a separate chip e.g. 8087, and the cost of crossing a chip
boundary several times for each FP instruction, which would have been
needed if there were no separate registers on the FPU, would not have
been practical.
The 8087 was so slow that the cost of moving stuff over would have
been only a small fraction of the total time. However, the 8086 has 8
16-bit registers, not enough to hold even two 80-bit numbers for the
8087.
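The arithmetic behind that last remark, as a tiny Python sketch. The constant and function names are made up for the example; the figures (eight 16-bit registers, 80-bit operands) are the ones given above.

```python
GPR_COUNT = 8            # 8086 general registers
GPR_BITS = 16            # width of each register
X87_OPERAND_BITS = 80    # 8087 extended-precision operand


def fits_in_gprs(operand_count: int) -> bool:
    """Could this many 80-bit operands be held in the 8086 registers?"""
    return operand_count * X87_OPERAND_BITS <= GPR_COUNT * GPR_BITS


if __name__ == "__main__":
    print(GPR_COUNT * GPR_BITS, 2 * X87_OPERAND_BITS, fits_in_gprs(2))
```

Even using every register, pointers and stack included, the 8086 offers 128 bits of register storage against the 160 bits that two operands would need.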
Backward compatibility led to this decision being
promulgated to future generations. I guess Intel could have changed
when they added the non 8087 FP instructions, but by then the mind was
set.
You mean SSE and SSE2? Note that the XMM registers of SSEx are 128
bits in size, while the GPRs of IA-32 are 32 bits in size. And they
also support integer operations, but not addresses.
But yes, they could have expanded the GPRs to 128 bits, and let the
SSE and SSE2 instructions operate on these registers.
I think there are several reasons for having separate XMM registers:
1) Less register pressure in code that uses SSEx instructions. And
IA-32 does not have that many register names.
2) The XMM registers and the FPUs can be located separately.
3) Fewer register ports needed on each register file on superscalar implementations (i.e., all of them).
Yes, it has its costs in having to partially duplicate some integer
FUs, but they obviously thought that the benefits are worth it.
Interestingly, XMM (128 bit), YMM (256 bit), and ZMM (512 bit)
registers are not separated from each other.
Let's look at some other cases:
PA-RISC: FPU is not optional AFAIK, and integer multiplication at
least in early implementations uses the FP multiplier (like on
Willamette). PA-RISC started with a separate FP register set and soon extended it to 58 registers or so.
88000: Started out with a unified register set (with FP doubles
represented by two 32-bit registers), and acquired a separate FP
register set with 80-bit registers in its second implementation 88110.
Power: Separate registers from the start. This may also have to do
with the first implementation being in several chips, with the FPU in
one chip, and the FXU (integer and load/store) in another chip.
Alpha: FP never was optional. Separate registers from the start, even
though its predecessor VAX has unified registers.
With new, clean sheet designs this is
much less of an issue. And even for clean sheet designs, if it is
desired to have FP an optional feature, that feature would not be a
separate chip, thus eliminating that motivation for separate registers
And yet Alpha and RISC-V were designed with separate FP registers.
One thing is interesting about IA-32/AMD64 FP, and likewise on ARM
A32/T32 and ARM A64: The FP instruction sets of each are present in
the respective 32-bit and 64-bit instruction sets, which in case of
ARM differs a lot from the 32-bit instruction set.
So I believe that the arguments for separate FP registers, while once
valid, are no longer so.
I think that
1) the register pressure issue (for a given number of encoding bits)
is still valid.
2) Not sure if distance on the chip has become more or less of a
problem in the last years.
3) Register ports supposedly are still at a premium.
- anton
The very reason the CRAY-I succeeded where the STAR-100 failed was that
the CRAY-I was built around vector registers, and it did its
arithmetic between those vector registers, only loading data into
them from, and writing data out from them to, the main memory.
John Savard wrote:
The very reason the CRAY-I succeeded where the STAR-100 failed was that
the CRAY-I was built around vector registers, and it did its arithmetic
between those vector registers, only loading data into them from, and
writing data out from them to, the main memory.
"The Seymour Cray Era of Supercomputers" attributes this in large parts
to the fast scalar units of the Cray-1.