• Re: The Third Wish

    From MitchAlsup1@21:1/5 to quadibloc on Tue Jun 24 21:57:22 2025
    On Tue, 24 Jun 2025 17:19:05 +0000, quadibloc wrote:

    On Mon, 23 Jun 2025 12:43:05 +0000, quadibloc wrote:

    Having noted that I was using up just about the very last dregs of the
    available opcode space for block headers...

    I decided to dig even deeper, and create the twelfth, thirteenth, and
    fourteenth header types, which, together, make the possibilities of
    expanding the instruction set truly limitless, by allowing up to 128
    alternate instruction sets to be added to what is currently present.

    Now there are fifteen header types. The new one was added as the sixth
    header type, necessitating renumbering of those that came after.

    The sixth header type is a 64-bit header that prefixes a four-bit prefix
    to every remaining 16 bits in the instruction.

    It added the possibility of having 35-bit and 53-bit instructions.

    I saw what I could use the 35-bit instructions for right away:
    memory-to-register operate instructions that could have all 32
    registers, not just the first eight, as destinations.

    But shortly afterwards, I saw a use for the 53-bit instructions: a
    modified form of the string and packed decimal instructions that could
    use conventional addresses with 16-bit displacements, not just the
    alternate types of address with shorter displacements.

    And then when I stepped back and saw what I had achieved...

    it hit me what one more thing I needed to add to complete it.

    So one unused code in the three-bit prefixes used in the Type III
    header, carried over to the four-bit prefixes here, was now given a
    purpose. It was used to indicate a "Special 16-bit instruction".

    This was like the operate instructions in the 17-bit short instructions,
    with a full 5-bit source register and destination register field. But
    there were only six bits for the opcode. No other operations were
    provided, as 17-bit short instructions already provide those.

    What operations are added to the short instruction repertoire by the
    special 16-bit instructions?

    Mainly floating-point instructions which use the Compatible
    floating-point format.

    You know, the one with a sign bit, a seven bit power-of-16 exponent, and
    a mantissa that can be thought of as being made up of hexadecimal
    digits.
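
    For readers unfamiliar with that layout, here is a minimal Python
    sketch of decoding a 32-bit value in it. The excess-64 exponent bias
    and the 24-bit fraction width are assumptions borrowed from the IBM
    System/360 short format, not details stated in this thread, and the
    function name is hypothetical.

```python
def decode_hex_float32(word: int) -> float:
    """Decode a 32-bit hexadecimal floating-point word.

    Assumed layout (System/360 short format): 1 sign bit,
    7-bit power-of-16 exponent, 24-bit fraction (six hex digits).
    """
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 24) & 0x7F      # seven-bit power-of-16 exponent
    fraction = word & 0xFFFFFF          # six hexadecimal digits
    # Assumption: the exponent is stored excess-64 and the fraction is
    # read as 0.hhhhhh in base 16.
    return sign * (fraction / float(1 << 24)) * 16.0 ** (exponent - 64)
```

    Under those assumptions, the word 0x41100000 has exponent 65 and
    fraction 1/16, so it decodes to 1.0.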

    One other instruction is also included, as I don't need the whole
    six-bit opcode for these instructions: the Save Return Address
    instruction. I already had a way of doing this, a jump to subroutine
    instruction with relative addressing and a displacement of zero, but
    that's 32 bits long.

    So with the 25% overhead of the Type VI header... I now have achieved
    almost complete isomorphism with a popular make of computer... that is,
    the ability to perform the same function as any of its instructions with
    an instruction that is, sort of, the same length. Thus simplifying
    program conversion.

    That and $5 will buy you a cup of coffee--until you write a compiler
    and start counting how many/few instructions you need to perform
    applications.

    Of course, I'm sure there are a few instructions not covered, but this handles at least the bunch of its ordinary non-privileged instructions.

    For the record, My 66000 has a single instruction that has any notion
    of privilege. How many do you think you will need ?!?

    So from what seemed to be wretched excess that would risk making me dissatisfied with this iteration of the architecture, I have instead
    achieved something that makes me very satisfied and reluctant to move
    on.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From John Savard@21:1/5 to All on Wed Jun 25 07:31:55 2025
    On Tue, 24 Jun 2025 21:57:22 +0000, MitchAlsup1 wrote:
    On Tue, 24 Jun 2025 17:19:05 +0000, quadibloc wrote:

    So with the 25% overhead of the Type VI header... I now have achieved
    almost complete isomorphism with a popular make of computer... that is,
    the ability to perform the same function as any of its instructions
    with an instruction that is, sort of, the same length. Thus simplifying
    program conversion.

    That and $5 will buy you a cup of coffee--until you write a compiler and start counting how many/few instructions you need to perform
    applications.

    Well, once I get the spec frozen, I'll be closer to being able to do such things.

    Of course, I'm sure there are a few instructions not covered, but this
    handles at least the bunch of its ordinary non-privileged instructions.

    For the record, My 66000 has a single instruction that has any notion of privilege. How many do you think you will need ?!?

    As I haven't tried to improve or simplify this part of computer
    architecture, presumably as many as any typical modern computer
    architecture, like z/Architecture or x86.

    Incidentally, the sixth header type is now the fourth; this was in order
    to give it extra opcode space, so that I could add one bit to the header
    with an additional function. If this header type allows me to imitate the instruction set of S/360, I have a bit to imitate an aspect of its
    register layout so as to facilitate communication with actual S/360 code running in emulation on the same chip.

    John Savard

  • From John Savard@21:1/5 to All on Fri Jun 27 05:28:47 2025
    An additional minor tweak has been added to the architecture.

    Since there is a header type (currently type VIII) which allows
    full VLIW functionality with variable-length instructions, and
    that header type had several spare bits available within it,
    one has now been purposed to indicate a modification to the
    17-bit short instructions which are available when variable-length
    instructions are used.

    The architecture includes extended register banks with 128 registers
    in each of them, in addition to the regular register banks with 32
    registers. VLIW-style code is the kind of code likely to find using
    the extended registers to be of benefit; so, now, selecting the
    alternate form of the 17-bit instructions changes the normal 17-bit
    instructions that work with the normal banks of 32 registers into ones
    that work with the extended banks of 128 registers - thus allowing
    more threads to be interleaved, thus keeping dependent instructions
    further away from each other, just the thing for allowing more
    instructions to execute in parallel.

    Also added is a colorful diagram making it plain what I am up to
    when I modify floating-point register usage with the Emulation bit
    in headers of Type IV.

    John Savard

  • From John Savard@21:1/5 to All on Wed Jul 2 05:16:13 2025
    I have noted that the amount of opcode space available for headers
    had been needlessly compromised because of a faulty ordering of
    memory-reference instructions, so I re-ordered them to make the
    available space less constricting.

    As well, I have had complaints that my use of base-index addressing
    as opposed to the plain addressing of actual RISC instruction sets
    is not really helpful, since the value placed in the index register
    often has to be calculated in any case, and so only one arithmetic
    step is saved, not the need for calculation.

    Therefore, I have placed a new header in the third position in my
    list of headers which is 16 bits long, and which divides the remainder
    of the block into five 48-bit areas. These may contain either two
    24-bit instructions, which are operate instructions of the RISC type,
    or three 15-bit instructions, of my own special short instruction type,
    or one such 15-bit instruction, and a 30-bit memory-reference
    instruction of the typical RISC type.

    Thus not only is the classic RISC style of programming accommodated, if
    perhaps still imperfectly, but another avenue is provided to make
    programs at least slightly more compact, with ten instructions placed
    in the space of eight.
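
    The bit budget behind "ten instructions in the space of eight" can
    be checked with a short Python sketch. The 256-bit block size is
    inferred here from 16 + 5 * 48; the names are hypothetical, and how
    the spare bits in an area are used is not stated in the post.

```python
HEADER_BITS = 16
AREA_BITS = 48
AREAS = 5
BLOCK_BITS = HEADER_BITS + AREAS * AREA_BITS   # inferred block size

# The three area layouts named in the post, as lists of instruction widths.
LAYOUTS = {
    "two 24-bit operate": [24, 24],
    "three 15-bit short": [15, 15, 15],
    "15-bit short + 30-bit memory-reference": [15, 30],
}

def spare_bits(widths):
    """Bits of a 48-bit area left over once the instructions are packed."""
    return AREA_BITS - sum(widths)

def max_instructions(widths):
    """Instructions per block if every area uses the same layout."""
    return AREAS * len(widths)

# Number of plain 32-bit instructions that fit in the same block.
BASELINE_32BIT = BLOCK_BITS // 32
```

    With these numbers, five areas of two 24-bit instructions give ten
    instructions in a block that would otherwise hold eight 32-bit ones,
    and the two layouts built on 15-bit instructions each leave three
    bits of an area unaccounted for.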

    John Savard

  • From John Savard@21:1/5 to John Savard on Wed Jul 2 07:04:06 2025
    On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

    Thus not only is the classic RISC style of programming accommodated, if
    perhaps still imperfectly, but another avenue is provided to make
    programs at least slightly more compact, with ten instructions placed in
    the space of eight.

    I have found a way to address the imperfections, so this new header now
    allows porting RISC code to Concertina II with little conversion.

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 3 05:52:00 2025
    On Wed, 02 Jul 2025 07:04:06 +0000, John Savard wrote:
    On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

    Thus not only is the classic RISC style of programming accommodated, if
    perhaps still imperfectly, but another avenue is provided to make
    programs at least slightly more compact, with ten instructions placed
    in the space of eight.

    I have found a way to address the imperfections, so this new header now allows porting RISC code to Concertina II with little conversion.

    My bizarre version of RISC code, in which memory-reference instructions
    are 30 bits long, while operate instructions are 24 bits long, has now
    been accompanied by a different instruction format requiring another form
    of the new Type III header.

    This one involves dividing the block into three parts instead of five, and
    each of those three parts may contain three instructions that are 27 bits
    long or thereabouts.

    I think it will look familiar, and will represent the absolute height of insanity, the temptation to add to Concertina II being too strong for me
    to resist.

    John Savard

  • From Thomas Koenig@21:1/5 to Stephen Fuld on Thu Jul 3 06:59:37 2025
    Stephen Fuld wrote:
    On 7/2/2025 10:52 PM, John Savard wrote:

    I think it will look familiar, and will represent the absolute height of
    insanity, the temptation to add to Concertina II being too strong for me
    to resist.

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc. :-)

    Have you figured out how much combinatorial logic and how many gate
    delays it will take to decode all of them? That might help to limit your "insanity".

    That is an excellent idea. John, if you write down the Boolean
    equations or the truth tables for your instruction decoding, then
    try to simplify them with espresso or a tool which does multi-level
    logic optimization like Berkeley ABC, you will get a much better
    idea of how complicated your design actually is. ABC also does some
    delay calculations for you if you map your design to a library.
    Highly instructive.
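
    As a rough illustration of the first step, here is a Python sketch
    that writes a truth table in the PLA text format that espresso
    reads. The decode function is a made-up toy, not Concertina II's
    real encoding, and all names are hypothetical.

```python
from itertools import product

def toy_decode(bits):
    """Hypothetical 4-bit prefix -> one-hot length class (toy example only)."""
    b3, b2, b1, b0 = bits
    if b3 == 0:
        return (1, 0, 0)    # "short" class
    if b2 == 0:
        return (0, 1, 0)    # "32-bit" class
    return (0, 0, 1)        # "longer" class

def to_pla(func, n_inputs, n_outputs):
    """Enumerate every input pattern and emit one PLA row per pattern."""
    rows = []
    for bits in product((0, 1), repeat=n_inputs):
        outs = func(bits)
        rows.append("".join(map(str, bits)) + " " + "".join(map(str, outs)))
    lines = [f".i {n_inputs}", f".o {n_outputs}", f".p {len(rows)}", *rows, ".e"]
    return "\n".join(lines) + "\n"

pla_text = to_pla(toy_decode, 4, 3)
```

    Saving pla_text to a file and passing that file to espresso should
    give a minimized cover; the number of product terms that survive is
    one crude measure of decode complexity.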

    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From Stephen Fuld@21:1/5 to John Savard on Wed Jul 2 22:57:19 2025
    On 7/2/2025 10:52 PM, John Savard wrote:
    On Wed, 02 Jul 2025 07:04:06 +0000, John Savard wrote:
    On Wed, 02 Jul 2025 05:16:13 +0000, John Savard wrote:

    Thus not only is the classic RISC style of programming accommodated, if
    perhaps still imperfectly, but another avenue is provided to make
    programs at least slightly more compact, with ten instructions placed
    in the space of eight.

    I have found a way to address the imperfections, so this new header now
    allows porting RISC code to Concertina II with little conversion.

    My bizarre version of RISC code, in which memory-reference instructions
    are 30 bits long, while operate instructions are 24 bits long, has now
    been accompanied by a different instruction format requiring another form
    of the new Type III header.

    This one involves dividing the block into three parts instead of five, and each of those three parts may contain three instructions that are 27 bits long or thereabouts.

    I think it will look familiar, and will represent the absolute height of insanity, the temptation to add to Concertina II being too strong for me
    to resist.

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc. :-)

    Have you figured out how much combinatorial logic and how many gate
    delays it will take to decode all of them? That might help to limit your "insanity".


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From John Savard@21:1/5 to All on Thu Jul 3 11:35:33 2025
    Without an explicit indication of parallelism, the real
    theoretical maximum is sixteen-way, with 14-bit instructions
    in the case without headers, and there's nothing much I can
    do to improve the ease of access to that.

    John Savard

  • From John Savard@21:1/5 to Thomas Koenig on Thu Jul 3 09:43:55 2025
    On Thu, 03 Jul 2025 06:59:37 +0000, Thomas Koenig wrote:
    Stephen Fuld wrote:

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc. :-)

    Have you figured out how much combinatorial logic and how many gate
    delays it will take to decode all of them? That might help to limit
    your "insanity".

    That is an excellent idea. John, if you write down the Boolean equations or
    the truth tables for your instruction decoding, then try to simplify
    them with espresso or a tool which does multi-level logic optimization
    like Berkeley ABC, you will get a much better idea of how complicated
    your design actually is. ABC also does some delay calculations for you
    if you map your design to a library. Highly instructive.

    In any case, until sanity overtakes me, and I remove this new feature
    from the Concertina II design, I have modified it to add instructions
    which make use of the extended register banks. I mean, really: how can
    I possibly omit the most important attribute required to give this
    instruction format its rightful Itanium nature?

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 3 11:24:31 2025
    On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

    In any case, until sanity overtakes me, and I remove this new feature
    from the Concertina II design, I have modified it to add instructions
    which make use of the extended register banks. I mean, really: how can I possibly omit the most important attribute required to give this
    instruction format its rightful Itanium nature?

    Also, this exercise had a useful consequence. Adding a new instruction
    format that made it easy to achieve code that can take advantage of
    nine-way superscalar operation led me to review what the rest of the instruction set was doing.

    A previous addition made ten-way superscalar operation possible, but
    without any explicit indication of parallelism to promote it.

    With 17-bit short instructions, fourteen-way superscalar operation can
    be called upon without an explicit indication of parallelism; with one,
    though, that drops to eleven-way.

    But the maximum of 14-way could only be called upon with 14-bit
    instructions in the case where the pairs of instructions, at least,
    had an explicit indication of parallelism. I had enough opcode space
    available so that I was able to improve this to also use 15-bit
    instructions, to at least make it slightly more likely that the full superscalar power potentially available in this design could be used.

    John Savard

  • From John Savard@21:1/5 to Stephen Fuld on Tue Jul 15 22:24:14 2025
    On Wed, 02 Jul 2025 22:57:19 -0700, Stephen Fuld wrote:

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc.

    Well, I have now removed the pseudo-RISC and imitation Itanium
    instruction formats. Instead, I've added 18-bit short instructions,
    which I feel are a more appropriate and effective way of reaching
    the goal of improving the ability of programs to make use of the
    superscalar potential of implementations of the architecture.

    John Savard

  • From John Savard@21:1/5 to John Savard on Wed Jul 16 00:00:47 2025
    On Tue, 15 Jul 2025 22:24:14 +0000, John Savard wrote:

    On Wed, 02 Jul 2025 22:57:19 -0700, Stephen Fuld wrote:

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc.

    Well, I have now removed the pseudo-RISC and imitation Itanium
    instruction formats. Instead, I've added 18-bit short instructions,
    which I feel are a more appropriate and effective way of reaching the
    goal of improving the ability of programs to make use of the superscalar potential of implementations of the architecture.

    Also, it might be noted that I did go to some lengths to ensure that all
    the instructions in a block could be decoded in parallel, to help deal
    with the gate delays involved in instruction decoding. Of course, with
    that step pipelined, the delay is only felt once per branch rather than
    for every instruction.

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions - and
    the 18-bit instructions fell short of my goal in one important respect.
    So I think they will be added soon as well.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 01:11:24 2025
    On Thu, 3 Jul 2025 11:35:33 +0000, John Savard wrote:

    Without an explicit indication of parallelism, the real
    theoretical maximum is sixteen-way, with 14-bit instructions
    in the case without headers, and there's nothing much I can
    do to improve the ease of access to that.

    As noted above, I am doing 16-way decode on variable length
    instructions with no marking bits.

    John Savard

  • From John Savard@21:1/5 to John Savard on Wed Jul 16 00:22:39 2025
    On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions - and
    the 18-bit instructions fell short of my goal in one important respect.
    So I think they will be added soon as well.

    I thought I was going to have to postpone it until I got back from doing
    an errand for a friend, but this was so simple and quick an addition that
    I have managed to post it to the pages, specifically:

    http://www.quadibloc.com/arch/ct23int.htm
    http://www.quadibloc.com/arch/cad0101.htm

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 01:08:19 2025
    On Thu, 3 Jul 2025 11:24:31 +0000, John Savard wrote:

    On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

    In any case, until sanity overtakes me, and I remove this new feature
    from the Concertina II design, I have modified it to add instructions
    which make use of the extended register banks. I mean, really: how can I
    possibly omit the most important attribute required to give this
    instruction format its rightful Itanium nature?

    Also, this exercise had a useful consequence. Adding a new instruction
    format that made it easy to achieve code that can take advantage of
    nine-way superscalar operation led me to review what the rest of the instruction set was doing.

    A previous addition made ten-way superscalar operation possible, but
    without any explicit indication of parallelism to promote it.

    With 17-bit short instructions, fourteen-way superscalar operation can
    be called upon without an explicit indication of parallelism; with one, though, that drops to eleven-way.

    We cannot believe that until you produce a compiler.

    OH and btw, I can achieve 16-way decode parallelism with a variable
    length encoding and nothing that marks any kind of instruction
    boundary--AND--I have compiler, linker, ...

    Nor do I have a zillion instructions, I only have 63 patterns to
    recognize.

    But the maximum of 14-way could only be called upon with 14-bit
    instructions in the case where the pairs of instructions, at least,
    had an explicit indication of parallelism. I had enough opcode space available so that I was able to improve this to also use 15-bit
    instructions, to at least make it slightly more likely that the full superscalar power potentially available in this design could be used.

    John Savard

  • From John Savard@21:1/5 to John Savard on Wed Jul 16 01:49:50 2025
    On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions -
    and the 18-bit instructions fell short of my goal in one important
    respect. So I think they will be added soon as well.

    I thought I was going to have to postpone it until I got back from doing
    an errand for a friend, but this was so simple and quick an addition
    that I have managed to post it to the pages, specifically:

    http://www.quadibloc.com/arch/ct23int.htm
    http://www.quadibloc.com/arch/cad0101.htm

    But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
    definitely symptomatic of the issue you've identified.

    John Savard

  • From John Savard@21:1/5 to John Savard on Wed Jul 16 12:26:29 2025
    On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions - and
    the 18-bit instructions fell short of my goal in one important respect.
    So I think they will be added soon as well.

    Not only have I done that, but now, by dint of a great effort, I've
    squeezed out enough opcode space to get the header format I _really_
    wanted that allows 14-way superscalar operation when only instructions
    of the alternate 17-bit short instruction format are needed in a given
    block.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 18:06:59 2025
    On Wed, 16 Jul 2025 1:49:50 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions -
    and the 18-bit instructions fell short of my goal in one important
    respect. So I think they will be added soon as well.

    I thought I was going to have to postpone it until I got back from doing
    an errand for a friend, but this was so simple and quick an addition
    that I have managed to post it to the pages, specifically:

    http://www.quadibloc.com/arch/ct23int.htm
    http://www.quadibloc.com/arch/cad0101.htm

    But having 14, 15, 16, 17, 18 and 19 bit long short instructions is definitely symptomatic of the issue you've identified.

    John Savard

    Conversing with yourself or arguing with yourself ?? !!

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jul 16 18:09:30 2025
    On Thu, 3 Jul 2025 6:59:37 +0000, Thomas Koenig wrote:

    Stephen Fuld wrote:
    On 7/2/2025 10:52 PM, John Savard wrote:

    I think it will look familiar, and will represent the absolute height of
    insanity, the temptation to add to Concertina II being too strong for me
    to resist.

    By my count, you have just under a "gazillion" instructions, instruction
    formats, etc. :-)

    Have you figured out how much combinatorial logic and how many gate
    delays it will take to decode all of them? That might help to limit your
    "insanity".

    That is an excellent idea. John, if you write down the Boolean
    equations or the truth tables for your instruction decoding, then

    About 25 lines of Verilog for My 66000 (the logic not the data structure definitions)

    try to simplify them with espresso or a tool which does multi-level
    logic optimization like Berkeley ABC, you will get a much better
    idea of how complicated your design actually is. ABC also does some
    delay calculations for you if you map your design to a library.
    Highly instructive.

  • From Thomas Koenig@21:1/5 to John Savard on Wed Jul 16 17:33:41 2025
    John Savard wrote:
    On Wed, 16 Jul 2025 00:22:39 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 00:00:47 +0000, John Savard wrote:

    Also, I've noted that while I had to somewhat re-organize the block
    headers for 18-bit short instructions, I could have just used other
    option values for an existing format for 19-bit short instructions -
    and the 18-bit instructions fell short of my goal in one important
    respect. So I think they will be added soon as well.

    I thought I was going to have to postpone it until I got back from doing
    an errand for a friend, but this was so simple and quick an addition
    that I have managed to post it to the pages, specifically:

    http://www.quadibloc.com/arch/ct23int.htm
    http://www.quadibloc.com/arch/cad0101.htm

    But having 14, 15, 16, 17, 18 and 19 bit long short instructions is definitely symptomatic of the issue you've identified.

    Who is "you"? You didn't quote anybody else but yourself in there.

    I'm a bit mystified...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.

  • From MitchAlsup1@21:1/5 to All on Wed Jul 16 18:22:10 2025
    On Wed, 16 Jul 2025 1:08:19 +0000, MitchAlsup1 wrote:

    On Thu, 3 Jul 2025 11:24:31 +0000, John Savard wrote:

    On Thu, 03 Jul 2025 09:43:55 +0000, John Savard wrote:

    In any case, until sanity overtakes me, and I remove this new feature
    from the Concertina II design, I have modified it to add instructions
    which make use of the extended register banks. I mean, really: how can I
    possibly omit the most important attribute required to give this
    instruction format its rightful Itanium nature?

    Also, this exercise had a useful consequence. Adding a new instruction
    format that made it easy to achieve code that can take advantage of
    nine-way superscalar operation led me to review what the rest of the
    instruction set was doing.

    A previous addition made ten-way superscalar operation possible, but
    without any explicit indication of parallelism to promote it.

    With 17-bit short instructions, fourteen-way superscalar operation can
    be called upon without an explicit indication of parallelism; with one,
    though, that drops to eleven-way.

    We cannot believe that until you produce a compiler.

    OH and btw, I can achieve 16-way decode parallelism with a variable
    length encoding and nothing that marks any kind of instruction boundary--AND--I have compiler, linker, ...

    Nor do I have a zillion instructions, I only have 63 patterns to
    recognize.

    I should expand on this::
    There are 4 groups of instructions where we use the top 2-bits of
    the major OpCode::

    00 OpCode extensions
    01 Control transfer group
    10 Memory reference with 16-bit displacement
    11 Calculation with 16-bit immediate

    Of these::

    000 Predication and shifts of constant (saves imm16 space 12-bits imm)
    001 is the only group that has variable length
    010 is LOOP
    011 is conventional branch
    100 is LDs with disp16
    101 is STs with disp16
    110 is integer with imm16
    111 is logical with imm16

    All very RISC-like at this point. In the VLE group; Inst<15:13,11>
    provide all the bits for operand routing and for VLE instruction
    length--they are mashed up together to reduce entropy.
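
    The two tables above can be read as a lookup on the top bits of the
    major OpCode. A minimal Python sketch, assuming a 6-bit major
    OpCode field (the width is an assumption, not stated in this post)
    and using hypothetical names:

```python
# Top two bits of the major OpCode select one of four groups.
TOP2_GROUPS = {
    0b00: "OpCode extensions",
    0b01: "Control transfer group",
    0b10: "Memory reference with 16-bit displacement",
    0b11: "Calculation with 16-bit immediate",
}

# Top three bits refine each group, as listed in the post.
TOP3_SUBGROUPS = {
    0b000: "Predication and shifts of constant",
    0b001: "variable-length group",
    0b010: "LOOP",
    0b011: "conventional branch",
    0b100: "LDs with disp16",
    0b101: "STs with disp16",
    0b110: "integer with imm16",
    0b111: "logical with imm16",
}

def classify(major_opcode: int, width: int = 6):
    """Return (group, subgroup) for a major OpCode of the assumed width."""
    top3 = (major_opcode >> (width - 3)) & 0b111
    return TOP2_GROUPS[top3 >> 1], TOP3_SUBGROUPS[top3]
```

    Only the 001 subgroup needs any further length decoding; every
    other pattern is a fixed-length instruction.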

    But the maximum of 14-way could only be called upon with 14-bit
    instructions in the case where the pairs of instructions, at least,
    had an explicit indication of parallelism. I had enough opcode space
    available so that I was able to improve this to also use 15-bit
    instructions, to at least make it slightly more likely that the full
    superscalar power potentially available in this design could be used.

    John Savard

  • From John Savard@21:1/5 to Thomas Koenig on Wed Jul 16 23:24:52 2025
    On Wed, 16 Jul 2025 17:33:41 +0000, Thomas Koenig wrote:
    John Savard wrote:

    But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
    definitely symptomatic of the issue you've identified.

    Who is "you"? You didn't quote anybody else but yourself in there.

    Whoever noted I had zillions of instructions and instruction formats.

    John Savard

  • From John Savard@21:1/5 to Thomas Koenig on Wed Jul 16 23:26:53 2025
    On Wed, 16 Jul 2025 17:33:41 +0000, Thomas Koenig wrote:
    John Savard wrote:

    But having 14, 15, 16, 17, 18 and 19 bit long short instructions is
    definitely symptomatic of the issue you've identified.

    Who is "you"? You didn't quote anybody else but yourself in there.

    Stephen Fuld.

    John Savard

  • From John Savard@21:1/5 to All on Wed Jul 16 23:36:11 2025
    On Wed, 16 Jul 2025 01:08:19 +0000, MitchAlsup1 wrote:

    OH and btw, I can achieve 16-way decode parallelism with a variable
    length encoding and nothing that marks any kind of instruction boundary--AND--I have compiler, linker, ...

    Nor do I have a zillion instructions, I only have 63 patterns to
    recognize.

    I certainly acknowledge that I'm not as good as you at this sort
    of thing.

    Theoretically, because the architecture involves separate banks of
    floating-point and integer registers, and there are both regular banks
    with 32 registers, and extended banks with 128 registers, and
    instruction formats that divide these registers into eight-register
    groups (sort of like a register window, but not quite)... if it weren't
    for the fact that I envisage only fetching 256 bits from memory in any
    given cycle (of course, within loops, one can get instructions from
    internal cache) this theoretically allows for 40-way superscalar
    operation.

    In practice, I doubt that anyone would write code, even carefully by
    hand, that would even manage 14-way superscalar operation for very
    long, so I admit it's unlikely to be terribly useful to include this in
    most implementations.

    The ISA is designed, though, so that (except for its immense bloat) it
    could be used in a special-purpose CPU without OoO that's designed for
    some kind of embedded use in, say, image processing or something where
    it could be given that kind of specialized code to run.

    A CPU designed instead for use in a desktop workstation would presumably
    have microarchitectural capabilities appropriate to that application.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jul 16 23:58:41 2025
    On Wed, 16 Jul 2025 23:36:11 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 01:08:19 +0000, MitchAlsup1 wrote:

    OH and btw, I can achieve 16-way decode parallelism with a variable
    length encoding and nothing that marks any kind of instruction
    boundary--AND--I have compiler, linker, ...

    Nor do I have a zillion instructions, I only have 63 patterns to
    recognize.

    I certainly acknowledge that I'm not as good as you at this sort
    of thing.

    Theoretically, because the architecture involves separate banks of
    floating-point and integer registers, and there are both regular banks
    with 32 registers, and extended banks with 128 registers, and

    Point of order:: all register files that have the same width (64-bits)
    should be a single file. This makes varargs easier, allows using integer
    operations on FP operands (extract exponent, insert exponent, copysign)
    which are mandated by the standards. Either you have an integer set of
    registers and a FP set of registers and a nearly complete set of integer
    operations on FP registers, or you can dispense with the nonsense and
    have a single general purpose register file.

    I have evidence (data) indicating My 66000 with only 32-registers
    AND universal constants needs fewer registers than RISC-V with
    32 integer and 32 FP registers on many applications, including
    those you think need 32+32.

    instruction formats that divide these registers into eight-register
    groups (sort of like a register window, but not quite)... if it weren't
    for the fact that I envisage only fetching 256 bits from memory in any
    given cycle (of course, within loops, one can get instructions from
    internal cache) this theoretically allows for 40-way superscalar
    operation.

    I have always been cynical about this partition.

    In practice, I doubt that anyone would write code, even carefully by
    hand, that would even manage 14-way superscalar operation for very
    long, so I admit it's unlikely to be terribly useful to include this in
    most implementations.

    That is why VLIW is failing or has failed.

    The ISA is designed, though, so that (except for its immense bloat) it
    could be used in a special-purpose CPU without OoO that's designed for
    some kind of embedded use in, say, image processing or something where
    it could be given that kind of specialized code to run.

    LoL.

    A CPU designed instead for use in a desktop workstation would presumably
    have microarchitectural capabilities appropriate to that application.

    Like fast context switches, which multiple register files PREVENTS !!

    John Savard

  • From John Savard@21:1/5 to All on Thu Jul 17 03:42:08 2025
    On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

    Point of order:: all register files that have the same width (64-bits)
    should be a single file.

    This relates to a point that occurred to me.

    Many CISC microprocessors had register banks of eight registers.

    RISC had register banks of 32 registers, which they thought would
    avoid the need for OoO. Increasing performance demands, though,
    made that no longer true.

    Well, then, the extended register banks with 128 registers in them...

    are there to be used by programs intended to run on chips that don't
    have OoO. If an implementation does have OoO, nothing is to be gained
    by bothering with those registers (which still won't have rename
    registers associated with them, even in an OoO implementation; so
    OoO won't work on the parts of the program that use them).

    Like fast context switches, which multiple register files PREVENTS !!

    You could have an operating system that neglects to save certain register
    files on interrupts, which means programs can't use them. (There's a
    precedent: the Commodore 64 didn't save the status bit for decimal mode,
    so user programs couldn't use that feature of the 6502.)

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 04:01:44 2025
    On Thu, 17 Jul 2025 03:42:08 +0000, John Savard wrote:
    On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

    Point of order:: all register files that have the same width (64-bits)
    should be a single file.

    This relates to a point that occurred to me.

    Many CISC microprocessors had register banks of eight registers.

    RISC had register banks of 32 registers, which they thought would avoid
    the need for OoO. Increasing performance demands, though,
    made that no longer true.

    Well, then, the extended register banks with 128 registers in them...

    are there to be used by programs intended to run on chips that don't
    have OoO. If an implementation does have OoO, nothing is to be gained
    by bothering with those registers (which still won't have rename
    registers associated with them, even in an OoO implementation; so
    OoO won't work on the parts of the program that use them).

    And there are other things related to this that have occurred to me.

    You've used the term "GBOoO" - Great Big out-of-order - to describe
    the current offerings of companies like Intel and AMD.

    You hadn't formally defined the term, at least not in any post that
    I've noticed. For purposes of discussion below, I'm going to provide
    a definition which may not correspond to what you were intending.

    This definition is:

    In "normal" out-of-order, each register has three rename registers
    associated with it, for a total of 4.

    In "big" out-of-order, each register has fifteen rename registers
    associated with it, for a total of 16.

    In "great big" out of order, each register has sixty-three rename
    registers associated with it, for a total of 64.

    With this definition of the typical implementation in each size
    class of OoO, one can construct a mythical history of sorts.

    In "the beginning", CISC chips had normal OoO, and RISC chips
    did not have OoO. Since the CISC chips had register files of 8
    registers, and the RISC chips had register files of 32 registers,
    the two were equivalent in performance. (Given cache misses,
    maybe the RISC chips still needed scoreboards.)

    And then the RISC chips got "normal" OoO, and to keep up, the
    CISC chips got "big" OoO.

    This is the stage we would be at when I say that in my design,
    the 32-register normal register banks would have OoO, but the
    128-register extended register banks wouldn't.

    If things have progressed further, so that RISC chips have "big"
    OoO and CISC chips have "great big" OoO, by the definitions I've
    given above, then my design, to keep up, would have to provide
    "big" OoO for the normal register files, but only "normal" OoO
    for the extended register files.

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 04:10:18 2025
    On Thu, 17 Jul 2025 04:01:44 +0000, John Savard wrote:

    If things have progressed further, so that RISC chips have "big" OoO and
    CISC chips have "great big" OoO, by the definitions I've given above,
    then my design, to keep up, would have to provide "big" OoO for the
    normal register files, but only "normal" OoO for the extended register
    files.

    I think that a clarification is in order here.

    I don't know, but I strongly suspect, that what you term
    "Great Big out of order" is what I've called just "big"
    out-of-order, and that this is already well past the point
    of diminishing returns.

    So that what I've called "great big" out of order is
    instead something so far past the point of sanity that
    I don't have to worry about it happening in real life.

    But I could be wrong.

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 04:24:18 2025
    On Thu, 17 Jul 2025 04:10:18 +0000, John Savard wrote:

    I think that a clarification is in order here.

    And, come to think of it, _another_ clarification may be needed.

    If a register file with "normal" out-of-order, three rename
    registers for each register, and 32 registers to a register bank, is

    "equivalent" in performance to a register file with 128 registers and
    no OoO support,

    then the latter provides no performance benefit, so what is it there
    for?

    Someone might ask that who wasn't following my discussion of the
    Concertina II design. So I think I had better re-iterate the point:

    What the 128-register extended register files are _for_ is to
    provide better performance on implementations that don't have OoO
    at all, for any of the registers. They're also present on
    implementations with OoO for *compatibility* reasons.

    This is what may not be clear to some.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 14:59:27 2025
    On Thu, 17 Jul 2025 4:01:44 +0000, John Savard wrote:

    On Thu, 17 Jul 2025 03:42:08 +0000, John Savard wrote:
    On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

    Point of order:: all register files that have the same width (64-bits)
    should be a single file.

    This relates to a point that occurred to me.
    <snip>

    And there are other things related to this that have occurred to me.

    You've used the term "GBOoO" - Great Big out-of-order - to describe
    the current offerings of companies like Intel and AMD.

    You hadn't formally defined the term, at least not in any post that
    I've noticed. For purposes of discussion below, I'm going to provide
    a definition which may not correspond to what you were intending.

    This definition is:

    In "normal" out-of-order, each register has three rename registers
    associated with it, for a total of 4.

    The in-order Mc 88110 had 2 rename registers per register.

    In "big" out-of-order, each register has fifteen rename registers
    associated with it, for a total of 16.

    In "great big" out of order, each register has sixty-three rename
    registers associated with it, for a total of 64.

    Mc 88120 has 96 rename registers and 32 architectural registers
    in a pool of 128. Rk could be renamed 96 times--so, if you wrote
    code using a single register, it could still fill the execution
    window with work.

    This is what I mean by GBOoO::
    Instructions can be issued out of DECODE into instruction queues
    until an instruction queue becomes full and another instruction
    needs that queue;
    AND
    The register rename pool continues to have register renames available.

    So DECODE does not stall until it runs out of register renames and
    still has instruction queue entries to capture issued instructions.

    This was the Mc 88120 design point by 1992.
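
    The stall condition described above can be sketched as a toy model.
    This is a minimal Python sketch, not the Mc 88120's actual logic: the
    queue names, the per-queue capacity, and the function name
    issue_until_stall are assumptions made for illustration, and
    retirement (which would hand renames back to the pool) is left out.
    Only the figure of 96 renames comes from the post.

```python
def issue_until_stall(instructions, rename_pool_size, queue_capacity):
    """Issue instructions from DECODE until a resource runs out.

    instructions: list of queue names, one per instruction. Each
    instruction is assumed to write one destination register, so it
    consumes exactly one rename (an assumption of this sketch).
    Returns (number issued, reason DECODE stopped).
    """
    free_renames = rename_pool_size
    queue_used = {}
    issued = 0
    for queue in instructions:
        if free_renames == 0:
            return issued, "rename pool empty"
        if queue_used.get(queue, 0) == queue_capacity:
            return issued, f"queue {queue} full"
        free_renames -= 1
        queue_used[queue] = queue_used.get(queue, 0) + 1
        issued += 1
    return issued, "ran out of instructions"


# Big queue: DECODE only stops when the 96 renames are used up.
print(issue_until_stall(["alu"] * 200, 96, 128))
# Small queues (hypothetical 16 entries): a full queue stops DECODE first.
print(issue_until_stall(["alu", "mem"] * 100, 96, 16))
```

    With a large enough queue the first call stops only when the renames
    run out; with small queues the second call stops on a full queue
    while renames are still available.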

    With this definition of the typical implementation in each size
    class of OoO, one can construct a mythical history of sorts.

    In "the beginning", CISC chips had normal OoO, and RISC chips
    did not have OoO. Since the CISC chips had register files of 8
    registers, and the RISC chips had register files of 32 registers,
    the two were equivalent in performance. (Given cache misses,
    maybe the RISC chips still needed scoreboards.)

    I think it is fairer to say that with big compiler programmable
    register files, CISC needed GBOoO before RISC did.

    Secondarily, CISC was supported by big-$$$ cash flow and could
    afford the silicon costs and design team costs before RISCs
    could.

    And then the RISC chips got "normal" OoO, and to keep up, the
    CISC chips got "big" OoO.

    As noted above, Mc 88120 was GBOoO in design in 1992 targeting
    1995 for production. The PowerPC distraction took not just 88K
    down, but Moto too.

    This is the stage we would be at when I say that in my design,
    the 32-register normal register banks would have OoO, but the
    128-register extended register banks wouldn't.

    If things have progressed further, so that RISC chips have "big"
    OoO and CISC chips have "great big" OoO, by the definitions I've
    given above, then my design, to keep up, would have to provide
    "big" OoO for the normal register files, but only "normal" OoO
    for the extended register files.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 15:02:15 2025
    On Thu, 17 Jul 2025 4:10:18 +0000, John Savard wrote:

    On Thu, 17 Jul 2025 04:01:44 +0000, John Savard wrote:

    If things have progressed further, so that RISC chips have "big" OoO and
    CISC chips have "great big" OoO, by the definitions I've given above,
    then my design, to keep up, would have to provide "big" OoO for the
    normal register files, but only "normal" OoO for the extended register
    files.

    I think that a clarification is in order here.

    I don't know, but I strongly suspect, that what you term
    "Great Big out of order" is what I've called just "big"
    out-of-order, and that this is already well past the point
    of diminishing returns.

    So that what I've called "great big" out of order is
    instead something so far past the point of sanity that
    I don't have to worry about it happening in real life.

    Another way to look at it is::

    Once the instruction queueing mechanism gets big enough that
    instruction scheduling is no longer needed, the compiler's
    job is simply to produce the fewest number of instructions
    that performs the semantic duties at hand.

    But I could be wrong.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 14:49:16 2025
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    On Wed, 16 Jul 2025 23:58:41 +0000, MitchAlsup1 wrote:

    Point of order:: all register files that have the same width (64-bits)
    should be a single file.

    This relates to a point that occurred to me.

    Many CISC microprocessors had register banks of eight registers.

    RISC had register banks of 32 registers, which they thought would
    avoid the need for OoO. Increasing performance demands, though,
    made that no longer true.

    Somewhat true, but basically untrue--from a 1983 perspective::

    Some RISC marketeers stated that the big RF was like a programmable
    cache--even though it was not.

    We the designers and architects know OoO was waiting until enough
    transistors could be had.

    Well, then, the extended register banks with 128 registers in them...

    AMD 29K ?!?

    are there to be used by programs intended to run on chips that don't
    have OoO. If an implementation does have OoO, nothing is to be gained
    by bothering with those registers (which still won't have rename
    registers associated with them, even in an OoO implementation; so
    OoO won't work on the parts of the program that use them).

    Somewhat right:: given a pool of 128 (or 256) rename registers,
    one can make even an 8-register machine run fast and rather
    well. The thing is that a 32-register ISA has a 15%-18% speed
    advantage over an 8-register machine--whereas a 64 register
    machine only has a 3% speed advantage over a 32 register
    machine (MIPS circa 1982). At some point not having enough
    registers hurts (that somewhere is in the mid 20-s of registers)
    and at some point the size of the RF limits read performance
    (that somewhere is between 32 registers and 64 registers).

    1st generation RISC machines would read RF in ½ cycle and
    perform forwarding in that same decode (1) cycle. As machines
    got faster and as renaming took hold reading RF became 1 cycle
    and forwarding became another 1 cycle (around 2 GHz).

    Like fast context switches, which multiple register files PREVENTS !!

    You could have an operating system that neglects to save certain
    register files on interrupts, which means programs can't use them.
    (There's a precedent: the Commodore 64 didn't save the status bit for
    decimal mode, so user programs couldn't use that feature of the 6502.)

    Any system that is not completely transparent to interrupts is
    a pain in the a$$ for user applications.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 15:16:59 2025
    On Thu, 17 Jul 2025 4:24:18 +0000, John Savard wrote:

    On Thu, 17 Jul 2025 04:10:18 +0000, John Savard wrote:

    I think that a clarification is in order here.

    And, come to think of it, _another_ clarification may be needed.

    If a register file with "normal" out-of-order, three rename
    registers for each register, and 32 registers to a register bank, is

    "equivalent" in performance to a register file with 128 registers and
    no OoO support,

    then the latter provides no performance benefit, so what is it there
    for?

    That is NOT HOW ONE RENAMES !!!!!!!

    One has a pool of rename registers. DECODE has a demand for rename
    registers (Rd), and retire has a supply of rename registers (Write RD
    into RF), and ANY rename register can stand in for ANY architectural
    register. As long as the pool is not depleted, everything runs smoothly.
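
    The pool idea can be sketched with a map table and a free list. This
    is a minimal Python sketch under stated assumptions: it uses the
    unified physical-register-file style of renaming, which may differ
    in detail from the organization described above, and the class and
    method names (Renamer, decode, retire) are hypothetical. The sizes
    32 and 128 are taken from the thread.

```python
class Renamer:
    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in physical register i.
        self.map = list(range(num_arch))
        # Every other physical register starts out in the free pool.
        self.free = list(range(num_arch, num_phys))
        # (arch, new_phys, old_phys) for each in-flight write, in order.
        self.in_flight = []

    def decode(self, dest):
        """Demand side: take any free register for a write to `dest`.

        Returns the physical register, or None if DECODE must stall.
        """
        if not self.free:
            return None
        new = self.free.pop(0)
        self.in_flight.append((dest, new, self.map[dest]))
        self.map[dest] = new
        return new

    def retire(self):
        """Supply side: the oldest write retires, freeing the old mapping."""
        dest, new, old = self.in_flight.pop(0)
        self.free.append(old)
        return old


r = Renamer(num_arch=32, num_phys=128)
# Write the same architectural register over and over.
results = [r.decode(5) for _ in range(97)]
print(sum(x is not None for x in results))  # renames granted before a stall
print(r.retire(), r.decode(7))              # a freed register serves another
```

    Nothing ties a physical register to a particular architectural one:
    the register freed by retiring a write to register 5 is handed
    straight to a write to register 7.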

    Someone might ask that who wasn't following my discussion of the
    Concertina II design. So I think I had better re-iterate the point:

    What the 128-register extended register files are _for_ is to
    provide better performance on implementations that don't have OoO
    at all, for any of the registers. They're also present on
    implementations with OoO for *compatibility* reasons.

    Registers   CPU time   Reg Access
            8       1.30          1/4
           16       1.15          1/3
           32       1.00          2/4
           64       0.97          3/4
          128       0.96          4/4

    32 is the knee of the curve. HW always wants to operate at the knee
    of the curve (Bill Moyer 1982). If you make the Register file as
    big as the cache it will take just as long as the cache to access
    (Andy Glew circa 1995).
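
    The "knee" can be read off the CPU-time column by computing the gain
    from each doubling of the register count. This is a minimal Python
    sketch; the numbers are copied from the table above, and the function
    name gain_per_doubling is hypothetical.

```python
# Relative CPU time by architectural register count (from the table above).
cpu_time = {8: 1.30, 16: 1.15, 32: 1.00, 64: 0.97, 128: 0.96}


def gain_per_doubling(table):
    """Fractional CPU-time reduction for each step to the next larger size."""
    sizes = sorted(table)
    return [
        (small, big, (table[small] - table[big]) / table[small])
        for small, big in zip(sizes, sizes[1:])
    ]


for small, big, gain in gain_per_doubling(cpu_time):
    print(f"{small:3d} -> {big:3d} registers: {gain:.1%} less CPU time")
```

    The first two doublings each save more than 10% of CPU time; going
    past 32 saves 3% and then about 1%, before any access-time penalty
    is counted.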

    The only good arguments I have heard wrt big architectural register
    files have to do with things like Register-Windows and/or optimizing
    CALL/RET interface.

    BUT (the big but) adding cycles to the pipeline degrades performance
    for ALL instructions, not just the ones that use registers 32..128 !!
    {{like doubling the size of L2 and adding 1 cycle of added latency
    ends up running slower 50% of the time--choose your L2 latency with
    care (Przybylski).}}

    This is what may not be clear to some.

    Some == You

    John Savard

  • From Anton Ertl@21:1/5 to John Savard on Thu Jul 17 16:28:56 2025
    John Savard writes:
    With this definition of the typical implementation in each size
    class of OoO, one can construct a mythical history of sorts.

    In "the beginning", CISC chips had normal OoO, and RISC chips
    did not have OoO.

    Very mythical.

    No VAX (*the* CISC at the time of RISC development) implementation had
    OoO, ever.

    No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
    during the first 10 years of IA-32's existence.

    HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
    one day after IA-32.

    MIPS (RISC) had an OoO implementation since January 1996, i.e., two
    months after IA-32.

    Since the CISC chips had register files of 8
    registers,

    VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

    and the RISC chips had register files of 32 registers,

    ARM A32 has 16 registers.

    (Given cache misses,
    maybe the RISC chips still needed scoreboards.)

    Apart from MIPS R2000/R3000, every RISC has waited for results to
    become ready. In the beginning stopping the pipeline was a way to do
    this, but once more silicon became available and performance demands
    increased, other instructions often continued as far as possible.
    That was often called scoreboarding, but Mitch Alsup tells us that
    scoreboarding on the 66000 was something more sophisticated.

    And then the RISC chips got "normal" OoO, and to keep up, the
    CISC chips got "big" OoO.

    It would be an interesting task to unearth some PA-8000 or R10000
    machine, and use Henry Wong's (IIRC) methodology to determine the
    reorder buffer size and the renaming capacity of these chips. For the
    Coppermine Pentium III the reorder buffer size is 40 entries.

    If things have progressed further, so that RISC chips have "big"
    OoO and CISC chips have "great big" OoO

    You can find my collection of sizes of OoO resources on

    https://www.complang.tuwien.ac.at/anton/robsize/

    If we look in the year 2020, we see the following number of physical
    GPRs:

    192 Zen3 (AMD64, CISC)
    280 Sunny/Willow Cove (AMD64, CISC)
    354 Apple M1 Firestorm (ARM A64, RISC)
    192 Samsung M5 (ARM A64, RISC)

    So there does not seem to be a systematic difference in the number of
    physical GPRs between CISC and RISC in recent years.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Savard@21:1/5 to All on Thu Jul 17 18:03:25 2025
    On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    You could have an operating system that neglects to save certain
    register files on interrupts, which means programs can't use them.
    (There's a precedent: the Commodore 64 didn't save the status bit for
    decimal mode, so user programs couldn't use that feature of the 6502.)

    Any system that is not completely transparent to interrupts is a pain in
    the a$$ for user applications.

    In no way am I denying this.

    As you noted that having multiple register files makes context
    switching slower, though, I was simply noting that... one can always
    just ignore the extra register files. Making them useless, by not
    saving them and restoring them in the interrupt handler, forces user
    programs to avoid using those registers. And if they can't use them,
    they don't have to save and restore them, and so context switching is
    speeded up!
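
    The saving can be put in rough numbers. This is a minimal Python
    sketch with loudly stated assumptions: 64-bit registers, and exactly
    one regular (32-register) and one extended (128-register) bank each
    for integer and floating-point. The bank names and the function name
    context_bytes are hypothetical, and vector and status registers are
    ignored.

```python
BYTES_PER_REGISTER = 8  # assumption: every register is 64 bits wide

# Assumed bank layout: one regular and one extended bank per register kind.
BANKS = {
    "integer (regular)": 32,
    "floating-point (regular)": 32,
    "integer (extended)": 128,
    "floating-point (extended)": 128,
}


def context_bytes(saved_banks):
    """Bytes written (and later read back) per context switch."""
    return sum(BANKS[name] for name in saved_banks) * BYTES_PER_REGISTER


regular_only = context_bytes(["integer (regular)", "floating-point (regular)"])
everything = context_bytes(BANKS)
print(regular_only, everything, everything / regular_only)
```

    Under these assumptions, skipping the extended banks cuts the saved
    state by a factor of five.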

    An ISA can be implemented in different ways, and it can be used in
    different ways.

    John Savard

  • From Anton Ertl@21:1/5 to John Savard on Thu Jul 17 17:06:50 2025
    John Savard writes:
    You could have an operating system that neglects to save certain register
    files on interrupts, which means programs can't use them. (There's a
    precedent: the Commodore 64 didn't save the status bit for decimal mode,
    so user programs couldn't use that feature of the 6502.)

    That's bullshit. The 6502/6510 saves the P register, which contains
    the decimal bit, on an interrupt, and RTI restores it. What is
    apparently necessary is that the interrupt handler clears (or sets)
    the decimal flag if it uses adc or sbc. But given that some people
    have actually made use of decimal mode
    <https://www.lemon64.com/forum/viewtopic.php?p=63214&sid=4146454cf46e58e74dff37b0a5f5f6b4#p63214>,
    the interrupt handlers by Commodore obviously do that correctly.
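
    The save/clear/restore sequence can be sketched as a toy model. This
    is a minimal Python sketch, not an emulator: it tracks only the P
    register and a stack, omits the return address and the other flag
    changes the hardware makes on interrupt entry, and the class and
    method names are hypothetical. The decimal flag being bit 3 of P is
    standard 6502 behaviour.

```python
DECIMAL_FLAG = 0x08  # bit 3 of the 6502 status register P


class Cpu:
    def __init__(self, p=0):
        self.p = p
        self.stack = []

    def interrupt(self):
        # On interrupt entry the hardware pushes P (simplified: the
        # return address is pushed too, but is not modelled here).
        self.stack.append(self.p)

    def cld(self):
        # What a handler does before using adc/sbc in binary mode.
        self.p &= ~DECIMAL_FLAG & 0xFF

    def rti(self):
        # RTI pulls P back, restoring the interrupted program's flags.
        self.p = self.stack.pop()


cpu = Cpu(p=DECIMAL_FLAG)   # user program is running in decimal mode
cpu.interrupt()
cpu.cld()                   # handler switches to binary arithmetic
in_handler = bool(cpu.p & DECIMAL_FLAG)
cpu.rti()
after_return = bool(cpu.p & DECIMAL_FLAG)
print(in_handler, after_return)
```

    The handler runs with the flag cleared, and the user program gets
    its decimal mode back on return.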

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 18:27:26 2025
    On Thu, 17 Jul 2025 18:03:25 +0000, John Savard wrote:

    As you noted that having multiple register files makes context
    switching slower, though, I was simply noting that... one can always
    just ignore the extra register files.

    And speaking of ignoring features:

    You've noted that it doesn't really make sense to include Cray-style
    vector capabilities on a modern microprocessor, since they require
    large memory bandwidth to be effective.

    Since I'm designing the Concertina II ISA to support every need, I
    do include such capabilities in my Long Vector instructions.

    A vector register, in my architecture, consists of storage for
    64 double-precision floating-point numbers. Those are certainly a
    big pain to save and restore.

    So what I envisage is, even in implementations of the ISA which don't
    subset out the long vector feature, there will only be one set of
    long vector registers in any core, and the operating system will
    disable the feature by default. Programs will have to request use of
    the long vector capability, and those programs will, basically, run
    in batch mode (they can still communicate with the user, if they take
    over the computer, e.g. as a video game does) - just one such program runs
    at a time, to reduce the need for saving and restoring all those
    registers.

    The ISA will include a number of features with the characteristic that
    they are capable of improving performance when used appropriately - and
    that they will massively degrade performance if used badly.

    I consider that sort of thing the programmer's problem. I'm looking more
    at designing an ISA that can be used to make supercomputers than an ISA
    that is idiot proof. Particularly as an old saying claims that doing the
    latter is a losing battle: they'll always design better idiots.

    John Savard

  • From Stephen Fuld@21:1/5 to John Savard on Thu Jul 17 12:20:18 2025
    On 7/17/2025 11:03 AM, John Savard wrote:
    On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    You could have an operating system that neglects to save certain
    register files on interrupts, which means programs can't use them.
    (There's a precedent: the Commodore 64 didn't save the status bit for
    decimal mode, so user programs couldn't use that feature of the 6502.)

    Any system that is not completely transparent to interrupts is a pain in
    the a$$ for user applications.

    In no way am I denying this.

    As you noted that having multiple register files makes context
    switching slower, though, I was simply noting that... one can always
    just ignore the extra register files. Making them useless, by not
    saving them and restoring them in the interrupt handler, forces user
    programs to avoid using those registers. And if they can't use them,
    they don't have to save and restore them, and so context switching is
    speeded up!

    John,
    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?


    An ISA can be implemented in different ways, and it can be used in
    different ways.

    Of course.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to Anton Ertl on Thu Jul 17 19:21:48 2025
    On Thu, 17 Jul 2025 16:28:56 +0000, Anton Ertl wrote:

    John Savard writes:
    With this definition of the typical implementation in each size
    class of OoO, one can construct a mythical history of sorts.

    In "the beginning", CISC chips had normal OoO, and RISC chips
    did not have OoO.

    Very mythical.

    No VAX (*the* CISC at the time of RISC development) implementation had
    OoO, ever.

    No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
    during the first 10 years of IA-32's existence.

    HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
    one day after IA-32.

    MIPS (RISC) had an OoO implementation since January 1996, i.e., two
    months after IA-32.

    Since the CISC chips had register files of 8
    registers,

    VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

    and the RISC chips had register files of 32 registers,

    ARM A32 has 16 registers.

    (Given cache misses,
    maybe the RISC chips still needed scoreboards.)

    Apart from MIPS R2000/R3000, every RISC has waited for results to
    become ready. In the beginning stopping the pipeline was a way to do
    this, but once more silicon became available and performance demands
    increased, other instructions often continued as far as possible.
    That was often called scoreboarding, but Mitch Alsup tells us that
    scoreboarding on the 66000 was something more sophisticated.

    CDC 6600 (no third 0) had a SB that scheduled the start of an
    instruction (at register read) and then later scheduled the completion
    of an instruction (register write) in a way that prevented RAW, WAR,
    and WAW hazards.

    Renaming gets rid of the xAW hazards in SBs, Reservation Stations,
    Dispatch stacks, and others.
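
    The effect of renaming on the three hazard types can be shown on a
    three-instruction example. This is a minimal Python sketch: each
    instruction is a (destination, sources) pair, the rename function
    assumes an unbounded supply of physical registers, and all names are
    hypothetical.

```python
def hazards(first, second):
    """Hazards between two instructions; `first` is earlier in program order."""
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 in s2:
        found.add("RAW")  # second reads what first writes
    if d2 in s1:
        found.add("WAR")  # second overwrites what first reads
    if d1 == d2:
        found.add("WAW")  # both write the same register
    return found


def rename(program, num_arch):
    """Give every write a fresh physical register (unbounded pool assumed)."""
    mapping = {r: r for r in range(num_arch)}
    next_phys = num_arch
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(mapping[s] for s in srcs)  # read current mappings
        mapping[dest] = next_phys                   # then remap the target
        renamed.append((next_phys, new_srcs))
        next_phys += 1
    return renamed


# r1 = r2 op r3 ; r2 = r1 op r4 ; r1 = r5 op r6
program = [(1, (2, 3)), (2, (1, 4)), (1, (5, 6))]
renamed = rename(program, num_arch=8)

pairs = [(0, 1), (0, 2), (1, 2)]
before = [hazards(program[a], program[b]) for a, b in pairs]
after = [hazards(renamed[a], renamed[b]) for a, b in pairs]
print(before)
print(after)
```

    Before renaming the example contains all three hazard types; after
    renaming only the true RAW dependence between the first two
    instructions remains.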

    And then the RISC chips got "normal" OoO, and to keep up, the
    CISC chips got "big" OoO.

    It would be an interesting task to unearth some PA-8000 or R10000
    machine, and use Henry Wong's (IIRC) methodology to determine the
    reorder buffer size and the renaming capacity of these chips. For the
    Coppermine Pentium III the reorder buffer size is 40 entries.

    If things have progressed further, so that RISC chips have "big"
    OoO and CISC chips have "great big" OoO

    You can find my collection of sizes of OoO resources on

    https://www.complang.tuwien.ac.at/anton/robsize/

    If we look in the year 2020, we see the following number of physical
    GPRs:

    192 Zen3 (AMD64, CISC)
    280 Sunny/Willow Cove (AMD64, CISC)
    354 Apple M1 Firestorm (ARM A64, RISC)
    192 Samsung M5 (ARM A64, RISC)

    So there does not seem to be a systematic difference in the number of
    physical GPRs between CISC and RISC in recent years.

    Agreed, you add physical registers and execution window space until
    you run out of a) area, or b) power, or c) (less likely) pipeline
    stages.

    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to Stephen Fuld on Thu Jul 17 19:52:05 2025
    On Thu, 17 Jul 2025 19:20:18 +0000, Stephen Fuld wrote:

    On 7/17/2025 11:03 AM, John Savard wrote:
    On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    You could have an operating system that neglects to save certain
    register files on interrupts, which means programs can't use them.
    (There's a precedent: the Commodore 64 didn't save the status bit for
    decimal mode, so user programs couldn't use that feature of the 6502.)

    Any system that is not completely transparent to interrupts is a pain
    in the a$$ for user applications.

    In no way am I denying this.

    As you noted that having multiple register files makes context switching
    slower, though, I was simply noting that... one can always just ignore
    the extra register files. Making them useless, by not saving them
    and restoring them in the interrupt handler, forces user programs to
    avoid using those registers. And if they can't use them, they don't have
    to save and restore them, and so context switching is speeded up!

    John,
    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    While waiting for John's response::

    Combined registers WITH universal constants is better* than separate
    register files WITHOUT universal constants.

    (*) maybe "not worse than on average" is more accurate than "better
    than on average". One can find cases that go either way.

    History:: Circa 1999-2004 K9 project at AMD, we discovered that the
    x86-64 16 register file, the x87 Register File, and the (then) MMX
    (soon to be) AVX register file could all be serviced by a single
    rename register pool. Which is what we did. THEN to deal with "all
    those FUs" we designed the data path as a short path {integer, logical,
    AGEN, Shift} and a long path {IMUL, IDIV, all multicycle FP, all SIMD}.
    The long path was 1 cycle farther down the pipeline (wire delay) so
    we had those FUs send the result tags 1-cycle earlier, and had 2-stages
    of forwarding. K9's PRF was 160-odd registers, instead of an arbitrary
    40-for integer, 40-for FP, 40-for SIMD, and 40-for memory. It is the
    same reason you can put any kind of data in the data cache !!! The
    unified renamer ran out of registers a LOT less often than partitioned
    renamer.
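
    A toy sketch of why a shared pool stalls less often than fixed
    partitions of the same total size. The 40/40/40/40 split and the total
    of 160 come from the numbers above; the demand samples are made up for
    illustration and are not K9 trace data.

```python
def partitioned_stalls(samples, per_class):
    """Count samples where any one class needs more than its fixed share."""
    return sum(1 for s in samples if any(need > per_class for need in s))


def unified_stalls(samples, total):
    """Count samples where the combined need exceeds the shared pool."""
    return sum(1 for s in samples if sum(s) > total)


# Hypothetical in-flight register demand per class: (integer, FP, SIMD, memory)
SAMPLES = [
    (70, 10, 5, 30),    # integer-heavy phase
    (20, 60, 40, 20),   # FP-heavy phase
    (30, 30, 30, 30),   # balanced
    (45, 45, 45, 45),   # everything busy at once
]

print(partitioned_stalls(SAMPLES, per_class=40))  # 3
print(unified_stalls(SAMPLES, total=160))         # 1
```

    Any sample that fits the partitions also fits the unified pool, so the
    unified count can never be the larger of the two.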

    The long data path added less than 1% overhead to instruction execution
    latency (as measured on 5,000-odd traces of 1B instructions each from
    Opteron).


    An ISA can be implemented in different ways, and it can be used in
    different ways.

    Of course.



  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 20:04:28 2025
    On Thu, 17 Jul 2025 18:27:26 +0000, John Savard wrote:

    On Thu, 17 Jul 2025 18:03:25 +0000, John Savard wrote:

    As you noted that having multiple register files makes context switching
    slower, though, I was simply noting that... one can always just ignore
    the extra register files.

    And speaking of ignoring features:

    You've noted that it doesn't really make sense to include Cray-style
    vector capabilities on a modern microprocessor, since they require
    large memory bandwidth to be effective.

    Since I'm designing the Concertina II ISA to support every need, I
    do include such capabilities in my Long Vector instructions.

    A vector register, in my architecture, consists of storage for
    64 double-precision floating-point numbers. Those are certainly a
    big pain to save and restore.

    The problem with CRAY-like vector architectures is that you have
    to fundamentally build the memory units to send 2 new fetches
    and 1 write beyond L1 cache every cycle. That is, the L2 has to
    service 2 LD misses and 1 ST miss every cycle continuously
    forever !!! and so does the Bus interface !!! and so does DRAM !!!

    If you can do the above, the overhead of LDing and STing the file
    becomes inconsequential.

    Few can afford the costs to build a CRAY-like vector machine
    more due to pin counts, interconnect, and DRAM memory; than CPU
    internals.
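
    A back-of-the-envelope sketch of the sustained rate described above.
    The 8-byte operand size and the 3 GHz clock are assumptions picked for
    illustration, and the function name is hypothetical.

```python
def sustained_bytes_per_second(loads, stores, bytes_per_access, clock_hz, lanes=1):
    """Bytes per second the memory system must sustain beyond L1."""
    accesses_per_cycle = (loads + stores) * lanes
    return accesses_per_cycle * bytes_per_access * clock_hz


# Assumed figures: 2 LD + 1 ST per cycle, 8-byte doubles, 3 GHz, one lane.
one_lane = sustained_bytes_per_second(2, 1, 8, 3_000_000_000)
print(one_lane / 1e9, "GB/s")   # 72.0 GB/s
```

    Every level from L2 through the bus interface to DRAM would have to
    hold that rate continuously, and it scales linearly with the number of
    lanes.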

    So what I envisage is, even in implementations of the ISA which don't
    subset out the long vector feature, there will only be one set of
    long vector registers in any core, and the operating system will
    disable the feature by default. Programs will have to request use of
    the long vector capability, and those programs will, basically, run
    in batch mode (they can still communicate with the user, if they take
    over the computer, i.e. as a video game) - just one such program runs
    at a time, to reduce the need for saving and restoring all those
    registers.

    What do you do with a program that requests said feature and it is NOT
    PRESENT !! (That is compatible up and down)

    The ISA will include a number of features with the characteristic that
    they are capable of improving performance when used appropriately - and
    that they will massively degrade performance if used badly.

    I consider that sort of thing the programmer's problem.

    Like putting a driver who has never experienced anything but a road
    car in a rally car where each circuit is on its own switch and relay
    so the (crashed) car can still finish the race.

    I'm looking more
    at designing an ISA that can be used to make supercomputers than an ISA
    that is idiot proof.

    You cannot make anything that is Turing complete that is also idiot proof--idiots are more clever than that.

    Particularly as an old saying claims that doing the latter is a losing battle: they'll always design better idiots.

    Idiots will stumble upon ways you never thought of.

    John Savard

  • From John Savard@21:1/5 to Stephen Fuld on Thu Jul 17 20:32:44 2025
    On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    I had thought the _current_ conversation was about how context switching
    was made more painful by my additional sets of 128 registers.

    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    - integers and floating-point numbers are different in format, so
    it is not useful to perform operations meant for one type of number
    on the other type of number;

    - the opcode indicates whether an operation is an integer operation
    or a floating-point operation, so having these two sets of registers
    lets you have twice as many registers without having to add a bit
    to the register fields in the instruction.

    The third point, of course, is the only _real_ advantage.

    *And* my architecture is specified as performing a transformation on floating-point numbers during load and store operations to make register-to-register arithmetic faster.

    So I'm taking advantage of the need for separate floating-point
    load and store instructions to derive a performance gain from
    them. (The problem is that the hidden first bit and denormals,
    if properly handled, only involve a small number of gate delays,
    so this is unlikely to produce a genuine advantage.)
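
    A rough sketch of what such a load/store transformation could look
    like for IEEE 754 doubles: make the hidden bit explicit and
    pre-normalise denormals on load, then undo that on store. The internal
    format used here (sign, unbiased exponent, 53-bit significand) is an
    assumption for illustration, not the Concertina II specification, and
    infinities and NaNs are left out.

```python
import struct

FRAC_BITS = 52
BIAS = 1023


def load_unpack(x: float):
    """Return (sign, exponent, significand); value = significand * 2**(exponent - 52)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    biased = (bits >> FRAC_BITS) & 0x7FF
    frac = bits & ((1 << FRAC_BITS) - 1)
    if biased == 0x7FF:
        raise ValueError("infinities and NaNs are not handled in this sketch")
    if biased == 0:
        if frac == 0:
            return sign, 0, 0                      # signed zero
        # Denormal: shift left until the leading 1 sits in bit 52.
        shift = (FRAC_BITS + 1) - frac.bit_length()
        return sign, -1022 - shift, frac << shift
    return sign, biased - BIAS, frac | (1 << FRAC_BITS)   # hidden bit made explicit


def store_pack(sign: int, exponent: int, significand: int) -> float:
    """Inverse of load_unpack for values that load_unpack produced."""
    if significand == 0:
        bits = sign << 63
    elif exponent < -1022:
        # Back to denormal form: drop the normalising shift again.
        bits = (sign << 63) | (significand >> (-1022 - exponent))
    else:
        frac = significand & ((1 << FRAC_BITS) - 1)       # hide the leading 1
        bits = (sign << 63) | ((exponent + BIAS) << FRAC_BITS) | frac
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

    With this layout the arithmetic units never have to test for a zero
    exponent field, which is where the few gate delays would be saved.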

    John Savard

  • From John Savard@21:1/5 to All on Thu Jul 17 21:00:36 2025
    On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

    Few can afford the costs to build a CRAY-like vector machine more due to
    pin counts, interconnect, and DRAM memory; than CPU internals.

    I do remember seeing a photo of the CPU from the NEC SX6;
    it was connected to a memory bus that was sixteen times as
    wide as a normal PC memory bus.

    Using dual-channel or quad-channel memory on high-performance
    PCs is quite conventional these days.

    So, while I agree that doing what the SX6 did is not
    likely to be feasible except in extreme cases, splitting
    the difference and using an eight-channel interface is
    something I suspect would be doable.

    Thus, I figure that *half* the performance of a NEC Aurora
    TSUBASA would already be good enough to be a big improvement
    over conventional microprocessors.

    Initially, before having your input, the reason I started putting
    in a CRAY-like vector feature in the original Concertina design,
    aside from simply wanting to explain how it worked, and how it
    differed from modern SIMD vector designs, was that...

    microprocessors seemed to have evolved from 8-bit designs to
    minicomputer-like designs to mainframe-like designs. The
    Pentium Pro and Pentium II took this to the limit of mainframe-like
    designs, by attaining a performance-oriented design strongly
    resembling the IBM System/360 Model 195.

    What else did computers, prior to the microcomputer era do to be
    more powerful? What else remained for further progress? Well, there
    was _one_ old computer that went beyond the 195... the CRAY-I and
    those which followed it.

    So it seemed like there was still one step to take in making individual
    cores more powerful before going to the expedient of putting a
    parallel sysplex system on a chip (IBM-speak for multicore).

    That was naive reasoning, of course, so I can't really mount a
    strong counterargument against your claims that this is not workable.

    After all, although NEC _has_ managed to continue to make a
    vector supercomputer design for today, it's in a video card style
    form factor instead of being a CPU that sits directly on top of
    a motherboard.

    What do you do with program that request said feature and it is NOT
    PRESENT !! (That is compatible up and down)

    Obviously, there is _nothing_ that can be done with programs that
    require a feature that isn't implemented which is fully
    upwards and downwards compatible. For full compatibility, every
    feature must be implemented.

    The Concertina II ISA, however, is, as has been noted, rather
    bloated. So I don't see it as at all unreasonable to divide the
    architecture into a basic architecture, which all programs can
    expect to have available, and specialized features which are
    only present on special-purpose implementations.

    Just as you wouldn't write code for a Cray, or a TMS320C6000,
    and expect it to run on an IBM 360, the special-purpose
    implementations of Concertina II are different enough in
    function that they should be regarded as different machines -
    even if they are also able to run standard programs for the
    architecture.

    The option, though, is also present to implement all features
    for compatibility, but implement some features badly. So
    some register files are too enormous to put on the chip? Fine;
    put them in RAM!

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Thu Jul 17 21:29:57 2025
    On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:

    On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

    Few can afford the costs to build a CRAY-like vector machine more due to
    pin counts, interconnect, and DRAM memory; than CPU internals.

    I do remember seeing a photo of the CPU from the NEC SX6;
    it was connected to a memory bus that was sixteen times as
    wide as a normal PC memory bus.

    SX6 had 256-way interleaved memory with 2-stages of 16-ways each
    both outgoing and incoming.

    IIRC it was 4-lanes wide, which means each CPU port (2LD and 1ST)
    were equivalent to 8 LDs per cycle and 4 STs per cycle, and it
    did not have a data cache.

    Using dual-channel or quad-channel memory on high-performance
    PCs is quite conventional these days.

    Fully 2 decimal orders of magnitude too small.

    So, while I agree that doing what the SX6 did is not
    likely to be feasible except in extreme cases, splitting
    the difference and using an eight-channel interface is
    something I suspect would be doable.

    Thus, I figure that *half* the performance of a NEC Aurora
    TSUBASA would already be good enough to be a big improvement
    over conventional microprocessors.

    Initially, before having your input, the reason I started putting
    in a CRAY-like vector feature in the original Concertina design,
    aside from simply wanting to explain how it worked, and how it
    differed from modern SIMD vector designs, was that...

    microprocessors seemed to have evolved from 8-bit designs to minicomputer-like designs to mainframe-like designs. The
    Pentium Pro and Pentium II took this to the limit of mainframe-like
    designs, by attaining a performance-oriented design strongly
    resembling the IBM System/360 Model 195.

    Leaving "mainframes" to support features micros did not::
    mainly things like RAS, hot-plug, and uptime measured in
    years to decades without a crash.

    What else did computers, prior to the microcomputer era do to be
    more powerful? What else remained for further progress? Well, there
    was _one_ old computer that went beyond the 195... the CRAY-I and
    those which followed it.

    CRAY-like is the end of the In-Order evolution path*. Instructions
    are in-order, calculations are in-order/FU, memory is in-order/bank;
    and the rest is your problem.

    (*) discounting Mill.

    CDC 6600 was the beginning of the OoO path (discounting Stretch)
    but /91 introduced the technology to follow forward (Tomasulo).
    Patterson (et al.) developed Reorder Buffer and we have the modern
    design paradigm.

    The rest was widening the various paths (FETCH, DECODE, Execution Lanes)
    and making caches as big as cycle times could afford.

    So it seemed like there was still one step to take in making individual
    cores more powerful before going to the expedient of putting a
    parallel sysplex system on a chip (IBM-speak for multicore).

    That was naive reasoning, of course, so I can't really mount a
    strong counterargument against your claims that this is not workable.

    After all, although NEC _has_ managed to continue to make a
    vector supercomputer design for today, it's in a video card style
    form factor instead of being a CPU that sits directly on top of
    a motherboard.

    Nor does it need 25 tons of cooling.

    What do you do with program that request said feature and it is NOT
    PRESENT !! (That is compatible up and down)

    Obviously, there is _nothing_ that can be done with programs that
    require a feature that isn't implemented which is fully
    upwards and downwards compatible. For full compatibility, every
    feature must be implemented.

    The Concertina II ISA, however, is, as has been noted, rather
    bloated. So I don't see it as at all unreasonable to divide the
    architecture into a basic architecture, which all programs can
    expect to have available, and specialized features which are
    only present on special-purpose implementations.

    Just as you wouldn't write code for a Cray, or a TMS320C6000,
    and expect it to run on an IBM 360, the special-purpose
    implementations of Concertina II are different enough in
    function that they should be regarded as different machines -
    even if they are also able to run standard programs for the
    architecture.

    The option, though, is also present to implement all features
    for compatibility, but implement some features badly. So
    some register files are too enormous to put on the chip? Fine;
    put them in RAM!

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 21:26:52 2025
    On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:
    On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

    Few can afford the costs to build a CRAY-like vector machine more due
    to pin counts, interconnect, and DRAM memory; than CPU internals.

    Initially, before having your input, the reason I started putting in a CRAY-like vector feature in the original Concertina design,
    aside from simply wanting to explain how it worked, and how it differed
    from modern SIMD vector designs, was that...

    microprocessors seemed to have evolved from 8-bit designs to minicomputer-like designs to mainframe-like designs. The Pentium Pro and Pentium II took this to the limit of mainframe-like designs, by
    attaining a performance-oriented design strongly resembling the IBM System/360 Model 195.

    What else did computers, prior to the microcomputer era do to be more powerful? What else remained for further progress? Well, there was _one_
    old computer that went beyond the 195... the CRAY-I and those which
    followed it.

    And I should mention another reason why I seem to be resistant to
    heeding your undoubtedly good advice.

    The very reason the CRAY-I succeeded where the STAR-100 failed was that
    the CRAY-I was built around vector registers, and it did its
    arithmetic between those vector registers, only loading data into
    them from, and writing data out from them to, the main memory.

    In true RISC fashion.

    So the CRAY-I was a design *specifically intended* to make wise and
    sparing use of limited memory bandwidth.

    So when you come along and tell me that a CRAY-I style design can't
    possibly work without ginormous memory bandwidth, something in my
    gut just rebels at the very thought.

    Of course, the above does _not_ mean that you're wrong. The
    CRAY-I and the STAR 100 were both machines from the old days,
    before today's wide disparity between CPU speeds and memory
    speeds emerged. So if the CRAY-I managed to just sneak under the
    bar of what was needed to make memory *then* adequate to feed
    CPUs *then*, while the STAR 100 failed at that... then it _is_
    entirely reasonable to conclude that memory *now* is vastly
    inadequate to feed even a CRAY-I style CPU *now*, even if a
    STAR 100 style machine would be even worse.

    And the fact that a modern microprocessor can have a cache which
    is as big as the entire main memory of the CRAY-I is... not a
    defense when you consider that a smartphone today has more FLOPS
    than a CRAY-I did. In order for a CRAY-I like design to actually
    improve on the performance of a desktop CPU, it actually does
    have to have a balanced memory design.

    John Savard

  • From John Savard@21:1/5 to All on Thu Jul 17 21:45:54 2025
    On Thu, 17 Jul 2025 21:29:57 +0000, MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 21:00:36 +0000, John Savard wrote:

    Using dual-channel or quad-channel memory on high-performance PCs is
    quite conventional these days.

    Fully 2 decimal orders of magnitude too small.

    So, while I agree that doing what the SX6 did is not likely to be
    feasible except in extreme cases, splitting the difference and using an
    eight-channel interface is something I suspect would be doable.

    Obviously, "twice as fast" isn't enough if 100 times as fast is what
    is needed.

    I was coming back to mention, hey, we've also got HBM, but
    presumably even that will fall short. Although HBM is presumably
    what the NEC SX-Alpha TSUBASA is doing.

    Their cards have either 24 or 48 gigabytes of internal memory;
    24 is for development systems, 48 for real work.

    What I would see as "doable" is 8 or 16 gigabytes of HBM on a
    chip module with an eight-channel interface to main memory.

    That would not be cheap. That would not match the performance
    of a NEC SX-Alpha TSUBASA. But it would outdo an ordinary
    desktop CPU.

    Also, in reply to another comment you made which I didn't
    quote:

    Obviously, there is no such thing as a _rename_ vector register,
    so, yes, the vector engine would be in-order. One of the big
    strengths of the CRAY-I, as opposed to the STAR 100 and its
    other competition at the time was that Seymour Cray understood
    Amdahl's Law well enough to know that he couldn't neglect the
    performance of the scalar part of his vector machine.

    So the fact that an in-order CRAY-I style vector unit would be
    bolted on to a GBOoO scalar CPU is... accepted as inevitable,
    rather than seen as a contradiction, at least by this unworthy
    one.

    John Savard

  • From John Savard@21:1/5 to John Savard on Thu Jul 17 21:58:23 2025
    On Thu, 17 Jul 2025 21:45:54 +0000, John Savard wrote:

    So the fact that an in-order CRAY-I style vector unit would be bolted on
    to a GBOoO scalar CPU is... accepted as inevitable, rather than seen as
    a contradiction, at least by this unworthy one.

    And *if*, as I very strongly doubt, there was some reason why an
    out-of-order scalar unit would not mesh well with an in-order
    vector unit, with pipeline delays or something keeping the vector
    unit waiting too long for scalar results...

    have you forgotten that the Concertina II ISA _also_ includes those
    register banks with 128 registers in them, and those VLIW block formats
    with break bits and instruction predication, so as to attempt to
    approach OoO levels of performance within in-order designs?

    I am not really attempting to compete with Mill. I don't consider
    myself at all qualified to do so, and I am in awe at its amazing
    originality. What Concertina II does instead is approach in-order
    performance through well-known conventional means, without any
    originality.

    So if CRAY-I style vectors need high-performance *in-order* scalar
    arithmetic to complement them for some obscure reason of which I
    am not aware... Concertina II is ready!

    John Savard

  • From Stephen Fuld@21:1/5 to John Savard on Thu Jul 17 21:35:31 2025
    On 7/17/2025 1:32 PM, John Savard wrote:
    On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    I had thought the _current_ conversation was about how context switching
    was made more painful by my additional sets of 128 registers.

    Sorry, you are right. So you now have four sets of registers (integer,
    float, SIMD, additional)?

    That's a lot of chip area and wiring.


    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    So if everyone else jumped off a roof????


    - integers and floating-point numbers are different in format, so
    it is not useful to perform operations meant for one type of number
    on the other type of number;

    Not necessarily. There are things that one wants to do to floating
    point numbers that are "integer register like", such as extract the
    exponent. I think in a related post, someone (Mitch?) gave a more
    complete list. So you either have to provide extra instructions to do
    these things on the FP registers, or suffer the cost of instructions to
    move the value from the FP registers to the integer registers.

    - the opcode indicates whether an operation is an integer operation
    or a floating-point operation, so having these two sets of registers
    lets you have twice as many registers without having to add a bit
    to the register fields in the instruction.

    True. The question is, how valuable those extra registers are? If you
    already have 32 integer registers, isn't that enough for almost every
    purpose?


    The third point, of course, is the only _real_ advantage.

    *And* my architecture is specified as performing a transformation on floating-point numbers during load and store operations to make register-to-register arithmetic faster.

    Yes, I had forgotten about that.


    So I'm taking advantage of the need for separate floating-point
    load and store instructions to derive a performance gain from
    them. (The problem is that the hidden first bit and denormals,
    if properly handled, only involve a small number of gate delays,
    so this is unlikely to produce a genuine advantage.)

    OK, so scratch that advantage. :-(



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 11:18:32 2025
    On Thu, 17 Jul 2025 21:35:31 -0700, Stephen Fuld wrote:
    On 7/17/2025 1:32 PM, John Savard wrote:

    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    So if everyone else jumped off a roof????

    Then it would be more obvious they were making a mistake. Usually, if not always, what most people do, they do for a reason. And usually, if not
    always, that reason is actually good.

    John Savard

  • From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 11:28:56 2025
    On Thu, 17 Jul 2025 21:35:31 -0700, Stephen Fuld wrote:

    That's a lot of chip area and wiring.

    Yes, but then so is out-of-order execution. Having sets of 128 registers
    as register banks gives designers the option of dropping OoO and still
    having performance.

    I mean, that was the idea behind the Itanium!

    Oh, sorry, that may not be a recommendation. :)

    OK, so scratch that advantage. :-(

    Not so fast. A small number of gate delays may not seem like much.
    But if one is desperate to make floating-point operations as fast
    as possible in any way one can (I also plan to drop the demand of the
    IEEE 754 standard that division always produce the best rounded
    result, in order to speed up floating-point division by methods such
    as Goldschmidt and Newton-Raphson) then, depending on how many gate
    delays per cycle the rest of one's design leads to, those few gate
    delays just might shave off a cycle somewhere.
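
    A small sketch of the Newton-Raphson variant of this idea: refine an
    estimate of 1/b with multiplies only, then multiply by a. The initial
    estimate 48/17 - 32/17*m is the usual linear one for m in [0.5, 1);
    the function name and iteration counts are illustrative assumptions,
    b > 0 is assumed, and the result is close to but not guaranteed to be
    the correctly rounded quotient. Goldschmidt is not shown.

```python
import math


def nr_divide(a: float, b: float, iterations: int = 4) -> float:
    """Approximate a / b via a Newton-Raphson reciprocal (b > 0 assumed)."""
    m, e = math.frexp(b)              # b = m * 2**e with 0.5 <= m < 1
    x = 48 / 17 - (32 / 17) * m       # initial guess for 1/m, error <= 1/17
    for _ in range(iterations):
        x = x * (2.0 - m * x)         # each step roughly squares the error
    return math.ldexp(a * x, -e)      # a * (1/m) * 2**-e


print(nr_divide(1.0, 3.0, iterations=1))   # only about 2-3 good digits
print(nr_divide(1.0, 3.0, iterations=4))   # close to 1/3
```

    The quadratic convergence is what makes the method attractive in
    hardware; the cost is that the last bit may differ from what IEEE 754
    requires unless a correction step is added.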

    "desperate to make floating-point operations as fast as possible"?

    Yes, I'm channelling the ghost of Seymour Cray in more than one way.

    John Savard

  • From John Savard@21:1/5 to John Savard on Fri Jul 18 11:33:07 2025
    On Fri, 18 Jul 2025 11:28:56 +0000, John Savard wrote:

    (I also plan to drop the demand of the IEEE
    754 standard that division always produce the best rounded result, in
    order to speed up floating-point division by methods such as Goldschmidt
    and Newton-Raphson)

    So I don't _always_ do what everyone else does.

    John Savard

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Fri Jul 18 15:12:27 2025
    On Fri, 18 Jul 2025 4:35:31 +0000, Stephen Fuld wrote:

    On 7/17/2025 1:32 PM, John Savard wrote:
    On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    I had thought the _current_ conversation was about how context switching
    was made more painful by my additional sets of 128 registers.

    Sorry, you are right. So you now have four sets of registers (integer, float, SIMD, additional)?

    That's a lot of chip area and wiring.


    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    So if everyone else jumped off a roof????

    Where is Jim Jones when we need him ?!?

    - integers and floating-point numbers are different in format, so
    it is not useful to perform operations meant for one type of number
    on the other type of number;

    Not necessarily. There are things that one wants to do to floating
    point numbers that are "integer register like", such as extract the
    exponent. I think in a related post, someone (Mitch?) gave a more

    Yes, it was me.

    complete list. So you either have to provide extra instructions to do
    these things on the FP registers, or suffer the cost of instructions to
    move the value from the FP registers to the integer registers.

    Multiply by power of 2 is integer add to exponent
    Divide by power of 2 is integer subtract from exponent
    Copysign is essentially a MOV where 1 bit comes from S1 the rest come
    ... S2
    Exponent is Shift down by fraction size, mask off sign, and debias
    Fraction is mask off sign and exponent and create hidden (expon!=0)
    Split is mask off 1/2 the fraction bits and FSUB

    You see there are a lot of logical and shifting going on here,
    and a bit of integer.
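
    The list above can be sketched on IEEE 754 doubles with nothing but
    integer and logical operations on the bit pattern (plus one FSUB for
    the split). The helper names are assumptions for illustration, and
    scale_pow2 assumes a normal operand and a normal result, with no
    overflow or denormal handling.

```python
import struct

FRAC_BITS = 52
EXP_MASK = 0x7FF
BIAS = 1023
SIGN_BIT = 1 << 63


def to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def from_bits(b: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", b))[0]


def scale_pow2(x: float, k: int) -> float:
    """Multiply (k > 0) or divide (k < 0) by 2**|k|: integer add to the exponent."""
    return from_bits(to_bits(x) + (k << FRAC_BITS))


def copysign(sign_src: float, rest_src: float) -> float:
    """A MOV where one bit comes from sign_src and the rest from rest_src."""
    return from_bits((to_bits(sign_src) & SIGN_BIT) | (to_bits(rest_src) & (SIGN_BIT - 1)))


def exponent(x: float) -> int:
    """Shift down by the fraction size, mask off the sign, and debias."""
    return ((to_bits(x) >> FRAC_BITS) & EXP_MASK) - BIAS


def fraction(x: float) -> int:
    """Mask off sign and exponent, then create the hidden bit if exponent != 0."""
    bits = to_bits(x)
    hidden = 1 if ((bits >> FRAC_BITS) & EXP_MASK) != 0 else 0
    return (bits & ((1 << FRAC_BITS) - 1)) | (hidden << FRAC_BITS)


def split(x: float):
    """Mask off half the fraction bits for the high part, then FSUB for the low part."""
    hi = from_bits(to_bits(x) & ~((1 << 26) - 1))
    return hi, x - hi


print(scale_pow2(3.0, 4))     # 48.0
print(copysign(-1.0, 2.5))    # -2.5
print(exponent(8.0))          # 3
```

    Everything except the subtraction in split is shift, mask, and integer
    add, which is the point being made about needing a near-complete
    integer unit next to the FP file.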

    So if you have all the logical, integer + and - and shifts--you basically
    have a complete integer FU. So, then, you are in the position to use
    the FP file as an extended integer file. And at that point, what did
    having a separate file BUY ?!? What you really have is a unified FP
    file and a degenerate integer file.

    AND THEN there is IMUL and IDIV--these are fairly easy to perform
    over in the FMAC, so now you need a path from IRF to FMAC, and
    you end up doing almost everything in the FPU !!! So, why do this
    to yourself ??

    AND THEN there is context switching...
    AND THEN there is the OS wanting to use SIMD for page-sized MOVes...

    - the opcode indicates whether an operation is an integer operation
    or a floating-point operation, so having these two sets of registers
    lets you have twice as many registers without having to add a bit
    to the register fields in the instruction.

    True. The question is, how valuable those extra registers are? If you already have 32 integer registers, isn't that enough for almost every purpose?


    The third point, of course, is the only _real_ advantage.

    *And* my architecture is specified as performing a transformation on
    floating-point numbers during load and store operations to make
    register-to-register arithmetic faster.

    K6 and K7 (Athlon) did those kinds of things--we took that out in
    Opteron due to lots of boundary conditions (especially MMX and SSE
    stuff).

    Yes, I had forgotten about that.


    So I'm taking advantage of the need for separate floating-point
    load and store instructions to derive a performance gain from
    them. (The problem is that the hidden first bit and denormals,
    if properly handled, only involve a small number of gate delays,
    so this is unlikely to produce a genuine advantage.)

    You will be surprised during verification.

    OK, so scratch that advantage. :-(



  • From Waldek Hebisch@21:1/5 to John Savard on Fri Jul 18 15:39:58 2025
    John Savard <[email protected]d> wrote:
    On Thu, 17 Jul 2025 20:04:28 +0000, MitchAlsup1 wrote:

    Few can afford the costs to build a CRAY-like vector machine more due to
    pin counts, interconnect, and DRAM memory; than CPU internals.

    I do remember seeing a photo of the CPU from the NEC SX6;
    it was connected to a memory bus that was sixteen times as
    wide as a normal PC memory bus.

    Using dual-channel or quad-channel memory on high-performance
    PCs is quite conventional these days.

    So, while I agree that doing what the SX6 did is not
    likely to be feasible except in extreme cases, splitting
    the difference and using an eight-channel interface is
    something I suspect would be doable.

    Thus, I figure that *half* the performance of a NEC Aurora
    TSUBASA would already be good enough to be a big improvement
    over conventional microprocessors.

    IIUC, on a modern machine AVX instructions can saturate the L1 cache.
    And when you do multiply-and-add they use all available power.
    So this looks very well balanced: to get any improvement in
    performance you need both more power-efficient execution units
    (which probably means lowering the clock, increasing latency, and
    compensating by having more execution units) and more L1 bandwidth.
    More L1 bandwidth probably also means more latency. One can
    probably get some extra performance when the computation fits within
    the register set. But the current AVX register set is probably the
    limit of what is possible: more registers mean more latency on
    access, and bigger registers need more wires and more area, which
    means longer wires, which also leads to slower access.

    Once your computation needs data from a deeper level of the cache
    hierarchy you have plenty of compute power but are limited by
    data access (both bandwidth and latency).

    So there is simply no way long vector registers could help.

    You could try some GPU tricks, but usefully blending GPU and
    CPU looks tricky: a GPU wants to run at a relatively low clock
    frequency, and slowing down the CPU clock would severely limit
    single-thread performance. High-end GPUs use special memory which
    offers higher bandwidth, but has lower capacity than normal
    memory.

    What do you do with a program that requests said feature and it is NOT
    PRESENT !! (That is compatible up and down)

    Obviously, there is _nothing_ that can be done with programs that
    require a feature that isn't implemented which is fully
    upwards and downwards compatible. For full compatibility, every
    feature must be implemented.

    The Concertina II ISA, however, is, as has been noted, rather
    bloated. So I don't see it as at all unreasonable to divide the
    architecture into a basic architecture, which all programs can
    expect to have available, and specialized features which are
    only present on special-purpose implementations.

    Just as you wouldn't write code for a Cray, or a TMS320C6000,
    and expect it to run on an IBM 360, the special-purpose
    implementations of Concertina II are different enough in
    function that they should be regarded as different machines -
    even if they are also able to run standard programs for the
    architecture.

    Well, you spend a lot of effort to put all those instructions
    into a single instruction set. But then you split it into
    mutually incompatible subsets. It looks like designing
    your instruction subsets as semi-independent would allow a better
    design for each individual subset. What is the gain from a
    single unified instruction set if you do not want to
    implement all of it, but only subsets?

    --
    Waldek Hebisch

  • From Stephen Fuld@21:1/5 to John Savard on Fri Jul 18 08:46:43 2025
    On 7/17/2025 1:32 PM, John Savard wrote:
    On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    I had thought the _current_ conversation was about how context switching
    was made more painful by my additional sets of 128 registers.

    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    Although I gave a flip, and snarky, response to this before, I got to
    thinking about why it is true. I came up with the following, though it
    is a quick and dirty answer and I welcome others' comments and corrections.

    First of all, it isn't true. Going back to the mainframe era, AFAIK, of
    the major manufacturers, only IBM (S/360) had separate FP registers,
    Univac, CDC and both Burroughs architectures did not.

    I posit that the driving factor in the decision to have separate FP
    registers was the decision to make FP instructions an optional feature,
    i.e. an optional feature in S/360, part of the basic architecture in the
    others. Apparently, adding FP operations as a separate feature made
    using the existing registers just too hard to implement.

    I can't comment on the mini-computer era, as I don't know enough about
    the various architectures and marketing strategies. But in the early
    microprocessor era, it was clear that due to chip limitations, FP had to
    be on a separate chip, e.g. the 8087, and the cost of crossing a chip
    boundary several times for each FP instruction, which would have been
    needed if there were no separate on-FPU registers, would not have been
    practical. Backward compatibility led to this decision being
    promulgated to future generations. I guess Intel could have changed
    when they added the non 8087 FP instructions, but by then the mind was
    set. Also, the X86 had fewer "normal" registers, so the argument about
    more registers had some weight. With new, clean sheet designs this is
    much less of an issue. And even for clean sheet designs, if it is
    desired to have FP an optional feature, that feature would not be a
    separate chip, thus eliminating that motivation for separate registers.

    So I believe that the arguments for separate FP registers, while once
    valid, are no longer so.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 16:10:29 2025
    MitchAlsup1 <[email protected]> wrote:
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    Well, then, the extended register banks with 128 registers in them...

    AMD 29K ?!?

    are there to be used by programs intended to run on chips that don't
    have OoO. If an implementation does have OoO, nothing is to be gained
    by bothering with those registers (which still won't have rename
    registers associated with them, even in an OoO implementation; so
    OoO won't work on the parts of the program that use them).

    Somewhat right:: given a pool of 128 (or 256) rename registers,
    one can make even an 8-register machine run fast and rather
    well. The thing is that a 32-register ISA has a 15%-18% speed
    advantage over an 8-register machine--whereas a 64 register
    machine only has a 3% speed advantage over a 32 register
    machine (MIPS circa 1982). At some point not having enough
    registers hurts (that somewhere is in the mid 20-s of registers)
    and at some point the size of the RF limits read performance
    (that somewhere is between 32 registers and 64 registers).

    Hmm, IIUC for read performance what matters is the physical register
    file size. So if a machine has 128 physical registers and 64 architectural
    registers, read access time should be essentially the same as on a
    machine which has 128 physical registers and 8 architectural
    registers. So with renaming, the cost of a large architectural
    register set is in a different place than access to the register
    file. I guess that one cost of a larger register set is the
    space taken by register specifiers in instructions.
    There is a cost in the renamer (but it is not clear to me how
    significant this is). With a large architectural register
    set there may be more pressure on physical registers, because
    an architectural register may keep a dead value longer without
    the CPU knowing this. There is a cost for context switches and
    possibly for function calls (more potentially live values
    which need saving).
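
    As a rough illustration of the encoding cost mentioned above, here is
    a small Python sketch that counts the bits an instruction spends on
    register specifiers. The three-operand format and the function name
    are assumptions made for the example, not taken from any particular ISA.

```python
def specifier_bits(arch_regs: int, operands: int = 3) -> int:
    """Bits spent on register fields in one instruction.

    Assumes every operand field is wide enough to name any of the
    `arch_regs` architectural registers (hypothetical fixed format).
    """
    bits_per_field = (arch_regs - 1).bit_length()
    return operands * bits_per_field


if __name__ == "__main__":
    for n in (8, 32, 64):
        print(f"{n:3d} architectural registers: {specifier_bits(n)} bits")
```

    Under these assumptions, 8, 32 and 64 architectural registers cost 9,
    15 and 18 specifier bits per three-operand instruction. The physical
    register count does not appear in the calculation at all.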

    --
    Waldek Hebisch

  • From EricP@21:1/5 to All on Fri Jul 18 12:29:18 2025
    MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 16:28:56 +0000, Anton Ertl wrote:

    John Savard <[email protected]d> writes:
    With this definition of the typical implementation in each size
    class of OoO, one can construct a mythical history of sorts.

    In "the beginning", CISC chips had normal OoO, and RISC chips
    did not have OoO.

    Very mythical.

    No VAX (*the* CISC at the time of RISC development) implementation had
    OoO, ever.

    No IA-32 (CISC) implementation had OoO up to October 31, 1995, i.e.,
    during the first 10 years of IA-32's existence.

    HP-PA (RISC) had an OoO implementation since November 2, 1995, i.e.,
    one day after IA-32.

    MIPS (RISC) had an OoO implementation since January 1996, i.e., two
    months after IA-32.

    Since the CISC chips had register files of 8
    registers,

    VAX (CISC) has 15 GPRs. S/360 (CISC) has 16 GPRs.

    and the RISC chips had register files of 32 registers,

    ARM A32 has 16 registers.

    (Given cache misses,
    maybe the RISC chips still needed scoreboards.)

    Apart from MIPS R2000/R3000, every RISC has waited for results to
    become ready. In the beginning stopping the pipeline was a way to do
    this, but once more silicon became available and performance demands
    increased, other instructions often continued as far as possible.
    That was often called scoreboarding, but Mitch Alsup tells us that
    scoreboarding on the 66000 was something more sophisticated.

    CDC 6600 (no third 0) had a SB that scheduled the start of an
    instruction (at register read) and then later scheduled the completion
    of an instruction (register write) in a way that prevented RAW, WAR,
    and WAW hazards.

    CDC6600 scoreboard gets rid of them by serializing conflicting register
    accesses to the same register. SB ordering respects dataflow dependencies
    not program ordering and so does not inherently enforce precise exceptions.

    6600 has no OoO load/store queue - its "stunt box" orders memory ops to
    the same location but it only works after addresses have been written
    into the 8 A registers. Since loading A registers would be data flow
    dependent (as decided by the scoreboard) and not program order dependent
    (as decided by an LSQ) I suspect one could break its memory ordering model
    by simply causing loads of the same address into different A registers
    to occur out of order with different address calculation delays.

    Renaming gets rid of the xAW hazards in SBs, Reservation Stations,
    Dispatch stacks, and others.

    Rename with a separate Retire allows accesses to be performed concurrently
    in any order while respecting precise exceptions.
    LSQ ensures memory ops appear in program order.
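
    To make the renaming point concrete, here is a minimal Python sketch.
    It is a toy model, not a description of any real pipeline: the
    instruction format `(dest, (src, src))`, the function names and the
    example program are all assumptions for illustration. The hazard
    check is deliberately simple and flags every pair of instructions that
    name the same register, without looking for intervening writes.

```python
from itertools import count


def rename(program, num_arch_regs):
    """Give every destination a fresh physical register (toy model)."""
    fresh = count(num_arch_regs)                      # unused physical register numbers
    mapping = {r: r for r in range(num_arch_regs)}    # architectural -> physical
    renamed = []
    for dest, srcs in program:
        phys_srcs = tuple(mapping[s] for s in srcs)   # read mappings before updating dest
        mapping[dest] = next(fresh)
        renamed.append((mapping[dest], phys_srcs))
    return renamed


def hazards(program):
    """Count register-name hazards between each earlier/later instruction pair."""
    found = {"RAW": 0, "WAR": 0, "WAW": 0}
    for i, (d1, s1) in enumerate(program):
        for d2, s2 in program[i + 1:]:
            if d1 in s2:
                found["RAW"] += 1
            if d2 in s1:
                found["WAR"] += 1
            if d1 == d2:
                found["WAW"] += 1
    return found


# Hypothetical four-instruction program on an 8-register machine.
program = [(1, (2, 3)), (2, (1, 4)), (1, (5, 6)), (4, (1, 2))]

if __name__ == "__main__":
    print("before:", hazards(program))
    print("after: ", hazards(rename(program, 8)))
```

    In this example the original program has WAR and WAW conflicts on
    registers 1, 2 and 4; after renaming only the true RAW dependencies
    remain. Retirement and precise exceptions are not modelled.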

  • From Waldek Hebisch@21:1/5 to [email protected] on Fri Jul 18 16:37:54 2025
    MitchAlsup1 <[email protected]> wrote:

    Point of order:: all register files that have the same width (64-bits)
    should be a single file. This makes varargs easier, allows using integer
    operations on FP operands (extract exponent, insert exponent, copysign)
    which are mandated by the standards. Either you have an integer set of
    registers and a FP set of registers and a nearly complete set of integer
    operations on FP registers, or you can dispense with the nonsense and
    have a single general purpose register file.

    I have evidence (data) indicating My 66000 with only 32-registers
    AND universal constants needs fewer registers than RISC-V with
    32 integer and 32 FP registers on many applications, including
    those you think need 32+32.

    Does this matter much for a machine doing register renaming? IIUC a
    machine may have separate architectural register sets but the
    renamer can allocate registers from a single set. And the opposite:
    a machine may have a unified architectural register set but the renamer
    may allocate from separate sets, depending on the instruction.

    --
    Waldek Hebisch

  • From Scott Lurndal@21:1/5 to Stephen Fuld on Fri Jul 18 16:25:37 2025
    Stephen Fuld <[email protected]d> writes:
    On 7/17/2025 1:32 PM, John Savard wrote:
    On Thu, 17 Jul 2025 12:20:18 -0700, Stephen Fuld wrote:

    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    I had thought the _current_ conversation was about how context switching
    was made more painful by my additional sets of 128 registers.

    Having separate integer and floating-point registers?

    - that is what nearly everyone else does;

    Although I gave a flip, and snarky, response to this before, I got to
    thinking about why it is true. I came up with the following, though it
    is a quick and dirty answer and I welcome others' comments and corrections.

    First of all, it isn't true. Going back to the mainframe era, AFAIK, of
    the major manufacturers, only IBM (S/360) had separate FP registers,
    Univac, CDC and both Burroughs architectures did not.

    Burroughs medium systems had an accumulator register for floating
    point[*]. non-floating point arithmetic was done memory-to-memory
    without registers.

    [*] Starting with the B4700. The B3500 supported memory-to-memory
    floating point operations with up to 100 digits of mantissa (2 digit
    exponent). The B4700 accumulator supported 20 digit mantissa with
    2-digit exponent, which was present in every subsequent generation
    through the V560.

    Burroughs large systems held all operands on the stack (although
    there were internal "registers" for top few stack elements).

    There were other Burroughs processors (B1900, 220, B300, B800, B900, et
    alia), but I'm not familiar with the details thereof, although I believe
    the B900 series supported floating point. The B1900 had writable control
    store, which could be swapped at context switch, so each supported
    language had its own instruction set.

  • From John Savard@21:1/5 to Waldek Hebisch on Fri Jul 18 17:08:57 2025
    On Fri, 18 Jul 2025 15:39:58 +0000, Waldek Hebisch wrote:

    What is the gain from a single unified instruction set if you do not
    want to implement all of it, but only subsets?

    This is a valid point, but then I need to clarify one important thing.

    I don't want to implement _only_ subsets. Implementing a subset is simply
    an option. A very useful option for many applications. But implementing
    the whole instruction set for a desktop PC chip is appropriate.

    Also, if only subsets were implemented, programs written in the basic
    instruction set common to all the subsets would still run on all the
    chips, so that means that there is a gain even in the "only subsets" case.

    John Savard

  • From John Savard@21:1/5 to Stephen Fuld on Fri Jul 18 17:14:08 2025
    On Fri, 18 Jul 2025 08:46:43 -0700, Stephen Fuld wrote:

    I posit that the driving factor in the decision to have separate FP
    registers was the decision to make FP instructions an optional feature,
    i.e. an optional feature in S/360, part of the basic architecture in the
    others. Apparently, adding FP operations as a separate feature made
    using the existing registers just too hard to implement.

    And assuming that to be the case, there's a smoking gun in the original
    System/360 architecture.

    The original System/360 had only four floating-point registers. These
    registers weren't numbered 0, 1, 2, and 3... and they weren't numbered
    0, 4, 8 and 12 either.

    They were numbered 0, 2, 4 and 6.

    The floating-point registers on the System/360 were 64 bits long, while
    the general registers were 32 bits long.

    This could suggest that the decision to make the floating-point unit
    an option, with its own set of registers, instead of using pairs of
    general registers for floating-point numbers, came along late in the
    design process in that case.
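
    The arithmetic behind that conjecture can be sketched in a few lines
    of Python. This only illustrates the hypothesis above, that 64-bit
    floating-point values might once have been meant to occupy pairs of
    32-bit general registers; the function name is made up for the example.

```python
def paired_register_numbers(value_bits: int, gpr_bits: int, how_many: int):
    """Starting register numbers if each value spans consecutive GPRs."""
    regs_per_value = value_bits // gpr_bits
    return [i * regs_per_value for i in range(how_many)]


if __name__ == "__main__":
    # Four 64-bit values laid over 32-bit general registers.
    print(paired_register_numbers(64, 32, 4))
```

    With 64-bit values and 32-bit registers each value starts on an even
    register number, which matches the 0, 2, 4, 6 numbering noted above.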

    I can't comment on the mini-computer era, as I don't know enough about
    the various architectures and marketing strategies.

    Well, since minicomputers are smaller and cheaper, nearly all of them
    only had floating-point as an optional feature, except perhaps for
    some machines classed as superminis.

    John Savard

  • From Anton Ertl@21:1/5 to Stephen Fuld on Fri Jul 18 16:24:34 2025
    Stephen Fuld <[email protected]d> writes:
    On 7/17/2025 1:32 PM, John Savard wrote:
    First of all, it isn't true. Going back to the mainframe era, AFAIK, of
    the major manufacturers, only IBM (S/360) had separate FP registers,
    Univac, CDC and both Burroughs architectures did not.

    In the CDC 6600, the A and B registers correspond to GPRs (they
    support addresses), while the X registers correspond to FP registers;
    they may also support integer operations, but not addresses.

    I posit that the driving factor in the decision to have separate FP
    registers was the decision to make FP instructions an optional feature,
    i.e. an optional feature in S/360, part of the basic architecture in the
    others. Apparently, adding FP operations as a separate feature made
    using the existing registers just too hard to implement.

    I can't comment on the mini-computer era, as I don't know enough about
    the various architectures and marketing strategies. But in the early
    microprocessor era, it was clear that due to chip limitations, FP had to
    be on a separate chip e.g. 8087, and the cost of crossing a chip
    boundary several times for each FP instruction, which would have been
    needed if there were no on FPU separate registers, would not have been
    practical.

    The 8087 was so slow that the cost of moving stuff over would have
    been only a small fraction of the total time. However, the 8086 has 8
    16-bit registers, not enough to hold even two 80-bit numbers for the
    8087.
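
    The capacity argument is simple arithmetic, spelled out here as a
    short Python sketch. The register count and widths are the ones
    stated above; the variable names are just for illustration.

```python
GPR_COUNT = 8      # 8086 general registers
GPR_BITS = 16      # width of each 8086 register
X87_BITS = 80      # width of one 8087 extended-precision number

total_gpr_bits = GPR_COUNT * GPR_BITS          # storage in the whole 8086 register set
needed_for_two = 2 * X87_BITS                  # storage for two 8087 operands
whole_values_that_fit = total_gpr_bits // X87_BITS

if __name__ == "__main__":
    print(f"8086 registers hold {total_gpr_bits} bits in total")
    print(f"two 80-bit operands need {needed_for_two} bits")
    print(f"whole 80-bit values that fit: {whole_values_that_fit}")
```

    The entire 8086 register set holds 128 bits, while two 80-bit
    operands need 160, so only one such value would fit even if every
    register were given over to it.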

    Backward compatibility led to this decision being
    promulgated to future generations. I guess Intel could have changed
    when they added the non 8087 FP instructions, but by then the mind was
    set.

    You mean SSE and SSE2? Note that the XMM registers of SSEx are 128
    bits in size, while the GPRs of IA-32 are 32 bits in size. And they
    also support integer operations, but not addresses.

    But yes, they could have expanded the GPRs to 128 bits, and let the
    SSE and SSE2 instructions operate on these registers.

    I think there are several reasons for having separate XMM registers:

    1) Less register pressure in code that uses SSEx instructions. And
    IA-32 does not have that many register names.

    2) The XMM registers and the FPUs can be located separately.

    3) Fewer register ports needed on each register file on superscalar
    implementations (i.e., all of them).

    Yes, it has its costs in having to partially duplicate some integer
    FUs, but they obviously thought that the benefits are worth it.

    Interestingly, XMM (128 bit), YMM (256 bit), and ZMM (512 bit)
    registers are not separated from each other.

    Let's look at some other cases:

    PA-RISC: FPU is not optional AFAIK, and integer multiplication at
    least in early implementations uses the FP multiplier (like on
    Willamette). PA-RISC started with a separate FP register set and soon
    extended it to 58 registers or so.

    88000: Started out with a unified register set (with FP doubles
    represented by two 32-bit registers), and acquired a separate FP
    register set with 80-bit registers in its second implementation 88110.

    Power: Separate registers from the start. This may also have to do
    with the first implementation being in several chips, with the FPU in
    one chip, and the FXU (integer and load/store) in another chip.

    Alpha: FP never was optional. Separate registers from the start, even
    though its predecessor VAX has unified registers.

    With new, clean sheet designs this is
    much less of an issue. And even for clean sheet designs, if it is
    desired to have FP an optional feature, that feature would not be a
    separate chip, thus eliminating that motivation for separate registers

    And yet Alpha and RISC-V were designed with separate FP registers.

    One thing is interesting about IA-32/AMD64 FP, and likewise on ARM
    A32/T32 and ARM A64: The FP instruction sets of each are present in
    the respective 32-bit and 64-bit instruction sets, which in case of
    ARM differs a lot from the 32-bit instruction set.

    So I believe that the arguments for separate FP registers, while once
    valid, are no longer so.

    I think that

    1) the register pressure issue (for a given number of encoding bits)
    is still valid.

    2) Not sure if distance on the chip has become more or less of a
    problem in the last years.

    3) Register ports supposedly are still at a premium.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to Stephen Fuld on Fri Jul 18 14:07:49 2025
    Stephen Fuld wrote:
    On 7/17/2025 11:03 AM, John Savard wrote:
    On Thu, 17 Jul 2025 14:49:16 +0000, MitchAlsup1 wrote:
    On Thu, 17 Jul 2025 3:42:08 +0000, John Savard wrote:

    You could have an operating system that neglects to save certain
    register files on interrupts, which means programs can't use them.
    (There's a precedent: the Commodore 64 didn't save the status bit for
    decimal mode,
    so user programs couldn't use that feature of the 6502.)

    Any system that is not completely transparent to interrupts is a pain in
    the a$$ for user applications.

    In no way am I denying this.

    As you noted that having multiple register files makes context switching
    slower, though, I was simply noting that... one can always just ignore
    the extra register files. Making them useless, by not saving them
    and restoring them in the interrupt handler, forces user programs to
    avoid using those registers. And if they can't use them, they don't have
    to save and restore them, and so context switching is speeded up!

    John,
    Mitch has made the case for a combined integer/floating point register
    set. What is your case for having them separate?

    Note that the unified or separate architecture ISA registers and
    unified or separate physical registers can be turned into each other
    at the uArch level.

    It can have two rename banks for float and int but allocate physical
    registers from a single PRF, or a unified rename and allocate float
    instruction destination registers from a float PRF,
    and integer instructions from int PRF,
    and record in the unified rename which PRF each Rn resides in.

    One difference is if FP128 data items or larger are supported.
    The unified PRF would use 128 or more bits to store 64 bit ints
    so a large fraction of the PRF could be wasted.
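
    One of the two arrangements described above, a unified rename table
    that records which physical register file each Rn currently lives
    in, might be sketched like this in Python. Everything here (class
    name, method names, free-list policy, initial placement in the int
    PRF) is an assumption for illustration; freeing the old register at
    retire and stalling on an empty free list are not modelled.

```python
class UnifiedRename:
    """Unified architectural registers backed by separate int and FP PRFs."""

    def __init__(self, num_arch, int_prf_size, fp_prf_size):
        self.free = {
            "int": list(range(int_prf_size)),
            "fp": list(range(fp_prf_size)),
        }
        # Assumption: every architectural register starts out in the int PRF.
        self.table = {r: ("int", self.free["int"].pop(0)) for r in range(num_arch)}

    def write(self, arch_reg, is_float):
        """Allocate a destination from the PRF chosen by the instruction type."""
        bank = "fp" if is_float else "int"
        old = self.table[arch_reg]
        new = (bank, self.free[bank].pop(0))
        self.table[arch_reg] = new
        return new, old          # `old` would be freed when the writer retires

    def read(self, arch_reg):
        """Return (bank, physical index) where the register currently resides."""
        return self.table[arch_reg]


if __name__ == "__main__":
    rn = UnifiedRename(num_arch=4, int_prf_size=8, fp_prf_size=8)
    print(rn.write(2, is_float=True))   # R2 now lives in the FP PRF
    print(rn.read(2))
```

    The point of the sketch is only that a source read has to consult the
    table to learn which file to access. The FP128 concern is separate:
    if a single unified PRF had 128-bit entries, a 64-bit integer would
    occupy only half of each entry it uses.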

  • From Stephen Fuld@21:1/5 to Anton Ertl on Fri Jul 18 11:40:30 2025
    On 7/18/2025 9:24 AM, Anton Ertl wrote:
    Stephen Fuld <[email protected]d> writes:
    On 7/17/2025 1:32 PM, John Savard wrote:
    First of all, it isn't true. Going back to the mainframe era, AFAIK, of
    the major manufacturers, only IBM (S/360) had separate FP registers,
    Univac, CDC and both Burroughs architectures did not.

    In the CDC 6600, the A and B registers correspond to GPRs (they
    support addresses), while the X registers correspond to FP registers;
    they may also support integer operations, but not addresses.

    Well, a different split of functions. That is essentially like the
    Univac 1108, which had arithmetic registers that supported integer and
    FP operations, and index registers for memory addressing. Isn't that
    sort of like the Mot 68000? I do believe that CDC's X registers support
    a full complement of integer operations.

    big snip of good points about Intel's various SIMD register implementations.

    Let's look at some other cases:

    PA-RISC: FPU is not optional AFAIK, and integer multiplication at
    least in early implementations uses the FP multiplier (like on
    Willamette). PA-RISC started with a separate FP register set and soon
    extended it to 58 registers or so.

    88000: Started out with a unified register set (with FP doubles
    represented by two 32-bit registers), and acquired a separate FP
    register set with 80-bit registers in its second implementation 88110.

    Power: Separate registers from the start. This may also have to do
    with the first implementation being in several chips, with the FPU in
    one chip, and the FXU (integer and load/store) in another chip.

    Alpha: FP never was optional. Separate registers from the start, even
    though its predecessor VAX has unified registers.

    Lots of good examples presenting counter evidence to my proposal. I
    think it would be interesting to understand the motivations for some of
    these.


    So I believe that the arguments for separate FP registers, while once
    valid, are no longer so.

    I think that

    1) the register pressure issue (for a given number of encoding bits)
    is still valid.

    Clearly this is a function of how many registers (and consequently
    encoding bits) you have in the base design. If you only have 8, clearly
    so. If you have 64, almost certainly not. The evidence Mitch presented
    showed that 32 is about the knee of the curve.

    2) Not sure if distance on the chip has become more or less of a
    problem in the last years.

    3) Register ports supposedly are still at a premium.

    Good question, and beyond my abilities to answer.

    Thanks, Anton. It seems I need more information, particularly about your
    counter examples.



    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to John Savard on Fri Jul 18 19:39:43 2025
    On Fri, 18 Jul 2025 17:14:08 +0000, John Savard wrote:

    On Fri, 18 Jul 2025 08:46:43 -0700, Stephen Fuld wrote:

    I posit that the driving factor in the decision to have separate FP
    registers was the decision to make FP instructions an optional feature,
    i.e. an optional feature in S/360, part of the basic architecture in the
    others. Apparently, adding FP operations as a separate feature made
    using the existing registers just too hard to implement.

    And assuming that to be the case, there's a smoking gun in the original
    System/360 architecture.

    The original System/360 had only four floating-point registers. These
    registers weren't numbered 0, 1, 2, and 3... and they weren't numbered
    0, 4, 8 and 12 either.

    They were numbered 0, 2, 4 and 6.

    The floating-point registers on the System/360 were 64 bits long, while
    the general registers were 32 bits long.

    This could suggest that the decision to make the floating-point unit
    an option, with its own set of registers, instead of using pairs of
    general registers for floating-point numbers, came along late in the
    design process in that case.

    I suggest emulation.

    I can't comment on the mini-computer era, as I don't know enough about
    the various architectures and marketing strategies.

    Well, since minicomputers are smaller and cheaper, nearly all of them
    only had floating-point as an optional feature, except perhaps for
    some machines classed as superminis.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Fri Jul 18 19:38:03 2025
    On Fri, 18 Jul 2025 17:08:57 +0000, John Savard wrote:

    On Fri, 18 Jul 2025 15:39:58 +0000, Waldek Hebisch wrote:

    What
    is a gain from single unified instruction set if you do not want to
    implement all of it, but only subsets?

    Having a complete ISA and then allowing a few (say 3) subsets prevents
    the various subsets from preventing all instructions to be encoded
    (or decoded) simultaneously--something it seems RISC-V is trying to
    clean up now.

    This is a valid point, but then I need to clarify one important thing.

    I don't want to implement _only_ subsets. Implementing a subset is simply
    an option. A very useful option for many applications. But implementing
    the whole instruction set for a desktop PC chip is appropriate.

    Implementing integer-only with no IMUL/IDIV makes for a tiny controller
    CPU that can do pointer chasing but not array indexing. Whether you want
    something like this (or not) is the implementer's choice. HLL code written
    in such a way that does not need/use IMUL/IDIV or FP,... will compile
    and run just fine (preserving the software hierarchy).

    Also, if only subsets were implemented, programs written in the basic
    instruction set common to all the subsets would still run on all the
    chips, so that means that there is a gain even in the "only subsets" case.

    John Savard

  • From MitchAlsup1@21:1/5 to Anton Ertl on Fri Jul 18 19:45:19 2025
    On Fri, 18 Jul 2025 16:24:34 +0000, Anton Ertl wrote:

    Stephen Fuld <[email protected]d> writes:
    On 7/17/2025 1:32 PM, John Savard wrote:
    First of all, it isn't true. Going back to the mainframe era, AFAIK, of
    the major manufacturers, only IBM (S/360) had separate FP registers,
    Univac, CDC and both Burroughs architectures did not.

    In the CDC 6600, the A and B registers correspond to GPRs (they
    support addresses), while the X registers correspond to FP registers;
    they may also support integer operations, but not addresses.

    I posit that the driving factor in the decision to have separate FP
    registers was the decision to make FP instructions an optional feature,
    i.e. an optional feature in S/360, part of the basic architecture in the
    others. Apparently, adding FP operations as a separate feature made
    using the existing registers just too hard to implement.

    I can't comment on the mini-computer era, as I don't know enough about
    the various architectures and marketing strategies. But in the early
    microprocessor era, it was clear that due to chip limitations, FP had to
    be on a separate chip e.g. 8087, and the cost of crossing a chip
    boundary several times for each FP instruction, which would have been
    needed if there were no on FPU separate registers, would not have been
    practical.

    The 8087 was so slow that the cost of moving stuff over would have
    been only a small fraction of the total time. However, the 8086 has 8
    16-bit registers, not enough to hold even two 80-bit numbers for the
    8087.

    Backward compatibility led to this decision being
    promulgated to future generations. I guess Intel could have changed
    when they added the non 8087 FP instructions, but by then the mind was
    set.

    You mean SSE and SSE2? Note that the XMM registers of SSEx are 128
    bits in size, while the GPRs of IA-32 are 32 bits in size. And they
    also support integer operations, but not addresses.

    But yes, they could have expanded the GPRs to 128 bits, and let the
    SSE and SSE2 instructions operate on these registers.

    I think there are several reasons for having separate XMM registers:

    1) Less register pressure in code that uses SSEx instructions. And
    IA-32 does not have that many register names.

    2) The XMM registers and the FPUs can be located separately.

    3) Fewer register ports needed on each register file on superscalar implementations (i.e., all of them).

    Yes, it has its costs in having to partially duplicate some integer
    FUs, but they obviously thought that the benefits are worth it.

    Interestingly, XMM (128 bit), YMM (256 bit), and ZMM (512 bit)
    registers are not separated from each other.

    Let's look at some other cases:

    PA-RISC: FPU is not optional AFAIK, and integer multiplication at
    least in early implementations uses the FP multiplier (like on
    Willamette). PA-RISC started with a separate FP register set and soon extended it to 58 registers or so.

    88000: Started out with a unified register set (with FP doubles
    represented by two 32-bit registers), and acquired a separate FP
    register set with 80-bit registers in its second implementation 88110.

    This was MANDATED by Apple who then 1 month later jumped ship to PPC.

    Power: Separate registers from the start. This may also have to do
    with the first implementation being in several chips, with the FPU in
    one chip, and the FXU (integer and load/store) in another chip.

    Alpha: FP never was optional. Separate registers from the start, even
    though its predecessor VAX has unified registers.

    With new, clean sheet designs this is
    much less of an issue. And even for clean sheet designs, if it is
    desired to have FP an optional feature, that feature would not be a
    separate chip, thus eliminating that motivation for separate registers.

    And yet Alpha and RISC-V were designed with separate FP registers.

    One thing is interesting about IA-32/AMD64 FP, and likewise about ARM
    A32/T32 and ARM A64: the FP instruction sets of each are present in
    both the respective 32-bit and 64-bit instruction sets, even though in
    the case of ARM the 64-bit instruction set differs a lot from the
    32-bit instruction set.

    So I believe that the arguments for separate FP registers, while once
    valid, are no longer so.

    I think that

    1) the register pressure issue (for a given number of encoding bits)
    is still valid.

    Then so is reserving registers for the dynamic linker to use,
    reserving R0 as 0x0, placing IP in a GPR, ... mandating IMUL
    produce a double wide result ...

    2) Not sure if distance on the chip has become more or less of a
    problem in the last years.

    It is the product of need and frequency. If need is high and frequency
    is high, it's a big problem. If either is moderate, the problem
    basically vanishes.

    3) Register ports supposedly are still at a premium.

    A GBOoO machine gets 75%+ of its operands from the forwarding path...
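
    A back-of-the-envelope sketch of why that matters for register ports.
    Only the 75% figure comes from the post above; the issue width and the
    operands-per-instruction count are assumptions picked for illustration,
    and the function name is hypothetical. Note that this gives an average
    read rate, not the peak demand that port counts must actually cover.

```python
def avg_regfile_reads_per_cycle(issue_width, operands_per_instr,
                                forwarded_fraction):
    """Average register-file reads per cycle once forwarding is counted.

    Operands captured from the forwarding path never touch a read port,
    so only the remaining fraction is charged to the register file.
    """
    if not 0.0 <= forwarded_fraction <= 1.0:
        raise ValueError("forwarded_fraction must be between 0 and 1")
    total_operands = issue_width * operands_per_instr
    return total_operands * (1.0 - forwarded_fraction)


if __name__ == "__main__":
    # Assumed machine: 6 instructions per cycle, 2 source operands each.
    no_forwarding = avg_regfile_reads_per_cycle(6, 2, 0.0)
    with_forwarding = avg_regfile_reads_per_cycle(6, 2, 0.75)
    print("reads per cycle without forwarding:", no_forwarding)
    print("reads per cycle with 75% forwarded:", with_forwarding)
```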

    - anton

  • From Thomas Koenig@21:1/5 to John Savard on Fri Jul 18 19:47:49 2025
    John Savard <[email protected]d> schrieb:

    The very reason the CRAY-I succeeded where the STAR-100 failed was that
    the CRAY-I was built around vector registers, and it did its
    arithmetic between those vector registers, only loading data into
    them from, and writing data out from them to, the main memory.

    "The Seymour Cray Era of Supercomputers" attributes this
    in large parts to the fast scalar units of the Cray-1.

  • From John Savard@21:1/5 to Thomas Koenig on Sat Jul 19 02:51:08 2025
    On Fri, 18 Jul 2025 19:47:49 +0000, Thomas Koenig wrote:

    John Savard <[email protected]d> schrieb:

    The very reason the CRAY-I succeeded where the STAR-100 failed was that
    the CRAY-I was built around vector registers, and it did its arithmetic
    between those vector registers, only loading data into them from, and
    writing data out from them to, the main memory.

    "The Seymour Cray Era of Supercomputers" attributes this in large parts
    to the fast scalar units of the Cray-1.

    Yes, the fact that Cray knew Amdahl's Law, and realized that real-world programs weren't 100% vector, and so his machine would have to have fast
    scalar units is another reason his machines were very successful. In fact,
    I noted this in another post in this very thread, but it's such a gigantic thread that I can't blame you for not having read every post in it.
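
    As a rough illustration of the Amdahl's Law point: if a fraction p of
    the run time is vectorizable and sped up by a factor s_v, while the
    remaining scalar part is sped up by s_s, the overall speedup is
    1 / ((1 - p)/s_s + p/s_v). The sketch below evaluates this. The
    function name and the sample numbers are made up for illustration and
    are not measurements of any Cray or CDC machine.

```python
def amdahl_speedup(vector_fraction, vector_speedup, scalar_speedup=1.0):
    """Overall speedup when the vector and scalar parts speed up differently.

    vector_fraction: share of the original run time that is vectorizable.
    vector_speedup:  factor by which the vectorizable part gets faster.
    scalar_speedup:  factor by which the remaining scalar part gets faster.
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0 or scalar_speedup <= 0:
        raise ValueError("speedup factors must be positive")
    scalar_fraction = 1.0 - vector_fraction
    new_time = (scalar_fraction / scalar_speedup
                + vector_fraction / vector_speedup)
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical program: 80% vectorizable, vector unit 10x faster.
    print("slow scalar unit:", round(amdahl_speedup(0.8, 10.0), 2))
    # Same program, but the scalar part also runs 2x faster.
    print("fast scalar unit:", round(amdahl_speedup(0.8, 10.0, 2.0), 2))
```

    With these made-up numbers, doubling scalar speed helps overall run
    time more than making the vector unit infinitely fast would, since the
    latter is capped at 1 / (1 - p) = 5x.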

    Here, though, I focused on the vector registers, because they affected
    memory bandwidth.

    John Savard
