• Making Lemonade (Floating-point format changes)

    From John Savard@21:1/5 to All on Sat May 11 21:44:45 2024
    I've made another long-overdue change in the Concertina II
    architecture on the page about 17-bit instructions.

    Since I describe the individual instructions there, with their opcodes
    and what they do, I've illustrated the floating-point formats of the
    architecture on that page.

    The good people in charge of the IEEE 754 standard had seen fit to
    define a standard 128-bit floating-point format which included a
    hidden first bit.

    This annoyed me greatly, because I was going to take the 8087's
    temporary real format, and extend the mantissa for my 128-bit format.

    I've decided that it's necessary to fully accept the 128-bit standard
    and support it in a consistent manner.

    Therefore, I have taken the following actions:

    I have dropped the option of supporting 80-bit temporary reals
    entirely, as they are now incompatible as an internal format.

    I have instead defined a 256-bit format for floats which does not have
    a hidden first bit, which looks like the old temporary reals, except
    that the exponent field is one bit wider.

    And in addition, just as the IBM 704 used two single-precision floats
    to make a double-precision float, and the IBM System/360 Model 85
    started using two double-precision floats to make an extended
    precision float... I've defined how the 256-bit internal format floats
    can be doubled up to make a 512-bit float.

    I'm not really sure such floating-point precision is useful, but I do
    remember some people telling me that higher float precision is indeed
    something to be desired. Well, the IEEE 754 standard has forced my
    hand.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Thomas Koenig@21:1/5 to John Savard on Sun May 12 13:46:28 2024
    John Savard <[email protected]d> schrieb:

    I have instead defined a 256-bit format for floats which does not have
    a hidden first bit, which looks like the old temporary reals, except
    that the exponent field is one bit wider.

    Why not the IEEE binary256 (interchange) format?

    https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

    [...]

    I've defined how the 256-bit internal format floats
    can be doubled up to make a 512-bit float.

    Such floating point formats have very strange properties.
    For example, try defining epsilon so that 1.0+epsilon is the
    smallest number larger than 1.0...

    IBM just spent a lot of effort to move away from that for POWER.
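
    To make the epsilon remark concrete, here is a minimal Python sketch.
    It assumes a doubled-up value is stored as the unevaluated sum hi + lo
    of two binary64 numbers (the classic "double-double" arrangement); the
    pair layout and the name dd_value are illustrative assumptions, not the
    Concertina II definition.

```python
from fractions import Fraction
import sys

def dd_value(hi: float, lo: float) -> Fraction:
    """Exact value of a doubled-up number stored as the unevaluated sum hi + lo."""
    return Fraction(hi) + Fraction(lo)

one = dd_value(1.0, 0.0)

# A fixed 106-bit significand would suggest epsilon = 2**-105 ...
nominal_eps = Fraction(1, 2**105)

# ... but the low half carries its own exponent, so far smaller
# increments above 1.0 are representable as well.
above_one = dd_value(1.0, 2.0 ** -1000)

# With normal numbers only, the increment can shrink down to the
# smallest normal binary64 value.
smallest = dd_value(1.0, sys.float_info.min)

print(above_one > one)                  # still strictly larger than 1.0
print(above_one - one < nominal_eps)    # yet closer to 1.0 than the nominal epsilon
print(float(smallest - one))
```

    So "the smallest number larger than 1.0" depends on the exponent range
    of the low half rather than on a fixed significand width.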

    I'm not really sure such floating-point precision is useful, but I do
    remember some people telling me that higher float precision is indeed
    something to be desired. Well, the IEEE 754 standard has forced my
    hand.

    How so?

  • From wolfgang kern@21:1/5 to John Savard on Sun May 12 15:30:40 2024
    On 12/05/2024 05:44, John Savard wrote:
    I've made another long-overdue change in the Concertina II
    architecture on the page about 17-bit instructions.

    Since I describe the individual instructions there, with their opcodes
    and what they do, I've illustrated the floating-point formats of the
    architecture on that page.

    The good people in charge of the IEEE 754 standard had seen fit to
    define a standard 128-bit floating-point format which included a
    hidden first bit.

    This annoyed me greatly, because I was going to take the 8087's
    temporary real format, and extend the mantissa for my 128-bit format.

    I've decided that it's necessary to fully accept the 128-bit standard
    and support it in a consistent manner.

    Therefore, I have taken the following actions:

    I have dropped the option of supporting 80-bit temporary reals
    entirely, as they are now incompatible as an internal format.

    I have instead defined a 256-bit format for floats which does not have
    a hidden first bit, which looks like the old temporary reals, except
    that the exponent field is one bit wider.

    And in addition, just as the IBM 704 used two single-precision floats
    to make a double-precision float, and the IBM System/360 Model 85
    started using two double-precision floats to make an extended
    precision float... I've defined how the 256-bit internal format floats
    can be doubled up to make a 512-bit float.

    I'm not really sure such floating-point precision is useful, but I do
    remember some people telling me that higher float precision is indeed
    something to be desired. Well, the IEEE 754 standard has forced my
    hand.

    YES, I'd use something similar:
    I never cared about nor supported any odd 10-byte formats, and I don't
    give a fart about all these weird IEEE standards.

    My OS and its calculator support only 2^N-sized data, starting with
    4-bit nibbles up to a 512-bit mantissa, both signed and unsigned, and
    optionally extended by a 2^N or an x^10 valued exponent (nibble-based
    sizes).

    I finally (1998) made all numeric variable types QUAD-based (4,8,12,...).
    This made all calculation and input/output routines short/fast and gives
    my clients the option to define their own formats (i.e. 12+4, 28+4,...).

    I can also use BCD variables, but they belong to text here and are
    converted to binary on the fly when entered in calculations.
    __
    wolfgang

  • From John Savard@21:1/5 to [email protected] on Sun May 12 11:35:15 2024
    On Sun, 12 May 2024 13:46:28 -0000 (UTC), Thomas Koenig
    <[email protected]> wrote:
    John Savard <[email protected]d> schrieb:

    I have instead defined a 256-bit format for floats which does not have
    a hidden first bit, which looks like the old temporary reals, except
    that the exponent field is one bit wider.

    Why not the IEEE binary256 (interchange) format?

    https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

    Oh, drat. I had not realized that they had also defined this.

    Now that means I need to make the exponent field in my internal
    format larger, define a 512-bit floating point number which is in the
    internal format, so that it can be unnormalized, and a 1024-bit
    doubled-up float... instead of what I just did!

    The enlarged exponent field won't make the internal form of the
    128-bit float go over 160 bits, so register allocation for that won't
    change... but now I will have to figure out a scheme of register
    allocation applicable to the 256-bit floats!

    I am not amused.

    John Savard

  • From John Savard@21:1/5 to [email protected] on Sun May 12 12:15:10 2024
    On Sun, 12 May 2024 11:35:15 -0600, John Savard
    <[email protected]d> wrote:

    Now that means I need to make the exponent field in my internal
    format larger, define a 512-bit floating point number which is in the
    internal format, so that it can be unnormalized, and a 1024-bit
    doubled-up float... instead of what I just did!

    The update has been made.

    John Savard

  • From John Dallman@21:1/5 to John Savard on Sun May 12 20:34:00 2024
    In article <[email protected]>, [email protected]d (John Savard) wrote:

    I'm not really sure such floating-point precision is useful, but I
    do remember some people telling me that higher float precision is
    indeed something to be desired.

    I would be in favour of 128-bit being available. I'm not sure my field
    has need for 256- or 512-bit, but that doesn't mean that nobody has.

    John

  • From MitchAlsup1@21:1/5 to John Savard on Sun May 12 20:12:22 2024
    John Savard wrote:

    On Sun, 12 May 2024 13:46:28 -0000 (UTC), Thomas Koenig <[email protected]> wrote:
    John Savard <[email protected]d> schrieb:

    I have instead defined a 256-bit format for floats which does not have
    a hidden first bit, which looks like the old temporary reals, except
    that the exponent field is one bit wider.

    Why not the IEEE binary256 (interchange) format?

    https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

    Oh, drat. I had not realized that they had also defined this.

    Now that means I need to make the exponent field in my internal
    format larger, define a 512-bit floating point number which is in the
    internal format, so that it can be unnormalized, and a 1024-bit
    doubled-up float... instead of what I just did!

    The enlarged exponent field won't make the internal form of the
    128-bit float go over 160 bits, so register allocation for that won't
    change... but now I will have to figure out a scheme of register
    allocation applicable to the 256-bit floats!

    I am not amused.

    Question:: why are you all so gung-ho on having a format without a
    hidden bit? It is trivially easy to reconstruct::

    h = expon != 0;

    Taking little time or even gates; and is something you HAVE to do anyway.
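
    As an illustration of that one-line reconstruction, here is a minimal
    Python sketch that unpacks an IEEE 754 binary64 value and makes the
    leading significand bit explicit. The function name decode_binary64 is
    an assumption made for this example; the field widths are those of
    binary64, not of any Concertina II format.

```python
import struct

def decode_binary64(x: float):
    """Split a binary64 value into sign, exponent field and 53-bit significand."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    expon = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # The hidden bit: 1 for normal numbers, 0 for zero and subnormals.
    # (For the all-ones exponent of Inf/NaN it also comes out as 1,
    # which is harmless because those are handled separately.)
    h = int(expon != 0)
    significand = (h << 52) | fraction
    return sign, expon, significand

print(decode_binary64(1.0))
print(decode_binary64(5e-324))   # smallest subnormal: hidden bit is 0
```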

    John Savard

  • From Thomas Koenig@21:1/5 to John Dallman on Sun May 12 20:55:03 2024
    John Dallman <[email protected]> schrieb:
    In article <[email protected]>, [email protected]d (John Savard) wrote:

    I'm not really sure such floating-point precision is useful, but I
    do remember some people telling me that higher float precision is
    indeed something to be desired.

    I would be in favour of 128-bit being available.

    Me, too. Solving tricky linear systems, or obtaining derivatives
    numerically (for example for Jacobians) eats up a _lot_ of precision
    bits, and double precision can sometimes run into trouble.
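
    A small sketch of how numerical differentiation eats precision. It
    uses Python's decimal module at 34 significant digits as a stand-in
    for a 128-bit format (an assumption made for convenience; it is
    decimal, not binary128), and the step sizes are simply chosen near the
    square root of each format's rounding unit.

```python
import math
from decimal import Decimal, getcontext

def forward_diff(f, x, h):
    """One-sided finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x)) / h

# binary64: roughly 16 significant decimal digits.
err64 = abs(forward_diff(math.exp, 1.0, 1e-8) - math.exp(1.0))

# Stand-in for a 128-bit format: 34 significant decimal digits.
getcontext().prec = 34
one = Decimal(1)
err128 = abs(forward_diff(lambda v: v.exp(), one, Decimal("1e-17")) - one.exp())

print(f"binary64 error: {err64:.1e}")
print(f"34-digit error: {float(err128):.1e}")
```

    In both cases about half of the available digits survive, which is
    why a Jacobian computed this way in double precision is only good to
    roughly eight digits.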

    At least gcc and gfortran now support POWER's native 128-bit format
    in hardware. On other systems, software emulation is used, which
    is of course much slower.

    I'm not sure my field
    has need for 256- or 512-bit, but that doesn't mean that nobody has.

    I've finally found the time to play around with Julia in the last
    few weeks. One of the nice things it does is that you can just use
    the same packages with different numerical types, for example for
    ODE integration. Just set up the problem as you would normally
    and supply a starting vector with a different precision.

    So, for doing some experiments on numerical data types, Julia
    is quite nice.

  • From John Savard@21:1/5 to All on Sun May 12 18:00:39 2024
    On Sun, 12 May 2024 20:12:22 +0000, [email protected] (MitchAlsup1)
    wrote:


    Question:: why are you all so gung-ho on having a format without a
    hidden bit. It is trivially easy to reconstruct::

    h = expon != 0;

    Taking little time or even gates; and is something you HAVE to do anyway.

    The explanation here is, I am afraid, probably once again ignorance on
    my part. I knew that processing the hidden bit would take _some_ time
    and effort, since, after all, in the early days computers didn't use
    formats that had one.

    So I assumed that converting to an internal format without a hidden
    bit - even the 8087 did that - would yield a significant speed up.
    (And, as noted, I'm following Seymour Cray in sacrificing everything
    for speed, so even if the speedup is small, I am inclined to chase
    it.)

    John Savard

  • From Michael S@21:1/5 to Thomas Koenig on Mon May 13 15:16:47 2024
    On Sun, 12 May 2024 20:55:03 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    John Dallman <[email protected]> schrieb:
    In article <[email protected]>, [email protected]d (John Savard) wrote:

    I'm not really sure such floating-point precision is useful, but I
    do remember some people telling me that higher float precision is
    indeed something to be desired.

    I would be in favour of 128-bit being available.

    Me, too. Solving tricky linear systems, or obtaining derivatives
    numerically (for example for Jacobians) eats up a _lot_ of precision
    bits, and double precision can sometimes run into trouble.

    At least gcc and gfortran now support POWER's native 128-bit format
    in hardware. On other systems, software emulation is used, which
    is of course much slower.


    Much slower?
    I think, at least for matrix multiplication, my emulation on modern x86
    was within a factor of 1.5x of your measurements on POWER9. And that
    despite a rather poorly chosen ABI for the support routines. With a
    better ABI (pure integer, with no copies from/to XMM slowing things
    down, esp. on Zen3) I would expect it to be a wash.
    With a slightly higher-level API (qaxpy instead of individual mul/add),
    software can actually pull ahead.

    I'm not sure my field
    has need for 256- or 512-bit, but that doesn't mean that nobody
    has.

    I've finally found the time to play around with Julia in the last
    few weeks. One of the nice things it does is that you can just use
    the same packages with different numerical types, for example for
    ODE integration. Just set up the problem as you would normally
    and supply a starting vector with a different precision.

    So, for doing some experiments on numerical data types, Julia
    is quite nice.

    It's a pity that something like that is not available in GNU Octave.

  • From Thomas Koenig@21:1/5 to Michael S on Mon May 13 19:01:37 2024
    Michael S <[email protected]> schrieb:
    On Sun, 12 May 2024 20:55:03 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    John Dallman <[email protected]> schrieb:
    In article <[email protected]>,
    [email protected]d (John Savard) wrote:

    I'm not really sure such floating-point precision is useful, but I
    do remember some people telling me that higher float precision is
    indeed something to be desired.

    I would be in favour of 128-bit being available.

    Me, too. Solving tricky linear systems, or obtaining derivatives
    numerically (for example for Jacobians) eats up a _lot_ of precision
    bits, and double precision can sometimes run into trouble.

    At least gcc and gfortran now support POWER's native 128-bit format
    in hardware. On other systems, software emulation is used, which
    is of course much slower.


    Much slower?
    I think, at least for matrix multiplication, my emulation on modern x86
    was within a factor of 1.5x of your measurements on POWER9.

    I don't remember the exact timing, and it might be interesting to
    revisit that (also considering that the gfortran code for matmul is
    not optimized for 128-bit float and might have blown cache sizes,
    plus it would be fair to compare compiler vs. compiler and assembler
    vs. assembler).

    I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
    with one result per cycle, POWER10 has 12 to 13 cycles with two
    results per cycle.

    What can your code get on x86_64?

  • From MitchAlsup1@21:1/5 to BGB on Mon May 13 21:16:48 2024
    BGB wrote:


    Emulation via traps is very slow, but typical for many ISA's is to
    just quietly turn the soft-float operations into runtime calls.

    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.
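
    The three steps named above (hash table access, check, TLB install)
    can be sketched as a toy model in Python. The page size, the table
    size and all names here are assumptions made for illustration; this is
    not the actual MIPS refill handler or its table format.

```python
PAGE_SHIFT = 12          # assumed 4 KiB pages
TABLE_SLOTS = 1 << 16    # assumed table size, chosen only for illustration

def make_table():
    return [None] * TABLE_SLOTS

def map_page(table, vpn, pfn):
    """Record a virtual-page to physical-frame mapping in the hash table."""
    table[vpn % TABLE_SLOTS] = (vpn, pfn)

def tlb_refill(vaddr, table, tlb):
    """Return True if the TLB miss was resolved on the fast path."""
    vpn = vaddr >> PAGE_SHIFT
    entry = table[vpn % TABLE_SLOTS]       # hash table access
    if entry is None or entry[0] != vpn:   # check the tag
        return False                       # a slower path would have to take over
    tlb[vpn] = entry[1]                    # TLB install
    return True

table, tlb = make_table(), {}
map_page(table, 0x12345, 0x00ABC)
print(tlb_refill(0x12345678, table, tlb), tlb)
```

    The fast path is a handful of operations, which is why the cost of
    getting into and out of the handler dominates.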

    On an x86 this would be at least 200 cycles just getting there and back.

    So, to revisit your statement::

    Emulation is slow when trap overhead is large and not-slow when trap overhead is small.

  • From MitchAlsup1@21:1/5 to BGB on Mon May 13 23:25:25 2024
    BGB wrote:

    On 5/13/2024 4:16 PM, MitchAlsup1 wrote:
    BGB wrote:


    Emulation via traps is very slow, but typical for many ISA's is to
    just quietly turn the soft-float operations into runtime calls.

    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.

    On an x86 this would be at least 200 cycles just getting there and back.


    I guess there are different possibilities here...

    Trap cost can be reduced, say, by having banked registers.
    But, not so good with explicit save/restore and a large register file.


    For example, I can note that a MSP430 at 16MHz can service a 32kHz
    timer... (with a budget of 488 cycles per interrupt).


    But, my BJX2 core (at 50MHz) would have a harder time here, with around
    a 1.5k cycle budget...
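
    The two budgets follow from dividing the clock rate by the interrupt
    rate. A quick check in Python, assuming "32kHz" means 32768 Hz (the
    usual watch-crystal rate, which is what reproduces the 488 figure):

```python
def cycles_per_interrupt(clock_hz: float, interrupt_hz: float) -> float:
    """Clock cycles available between two consecutive timer interrupts."""
    return clock_hz / interrupt_hz

TIMER_HZ = 32768  # assumption: "32kHz" read as 2**15 Hz

msp430 = cycles_per_interrupt(16e6, TIMER_HZ)
bjx2 = cycles_per_interrupt(50e6, TIMER_HZ)

print(int(msp430))   # 488
print(round(bjx2))   # 1526
```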

    Then again, it is possible the per-interrupt overhead would go down
    slightly, since most likely the ISR stack will still be in the L1
    cache between interrupts (and save/restore overhead should drop to
    ~ 100 cycles in the absence of L1 misses).


    MSP430 had a slight advantage here (besides fewer registers) in that
    L1 misses are not a thing (so, memory access has constant latency).


    So, to revisit your statement::

    Emulation is slow when trap overhead is large and not-slow when trap
    overhead is small.

    Possible, but I would not expect trap overhead to be lower than runtime
    call overhead...

    Yes, of course, trapping can never be quite as inexpensive as a CALL/RET sequence.

    But it does not have to be much larger--just a little bit larger.


    Also (in my case):
    Debugging is rather annoying when the bugs appear/disappear/move
    around at random or with the slightest perturbation...

    You need better verification--Oh Wait ...

    But, given for the most part behavior is consistently buggy (and
    manifesting in seemingly the same ways) between both the emulator and
    Verilog implementation, this implies the causal factors are in software.

    I guess in this case, either I figure it out, or will need to again go
    back to cooperative scheduling. Seemingly, using preemptive scheduling
    and virtual memory at the same time is particularly unstable (programs
    tend to crash on startup or soon after).


    Also I may need to rework how page-in/page-out is handled (and or how IO
    is handled in general) since if a page swap needs to happen while IO is
    already in progress (such as a page-miss in the system-call process), at
    present, the OS is dead in the water (one can't access the SDcard in the
    middle of a different access to the SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

  • From MitchAlsup1@21:1/5 to BGB on Tue May 14 00:43:22 2024
    BGB wrote:

    On 5/13/2024 6:25 PM, MitchAlsup1 wrote:

    Also (in my case):
    Debugging is rather annoying when the bugs appear/disappear/move
    around at random or with the slightest perturbation...

    You need better verification--Oh Wait ...


    Not sure I understand what you mean by this.


    Some of these bugs are behaving very similarly to some bugs I was
    battling against a while ago (but never properly debugged; the bug just
    sort of seemingly disappeared).

    What you need is 1M-1B instructions that test every known corner case
    so if a fix breaks something else you will find it almost instantaneously.
    {At 50 MHz 1B instructions is 50 seconds of run time.}
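
    The run time of such a suite depends on the sustained instructions
    per cycle as well as the clock. A rough Python sketch, where the ipc
    parameter is an assumption that the post does not state: under this
    simple model 50 seconds corresponds to about 0.4 instructions per
    cycle, while one instruction per cycle would give 20 seconds.

```python
def run_time_seconds(instructions: float, clock_hz: float, ipc: float = 1.0) -> float:
    """Estimated wall time for a given instruction count, clock rate and IPC."""
    return instructions / (clock_hz * ipc)

print(run_time_seconds(1e9, 50e6))            # one instruction per cycle
print(run_time_seconds(1e9, 50e6, ipc=0.4))   # assumed IPC that yields about 50 s
print(run_time_seconds(1e6, 50e6))            # the 1M-instruction end of the range
```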

    In the semiconductor industry, we use this test suite in the simulators,
    emulators, on the test head, and on parts returned from the field. Also,
    it is under constant evolution simply so we don't let corner cases let
    bugs into sellable parts.

    Also I may need to rework how page-in/page-out is handled (and or how
    IO is handled in general) since if a page swap needs to happen while
    IO is already in progress (such as a page-miss in the system-call
    process), at present, the OS is dead in the water (one can't access
    the SDcard in the middle of a different access to the SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it also
    abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
    So, having a GuestOS in a position it cannot deal with another page
    fault is no longer a hindrance:: GuestOS does not see that page fault;
    it is just handled and goes away.

  • From Anton Ertl@21:1/5 to [email protected] on Tue May 14 05:35:53 2024
    [email protected] (MitchAlsup1) writes:
    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.

    Which MIPS? R2000? R10000? Something else? Was this an inverted page
    table?

    On an x86 this would be at least 200 cycles just getting there and back.

    Which x86? 8086? 80186? 80286? These (maybe the 8088 and V20, too)
    are the only implementations that deserve to be called x86. If you
    mean some IA-32 or AMD64 implementations, which ones?

    Anyway, let's see how this works for the U74 (a RISC-V implementation
    which apparently uses trapping for unaligned loads); here we have a
    10M iteration loop with a payload that performs one load per
    iteration:

    [fedora-starfive:~/nfstmp/gforth-riscv:104544] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye'

    Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye':

    223805151 instructions:u # 0.70 insn per cycle
    318131306 cycles:u

    0.352533487 seconds time elapsed

    0.257061000 seconds user
    0.064265000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:104545] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye'

    Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye':

    5329494415 instructions:u # 0.75 insn per cycle
    7149481783 cycles:u

    7.183239751 seconds time elapsed

    7.082298000 seconds user
    0.070121000 seconds sys

    So the unaligned access handling results in 511 additional instructions
    per load compared to an aligned access (so it obviously does the
    handling using some kind of trapping). Each unaligned access results
    in 683 additional cycles.
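
    The per-load figures come from subtracting the aligned run from the
    unaligned run and dividing by the iteration count. The same
    arithmetic in Python, using the counter values from the two perf
    runs above (the dictionary and function names are just for this
    example):

```python
ITERATIONS = 10_000_000

aligned = {"instructions": 223_805_151, "cycles": 318_131_306}
unaligned = {"instructions": 5_329_494_415, "cycles": 7_149_481_783}

def extra_per_load(counter: str) -> float:
    """Additional events per load caused by the unaligned access."""
    return (unaligned[counter] - aligned[counter]) / ITERATIONS

print(round(extra_per_load("instructions")))  # 511
print(round(extra_per_load("cycles")))        # 683
```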

    So better use the unspecified MIPS, right? However, if the
    unspecified MIPS is an R2000, 19 cycles on a 12.5MHz R2000 cost
    1.52us, whereas 683 cycles on a 1000MHz U74 cost 0.683us (and I have
    heard that in the Visionfive V2 the U74 runs at 1500MHz).
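
    Converting cycle counts to wall time makes the comparison explicit.
    A short Python check, using the clock rates mentioned above (the
    helper name is an assumption for this example):

```python
def cycles_to_microseconds(cycles: int, clock_mhz: float) -> float:
    """Time taken by a number of cycles at a given clock rate in MHz."""
    return cycles / clock_mhz

print(cycles_to_microseconds(19, 12.5))     # R2000 at 12.5 MHz
print(cycles_to_microseconds(683, 1000.0))  # U74 at 1000 MHz
print(cycles_to_microseconds(683, 1500.0))  # U74 at the reported 1500 MHz
```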

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to [email protected] on Tue May 14 13:49:09 2024
    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and or how
    IO is handled in general) since if a page swap needs to happen while
    IO is already in progress (such as a page-miss in the system-call
    process), at present, the OS is dead in the water (one can't access
    the SDcard in the middle of a different access to the SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it also
    abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the
    I/O device hardware directly and initiates DMA transactions to-or-from
    the guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.


    So, having a GuestOS in a position it cannot deal with another page
    fault is no longer a hindrance:: GuestOS does not see that page fault;
    it is just handled and goes away.

    There are two levels of page faults - at the guest level, the
    guest handles everything. When the hypervisor supports
    multiplexing multiple guests on a core, it will only handle second
    level translation table faults.
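
    A toy Python model of the two levels, to show who sees which fault.
    The dictionaries standing in for translation tables and the exception
    names are assumptions made for illustration; real hardware also runs
    the first-level table walk itself through the second level, which
    this sketch leaves out.

```python
class GuestPageFault(Exception):
    """First-level miss: delivered to and handled by the guest OS."""

class HypervisorPageFault(Exception):
    """Second-level miss: handled by the hypervisor, never seen by the guest."""

def translate(guest_virtual_page, stage1, stage2):
    """Map a guest-virtual page to a host-physical page through both levels."""
    if guest_virtual_page not in stage1:
        raise GuestPageFault(guest_virtual_page)
    guest_physical_page = stage1[guest_virtual_page]
    if guest_physical_page not in stage2:
        raise HypervisorPageFault(guest_physical_page)
    return stage2[guest_physical_page]

stage1 = {1: 10, 2: 20}   # owned by the guest
stage2 = {10: 100}        # owned by the hypervisor

print(translate(1, stage1, stage2))
for page in (2, 3):
    try:
        translate(page, stage1, stage2)
    except (GuestPageFault, HypervisorPageFault) as fault:
        print(page, type(fault).__name__)
```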

  • From Michael S@21:1/5 to Scott Lurndal on Tue May 14 17:51:15 2024
    On Tue, 14 May 2024 13:49:09 GMT
    [email protected] (Scott Lurndal) wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and
    or how IO is handled in general) since if a page swap needs to
    happen while IO is already in progress (such as a page-miss in
    the system-call process), at present, the OS is dead in the
    water (one can't access the SDcard in the middle of a different
    access to the SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page
    faults of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the
    I/O device hardware directly and initiates DMA transactions
    to-or-from the guest OS directly. With the PCIe PRI (Page Request
    Interface), the guest DMA target pages don't need to be pinned by the
    hypervisor; the I/O MMU will interrupt the hypervisor to make the
    page present and pin it and the hardware will then do the DMA.


    Sounds like it could be problematic from a real-time perspective.
    When I design PCIe devices, I sometimes have a device-side FIFO
    sufficient for 2-5 times the expected worst-case PCIe latency, i.e.
    for 7-10 usec or so. In the scenario you describe, it could easily
    overflow for an acquisition-type device or underflow for a player-type
    device.

    Now, my devices are not intended to be plugged into a virtualized
    server, but I'd think that I am not the only designer who chooses the
    size of FIFOs by that sort of logic.
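
    The sizing logic can be sketched in a few lines of Python. The data
    rate of 1 GB/s and the stall durations are assumed values chosen only
    to illustrate the argument; they are not measurements of any device
    or of any page-request path.

```python
def fifo_depth_bytes(rate_bytes_per_s: float, cover_s: float) -> float:
    """FIFO depth needed to ride out cover_s seconds of host unavailability."""
    return rate_bytes_per_s * cover_s

def survives_stall(fifo_bytes: float, rate_bytes_per_s: float, stall_s: float) -> bool:
    """True if the data arriving (or consumed) during the stall fits in the FIFO."""
    return rate_bytes_per_s * stall_s <= fifo_bytes

RATE = 1e9                              # assumed stream rate: 1 GB/s
depth = fifo_depth_bytes(RATE, 10e-6)   # sized for 10 usec, as in the post

print(round(depth))                        # 10000
print(survives_stall(depth, RATE, 5e-6))   # ordinary PCIe latency
print(survives_stall(depth, RATE, 1e-3))   # an assumed 1 ms page-in stall
```

    A FIFO sized for microseconds has no margin for a stall measured in
    milliseconds, which is the overflow/underflow concern raised above.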


    So, having a GuestOS in a position it cannot deal with another page
    fault is no longer a hindrance:: GuestOS does not see that page
    fault; it is just handled and goes away.

    There are two levels of page faults - at the guest level, the
    guest handles everything. When the hypervisor supports
    multiplexing multiple guests on a core, it will only handle second
    level translation table faults.

  • From Scott Lurndal@21:1/5 to Michael S on Tue May 14 15:57:59 2024
    Michael S <[email protected]> writes:
    On Tue, 14 May 2024 13:49:09 GMT
    [email protected] (Scott Lurndal) wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and
    or how IO is handled in general) since if a page swap needs to
    happen while IO is already in progress (such as a page-miss in
    the system-call process), at present, the OS is dead in the
    water (one can't access the SDcard in the middle of a different
    access to the SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page
    faults of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the
    I/O device hardware directly and initiates DMA transactions
    to-or-from the guest OS directly. With the PCIe PRI (Page Request
    Interface), the guest DMA target pages don't need to be pinned by the
    hypervisor; the I/O MMU will interrupt the hypervisor to make the
    page present and pin it and the hardware will then do the DMA.


    Sounds like it could be problematic from a real-time perspective.

    Not really. The device presents 'virtual functions' to the
    guest. The physical function (owned by the hypervisor) can
    assign adapter resources (rings, queues, CAMS, interrupt vectors)
    to the virtual function which are then 'owned' by the guest operating
    system.

    When I design PCIe devices, I sometimes have a device-side FIFO
    sufficient for 2-5 times the expected worst-case PCIe latency, i.e.
    for 7-10 usec or so. In the scenario you describe, it could easily
    overflow for an acquisition-type device or underflow for a player-type
    device.

    We have SR-IOV devices that support hundreds of virtual functions
    each of which can handle packet traffic at line rate across multiple
    100Gb network ports.

    https://docs.kernel.org/networking/device_drivers/ethernet/marvell/octeontx2.html


    Now, my devices are not intended to be plugged into a virtualized
    server, but I'd think that I am not the only designer who chooses the
    size of FIFOs by that sort of logic.

    If you're building an SR-IOV device, you obviously need to build
    it to support the required workloads.

  • From Scott Lurndal@21:1/5 to [email protected] on Tue May 14 16:00:33 2024
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.

    Which MIPS? R2000? R10000? Something else? Was this an inverted page
    table?

    R3000 and it was a hash table ~1MB in size.

    Which would have been a significant fraction (25%?) of the
    total memory available on a R3k based system in 1990.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Tue May 14 15:19:34 2024
    Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.

    Which MIPS? R2000? R10000? Something else? Was this an inverted page
    table?

    R3000 and it was a hash table ~1MB in size.

    On an x86 this would be at least 200 cycles just getting there and back.

    Which x86? 8086? 80186? 80286? These (maybe the 8088 and V20, too)
    are the only implementations that deserve to be called x86. If you
    mean some IA-32 or AMD64 implementations, which ones?

    Anyway, let's see how this works for the U74 (a RISC-V implementation
    which apparently uses trapping for unaligned loads); here we have a
    10M iteration loop with a payload that performs one load per
    iteration:

    [fedora-starfive:~/nfstmp/gforth-riscv:104544] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye'

    Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye':

    223805151 instructions:u # 0.70 insn per cycle
    318131306 cycles:u

    0.352533487 seconds time elapsed

    0.257061000 seconds user
    0.064265000 seconds sys


    [fedora-starfive:~/nfstmp/gforth-riscv:104545] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye'

    Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye':

    5329494415 instructions:u # 0.75 insn per cycle
    7149481783 cycles:u

    7.183239751 seconds time elapsed

    7.082298000 seconds user
    0.070121000 seconds sys

    So the unaligned access handling results in 511 additional instructions
    per load compared to an aligned access (so it obviously does the
    handling using some kind of trapping). Each unaligned access results
    in 683 additional cycles.
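
    As a cross-check of the per-load figures above, here is a minimal Python
    sketch that recomputes them from the two perf stat runs. The counter
    values are copied from the output shown; the variable and function names
    are only illustrative.

```python
# Counter values copied from the two perf stat runs quoted above.
ITERATIONS = 10_000_000  # loop count of the Forth word `foo`

aligned = {"instructions": 223_805_151, "cycles": 318_131_306}
unaligned = {"instructions": 5_329_494_415, "cycles": 7_149_481_783}


def extra_per_load(counter: str) -> float:
    """Additional events per load caused by the misaligned address."""
    return (unaligned[counter] - aligned[counter]) / ITERATIONS


extra_insns = extra_per_load("instructions")
extra_cycles = extra_per_load("cycles")

print(f"extra instructions per load: {extra_insns:.1f}")
print(f"extra cycles per load:       {extra_cycles:.1f}")
```

    Rounded to whole numbers this gives the 511 instructions and 683 cycles
    quoted in the text.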

    Yes, but notice sys time hardly changes, so RISC-V is performing the
    misaligned LD in user mode (2 context switches -- likely somewhat
    lightweight).

    So better use the unspecified MIPS, right? However, if the
    unspecified MIPS is an R2000, 19 cycles on a 12.5MHz R2000 cost
    1.52us, whereas 683 cycles on a 1000MHz U74 cost 0.683us (and I have
    heard that in the Visionfive V2 the U74 runs at 1500MHz).
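
    The cycle-to-time conversion in this paragraph can be sketched as
    follows. The clock rates are the ones mentioned in the text (the R2000
    rate is itself a supposition there); the helper name is hypothetical.

```python
def cycles_to_us(cycles: float, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at a given clock in MHz.

    One MHz is one cycle per microsecond, so the division is direct.
    """
    return cycles / clock_mhz


# 19-cycle TLB refill on a 12.5 MHz R2000 (clock rate as supposed in the text)
r2000_us = cycles_to_us(19, 12.5)

# 683-cycle misaligned-load penalty on a U74 at 1000 MHz and at 1500 MHz
u74_1000_us = cycles_to_us(683, 1000)
u74_1500_us = cycles_to_us(683, 1500)

print(f"R2000 @ 12.5 MHz: {r2000_us:.3f} us")
print(f"U74   @ 1000 MHz: {u74_1000_us:.3f} us")
print(f"U74   @ 1500 MHz: {u74_1500_us:.3f} us")
```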

    Given at least the same cache footprint a 2GHz R3000 would still be
    in the 20-cycle range. {{That 19 cycle TLB reload is dependent on
    the handler and its table having a footprint in the cache(s).}}

    - anton

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Tue May 14 17:48:49 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    [email protected] (MitchAlsup1) writes:
    I recall that MIPS could emulate a TLB table walk in something like
    19 cycles. That is:: a few cycles to get there, a hash table access,
    a check, a TLB install, and a few cycles to get back.

    Which MIPS? R2000? R10000? Something else? Was this an inverted page
    table?

    R3000 and it was a hash table ~1MB in size.

    Which would have been a significant fraction (25%?) of the
    total memory available on an R3k based system in 1990.

    I heard numbers in the 10% range, so we are within a factor of 2 in
    our memory. The smaller main memory was, the less can be co-resident
    and the smaller the effective table size.

    {{But my MIPS info is mostly 3rd hand}}

  • From Michael S@21:1/5 to Thomas Koenig on Tue May 14 22:19:25 2024
    On Mon, 13 May 2024 19:01:37 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:
    On Sun, 12 May 2024 20:55:03 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    John Dallman <[email protected]> schrieb:
    In article <[email protected]>,
    [email protected]d (John Savard) wrote:

    I'm not really sure such floating-point precision is useful, but
    I do remember some people telling me that higher float
    precision is indeed something to be desired.

    I would be in favour of 128-bit being available.

    Me, too. Solving tricky linear systems, or obtaining derivatives
    numerically (for example for Jacobians) eats up a _lot_ of
    precision bits, and double precision can sometimes run into
    trouble.

    At least gcc and gfortran now support POWER's native 128-bit format
    in hardware. On other systems, software emulation is used, which
    is of course much slower.


    Much slower?
    I think, at least for matrix multiplication, my emulation on modern
    x86 was within factor of 1.5x from your measurements on POWER9.

    I don't remember the exact timing, and it might be interesting to
    revisit that (also considering that the

    IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix multiplication benchmark running on a single POWER9 core.

    I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
    GHz) using my plug-in replacements for gcc __multf3/__addtf3 with the
    level of support for FP exceptions and rounding modes that, according
    to you, is sufficient for Fortran, but according to other gnu
    maintainers is insufficient for C. For matrix multiplication
    implemented with vector APIs ('multiply vector by scalar' and 'add
    vectors') on the same EPYC3 I got approximately 200 MFLOPS.

    gfortran code for matmul is
    not optimized for 128-bit float and might have blown cache sizes,

    That's possible, but unlikely to make a major impact.
    At 200 MFLOPS even L3 cache is not a bottleneck. And it's actually hard
    to code matrix multiplication so poorly that at least half of data
    wouldn't come from L1D/L2. I took a look at GFortran sources for matmul
    - they are not that bad.

    plus it would be fair to compare compiler vs. compiler and assembler
    vs. assembler).


    My routines are implemented in 'C' and compiled with gcc

    I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
    with one result per cycle, POWER10 has 12 to 13 cycles with two
    results per cycle.

    So, the bottleneck is somewhere else. Maybe multiplication?


    What can your code get on x86_64?

    See above.

  • From Michael S@21:1/5 to Scott Lurndal on Wed May 15 12:04:54 2024
    On Tue, 14 May 2024 15:57:59 GMT
    [email protected] (Scott Lurndal) wrote:


    If you're building an SR-IOV device, you obviously need to build
    it to support the required workloads.


    I am building a PCIe device+driver that is unaware of SR-IOV.
    I think that Mitch was operating under the same assumption in the post
    to which you responded.
    When both my device and my driver are aware of the additional layers
    of hardware, and especially of software, between them, it could be
    made to work, but in that case virtualization is no longer transparent,
    although the non-transparency is of a different variety from that of
    paravirtualization.

  • From Michael S@21:1/5 to wolfgang kern on Wed May 15 12:07:13 2024
    On Sun, 12 May 2024 15:30:40 +0200
    wolfgang kern <[email protected]> wrote:

    On 12/05/2024 05:44, John Savard wrote:
    I've made another long-overdue change in the Concertina II
    architecture on the page about 17-bit instructions.

    Since I describe the individual instructions there, with their
    opcodes and what they do, I've illustrated the floating-point
    formats of the architecture on that page.

    The good people in charge of the IEEE 754 standard had seen fit to
    define a standard 128-bit floating-point format which included a
    hidden first bit.

    This annoyed me greatly, because I was going to take the 8087's
    temporary real format, and extend the mantissa for my 128-bit
    format.

    I've decided that it's necessary to fully accept the 128-bit
    standard and support it in a consistent manner.

    Therefore, I have taken the following actions:

    I have dropped the option of supporting 80-bit temporary reals
    entirely, as they are now incompatible as an internal format.

    I have instead defined a 256-bit format for floats which does not
    have a hidden first bit, which looks like the old temporary reals,
    except that the exponent field is one bit wider.

    And in addition, just as the IBM 704 used two single-precision
    floats to make a double-precision float, and the IBM System/360
    Model 85 started using two double-precision floats to make an
    extended precision float... I've defined how the 256-bit internal
    format floats can be doubled up to make a 512-bit float.

    I'm not really sure such floating-point precision is useful, but I
    do remember some people telling me that higher float precision is
    indeed something to be desired. Well, the IEEE 754 standard has
    forced my hand.

    YES, I'd use something similar:
    I never cared nor supported any odd 10 byte formats and I give a fart
    to all these weird IEEE standards.


    I suppose, it's mutual.

  • From Scott Lurndal@21:1/5 to Michael S on Wed May 15 13:29:48 2024
    Michael S <[email protected]> writes:
    On Tue, 14 May 2024 15:57:59 GMT
    [email protected] (Scott Lurndal) wrote:


    If you're building an SR-IOV device, you obviously need to build
    it to support the required workloads.


    I am building a PCIe device+driver that is unaware of SR-IOV.

    If you expect your device to be used by virtualized (guest)
    operating systems, then I would strongly recommend that you
    support the SR-IOV capability.

    I think that Mitch was operating under the same assumption in the post
    to which you responded.

    I'll allow Mitch to speak for himself.

    Note that pretty much every server-grade network
    controller (NIC) supports SR-IOV.

  • From Thomas Koenig@21:1/5 to Michael S on Wed May 15 20:08:27 2024
    Michael S <[email protected]> schrieb:

    IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix multiplication benchmark running on a single POWER9 core.

    Just reran the tests, it gave me somewhere around 405-410 MFlops
    on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
    This is with the standard gfortran matmul routine.

    I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
    GHz) using my plug-in replacements for gcc __multf3/__addtf3

    Scaled to frequency, the hardware implementation on POWER is then
    better by a factor of around four. Not too bad, actually.

    [..]
    I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
    with one result per cycle, POWER10 has 12 to 13 cycles with two
    results per cycle.

    So, the bottleneck is somewhere else. Maybe multiplication?

    I messed up the name of the instruction. What I meant was xsmaddqp
    (just trips off the tongue, doesn't it?), which on POWER9 actually
    has a throughput of 1/13 per cycle, a big, fat instruction,
    obviously. On POWER10, this actually got worse, with performance
    dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
    somebody didn't think it was all that important,
    apparently :-(
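
    To put the throughput figures in perspective, here is a rough upper
    bound in Python. It assumes (these are assumptions, not measurements)
    that each xsmaddqp counts as two floating-point operations, that a core
    can start one such instruction every 13 cycles (POWER9) or every 18
    cycles (POWER10), and that the 2.2 GHz value from /proc/cpuinfo is the
    actual clock. The function name is hypothetical.

```python
def peak_fma_mflops(clock_ghz: float, cycles_per_fma: float) -> float:
    """Upper bound on MFLOPS for a stream of fused multiply-adds.

    Assumes one FMA = 2 flops and one FMA issued every `cycles_per_fma`
    cycles on a single core.
    """
    fma_per_second = clock_ghz * 1e9 / cycles_per_fma
    return 2 * fma_per_second / 1e6


# POWER9: throughput of 1/13 per cycle, clock as reported by /proc/cpuinfo
power9_bound = peak_fma_mflops(2.2, 13)
print(f"POWER9 bound at 2.2 GHz: {power9_bound:.0f} MFLOPS")

# POWER10: throughput of 1/18 per cycle, evaluated at the same clock purely
# for comparison; the real POWER10 clock is not given in the thread
power10_bound = peak_fma_mflops(2.2, 18)
print(f"POWER10 bound at the same clock: {power10_bound:.0f} MFLOPS")
```

    Under these assumptions the POWER9 bound comes out below the measured
    405-410 MFlops, which suggests that at least one assumption, possibly
    the reported clock, does not hold for that run.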

  • From Michael S@21:1/5 to Thomas Koenig on Thu May 16 00:16:28 2024
    On Wed, 15 May 2024 20:08:27 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    Michael S <[email protected]> schrieb:

    IIRC, you reported something like 200 (or 300?) MFLOPS for your
    matrix multiplication benchmark running on a single POWER9 core.


    Not too bad. Not too good, either.

    Just reran the tests, it gave me somewhere around 405-410 MFlops
    on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
    This is with the standard gfortran matmul routine.


    I don't think that nowadays /proc/cpuinfo has any relationship to
    actual frequency. Most likely with a single core active even the
    cheapest POWER9 SKU runs at 3.8 GHz.
    If there is no ready-made utility, you can measure it yourself with a
    latency-bound loop. Just don't forget that on POWER9 all simple integer
    opcodes have latency=2.
    If there are any difficulties, I can help.

    I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
    GHz) using my plug-in replacements for gcc __multf3/__addtf3

    Scaled to frequency, the hardware implementation on POWER is then
    better by a factor of around four. Not too bad, actually.


    If my guess about frequency is correct, then more like factor of 2.6.
    Of which, factor of approximately 1.3 has to be attributed to bad
    libgcc ABI.
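
    The two per-clock ratios under discussion can be reproduced with a
    small Python sketch. It takes the midpoint of the reported 405-410
    MFlops for POWER9 and the ~150 MFLOPS at 3.6 GHz for EPYC3; the 3.8 GHz
    POWER9 clock is only the guess made above, and all names are
    illustrative.

```python
POWER9_MFLOPS = (405 + 410) / 2   # midpoint of the reported range
EPYC3_MFLOPS = 150                # software emulation result
EPYC3_GHZ = 3.6


def per_clock_advantage(power9_ghz: float) -> float:
    """Ratio of POWER9 to EPYC3 MFLOPS after normalising by clock rate."""
    power9_per_ghz = POWER9_MFLOPS / power9_ghz
    epyc3_per_ghz = EPYC3_MFLOPS / EPYC3_GHZ
    return power9_per_ghz / epyc3_per_ghz


reported = per_clock_advantage(2.2)   # clock as shown by /proc/cpuinfo
guessed = per_clock_advantage(3.8)    # clock guessed for a single active core

print(f"at 2.2 GHz: {reported:.2f}x")
print(f"at 3.8 GHz: {guessed:.2f}x")
```

    The first value corresponds to "a factor of around four", the second to
    the "factor of 2.6" estimate.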
    [O.T.]
    BTW, on ARM64 libgcc ABI for __multf3/__addtf3 is similarly bad. The
    only decent ABI for __multf3/__addtf3 that I encountered experimenting
    on godbolt was for RV64. But that is little consolation considering the
    huge performance gap between the best RV64 and not even the best, but
    just a competent iAMD64 or ARM64.
    [/O.T.]

    Anyway, performance per clock is of limited interest. What matters is
    absolute performance (sometimes throughput, sometimes latency) and
    performance per watt.
    I would guess that, using SMT4, POWER9 can get over 80% of theoretical
    throughput, but getting there would take either multiplying a really
    big matrix or lots of medium ones.
    On EPYC3, on the other hand, I don't expect measurable SMT gain. But
    relative to POWER9, EPYC3 has more cores and much lower power
    consumption per core.

    [..]
    I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
    with one result per cycle, POWER10 has 12 to 13 cycles with two
    results per cycle.

    So, the bottleneck is somewhere else. Maybe multiplication?

    I messed up the name of the instruction. What I meant was xsmaddqp
    (just trips off the tongue, doesn't it?), which on POWER9 actually
    has a throughput of 1/13 per cycle, a big, fat instruction,
    obviously. On POWER10, this actually got worse, with performance
    dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
    somebody didn't think it was all that important,
    apparently :-(

    Sounds like that.
    Hopefully it's compensated by better power efficiency. And
    unfortunately it's aggravated by lower cost-effectiveness. Or, at least,
    that is what was claimed by a poster (luke.l ?) here.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Thu May 16 21:22:39 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the I/O device hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page virtualization.

    I guess I should have anticipated this. Sorry !!

    So, having a GuestOS in a position it cannot deal with another page
    fault is no longer a hindrance:: GuestOS does not see that page fault;
    it is just handled and goes away.
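
    The flow described above can be illustrated with a toy Python model.
    It is a deliberately simplified sketch: the class and method names are
    hypothetical, the "interrupt" to the hypervisor is a plain method call,
    and real PRI involves page-request and response messages that are not
    modelled here.

```python
class Hypervisor:
    """Toy HV that makes guest pages present on demand and pins them."""

    def __init__(self, present_pages=()):
        self.present = set(present_pages)  # pages currently backed by memory
        self.pinned = set()                # pages pinned because of device faults
        self.faults_handled = 0

    def handle_page_request(self, page: int) -> None:
        # Stands in for "page it in from the page file, then pin it".
        self.faults_handled += 1
        self.present.add(page)
        self.pinned.add(page)


class IOMMU:
    """Toy I/O MMU: non-present DMA targets are reported to the HV."""

    def __init__(self, hv: Hypervisor):
        self.hv = hv

    def translate(self, page: int) -> int:
        if page not in self.hv.present:
            self.hv.handle_page_request(page)  # models the interrupt to the HV
        return page


def guest_dma(iommu: IOMMU, pages) -> int:
    """GuestOS launches the request without checking presence itself."""
    return sum(1 for p in pages if iommu.translate(p) == p)


hv = Hypervisor(present_pages={0x1000})
iommu = IOMMU(hv)
done = guest_dma(iommu, [0x1000, 0x2000, 0x3000, 0x2000])
print(f"pages transferred: {done}, device faults taken by HV: {hv.faults_handled}")
```

    The guest side never sees a fault in this model; only the hypervisor
    object records one per page that was not already present.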

    There are two levels of page faults - at the guest level, the
    guest handles everything. When the hypervisor supports
    multiplexing multiple guests on a core, it will only handle second
    level translation table faults.

  • From Scott Lurndal@21:1/5 to [email protected] on Thu May 16 22:07:41 2024
    [email protected] (MitchAlsup1) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the I/O device
    hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page virtualization.

    I guess I should have anticipated this. Sorry !!

    Add in MR-IOV and CXL and both I/O and memory can be shared
    cluster-wide.



  • From Scott Lurndal@21:1/5 to EricP on Thu May 16 23:10:30 2024
    EricP <[email protected]> writes:
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the I/O
    device
    hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page virtualization.

    I guess I should have anticipated this. Sorry !!

    The reason OS's pin the pages before the IO starts is so there is no
    latency reading in from a device, which then has to buffer the input.
    An HDD seek avg about 9 ms, add 3 ms for the page fault code.
    A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, = 120 kB in 12 ms.

    100Gb/s Ethernet is 10GB/s (20GB/s full duplex). A modern SoC may support multiple 100Gb controllers.

    Granted, for low latency data, the OS and hypervisor will
    cooperate to ensure that the pages are marked present in the IOMMU
    translation mapping tables before the I/O is initiated; PRI is there for
    use cases where the latency to make a page present isn't critical.

  • From EricP@21:1/5 to All on Thu May 16 18:30:36 2024
    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the I/O
    device
    hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page virtualization.

    I guess I should have anticipated this. Sorry !!

    The reason OS's pin the pages before the IO starts is so there is no
    latency reading in from a device, which then has to buffer the input.
    An HDD seek avg about 9 ms, add 3 ms for the page fault code.
    A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, = 120 kB in 12 ms.

    What would likely happen is the Ethernet card buffer would fill up
    then it starts tossing packets, while it waits for HV to page fault
    the receive buffer in from its page file. Later when the guest OS
    buffer has faulted in and the card's buffer is emptied, the network
    software will eventually NAK all the tossed packets and they get resent.

    So there is a stutter every time the HV recycles that guest OS memory
    that requires retransmissions to fix. And this is basically using the
    sender's memory to buffer the transmission while this HV page faults.
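
    The buffering arithmetic above can be sketched in Python. The 10 kB/ms
    rate and the 9 ms + 3 ms stall come from the figures in this post; the
    64 kB on-card buffer and the roughly 10 GB/s rate used for a 100 Gb/s
    link are assumptions chosen only for illustration, as are the function
    names.

```python
def bytes_during_stall(rate_bytes_per_ms: int, stall_ms: int) -> int:
    """Data arriving from the wire while the receive buffer is paged in."""
    return rate_bytes_per_ms * stall_ms


def dropped_bytes(rate_bytes_per_ms: int, stall_ms: int, card_buffer: int) -> int:
    """Bytes the card must toss once its own buffer is full."""
    return max(0, bytes_during_stall(rate_bytes_per_ms, stall_ms) - card_buffer)


STALL_MS = 9 + 3          # HDD seek plus page fault code
FAST_ETHERNET = 10_000    # ~10 kB/ms usable on a 100 Mb/s link
CARD_BUFFER = 64_000      # hypothetical on-card buffer size in bytes

arriving = bytes_during_stall(FAST_ETHERNET, STALL_MS)
tossed = dropped_bytes(FAST_ETHERNET, STALL_MS, CARD_BUFFER)
print(f"100 Mb/s: {arriving} bytes arrive, {tossed} bytes tossed")

# Same stall on a 100 Gb/s link, taken as roughly 10 GB/s = 10_000_000 bytes/ms
arriving_100g = bytes_during_stall(10_000_000, STALL_MS)
print(f"100 Gb/s: {arriving_100g} bytes arrive during the same stall")
```

    Whatever is tossed has to be recovered by retransmission, which is the
    stutter described above.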

    Note there are devices, like A to D converters which cannot fix the
    tossed data by asking for a retransmission. Or devices like tape drives
    which can rewind and reread but are verrry slow about it.

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.

  • From MitchAlsup1@21:1/5 to EricP on Thu May 16 23:10:48 2024
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the I/O
    device
    hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page virtualization.

    I guess I should have anticipated this. Sorry !!

    The reason OS's pin the pages before the IO starts is so there is no
    latency reading in from a device, which then has to buffer the input.
    An HDD seek avg about 9 ms, add 3 ms for the page fault code.
    A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, = 120 kB in 12 ms.

    What would likely happen is the Ethernet card buffer would fill up
    then it starts tossing packets, while it waits for HV to page fault
    the receive buffer in from its page file. Later when the guest OS
    buffer has faulted in and the card's buffer is emptied, the network
    software will eventually NAK all the tossed packets and they get resent.

    So there is a stutter every time the HV recycles that guest OS memory
    that requires retransmissions to fix. And this is basically using the
    sender's memory to buffer the transmission while this HV page faults.

    Note there are devices, like A to D converters which cannot fix the
    tossed data by asking for a retransmission. Or devices like tape drives
    which can rewind and reread but are verrry slow about it.

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.



    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

  • From EricP@21:1/5 to All on Fri May 17 15:26:20 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup1) writes:
    BGB wrote:

    Also I may need to rework how page-in/page-out is handled (and/or
    how IO is handled in general) since if a page swap needs to happen
    while IO is already in progress (such as a page-miss in the
    system-call process), at present, the OS is dead in the water (one
    can't access the SDcard in the middle of a different access to the
    SDcard).

    Having a HyperVisor helps a lot here, with HV taking the page faults
    of the OS page fault handler.

    Seems like adding another layer couldn't help with this, unless it
    also abstracts away the SDcard interface.

    With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

    Actually, that's not completely accurate. With PCI Express SR-IOV,
    an I/O MMU and hardware I/O virtualization, the guest accesses the
    I/O device
    hardware directly and initiates DMA transactions to-or-from the
    guest OS directly. With the PCIe PRI (Page Request Interface), the
    guest DMA target pages don't need to be pinned by the hypervisor; the
    I/O MMU will interrupt the hypervisor to make the page present
    and pin it and the hardware will then do the DMA.

    This was something I was not aware of but probably should have
    anticipated.

    GuestOS initiates an I/O request (command) using a virtual function.
    Rather than going through a bunch of activities to verify the user
    owns the page and it is present, GuestOS just launches request and
    then the I/O device page faults and pins the required page (if it is
    not already so)--much like the page fault volcano when a new process
    begins running:: page faulting in .text, the stack, and data pages
    as they get touched.

    This way, GuestOS simply considers all pages in its "portfolio" to be
    present in memory, and HV does the heavy lifting and page
    virtualization.

    I guess I should have anticipated this. Sorry !!

    The reason OS's pin the pages before the IO starts is so there is no
    latency reading in from a device, which then has to buffer the input.
    An HDD seek avg about 9 ms, add 3 ms for the page fault code.
    A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, = 120 kB in 12 ms.

    What would likely happen is the Ethernet card buffer would fill up
    then it starts tossing packets, while it waits for HV to page fault
    the receive buffer in from its page file. Later when the guest OS
    buffer has faulted in and the card's buffer is emptied, the network
    software will eventually NAK all the tossed packets and they get resent.

    So there is a stutter every time the HV recycles that guest OS memory
    that requires retransmissions to fix. And this is basically using the
    sender's memory to buffer the transmission while this HV page faults.

    Note there are devices, like A to D converters which cannot fix the
    tossed data by asking for a retransmission. Or devices like tape drives
    which can rewind and reread but are verrry slow about it.

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.



    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

  • From MitchAlsup1@21:1/5 to EricP on Fri May 17 21:16:33 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.



    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.



    I don't understand your question.

    Most of a user's files are on the local system and SR-IOV works fine, but one
    or more of his files exist on a remote machine accessed over the internet;
    and the user still uses the SR-IOV interface to access those files.

    How does the system provide the "file is local" illusion to a user having
    SR-IOV access to a non-local file?

    For example, the user opens a file which is an ln -s (a block containing a URL
    to where the file is remotely stored) but the user thinks the file is local.

  • From EricP@21:1/5 to Chris M. Thomasson on Sat May 18 09:40:07 2024
    Chris M. Thomasson wrote:
    On 5/17/2024 12:26 PM, EricP wrote:
    MitchAlsup1 wrote:

    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

    For some reason this made me think about getting a blue screen of death
    due to too much non-paged memory being used by too many concurrent
    overlapped IO's on Windows.

    That shouldn't happen as Windows tracks each process's non-paged pool
    allocations and quotas and it should return an error when exceeded,
    though I've never stress tested it.

  • From Scott Lurndal@21:1/5 to [email protected] on Sat May 18 14:25:16 2024
    [email protected] (MitchAlsup1) writes:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.



    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.



    I don't understand your question.

    Most of a user's files are on the local system and SR-IOV works fine, but one
    or more of his files exist on a remote machine accessed over the internet;
    and the user still uses the SR-IOV interface to access those files.

    SR-IOV is a feature of PCI Express devices.


    How does the system provide the "file is local" illusion to a user having
    SR-IOV access to a non-local file?

    So which PCI Express device is being used to access the device(s) that
    contain the file system that manages the file, which contains
    the data blocks?

    If it is an NVMe device, then the virtual functions are providing
    access to portions of the NVram on the device (or behind the
    device) as if it were a physical device. If it is a SATA
    device, then the virtual function may be providing access to
    a complete unit, or a partition on a shared unit.


    For example, user opens a file which is an ln-s (a block containing a URL
    to where the file is remotely stored) but user thinks file is local.

    That's all handled by the file system code in the operating system. It's
    when a particular data block is required that an SR-IOV virtual function
    may be used (and the file system could easily be combining multiple
    underlying devices (VFs) into a single filesystem (e.g. using volume
    management facilities of the operating system) such that some blocks
    in the file system are on a SAN (Fibre Channel), some are on a LAN
    (CIFS/NFS) and some are on a locally hosted SATA or NVMe device).

    For the SAN case, the Fibre Channel adapter will provide VFs that
    can be used by the guest OS. Likewise for NVMe. I don't believe
    that the SATA (which is rather obsolete now) AHCI standard supports
    SR-IOV, but it would be pretty straightforward to add the SR-IOV
    capability to a SATA controller.

  • From EricP@21:1/5 to All on Sat May 18 11:56:19 2024
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.



    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

    Most of a user's files are on the local system and SR-IOV works fine, but one
    or more of his files exist on a remote machine accessed over the internet;
    and the user still uses the SR-IOV interface to access those files.

    If by "works fine" you mean is slower and has more overhead than
    just pinning the pages first as DMA I/O does now.

    (It's more work to initiate the I/O, fail when it attempts to DMA,
    interrupt the cpu, run an ISR which queues a DPC which queues an APC back to
    the thread, which pins the pages, then restarts the I/O,
    than it is to just pin the pages and start an I/O.)

    How does the system provide the "file is local" illusion to a user having
    SR-IOV access to a non-local file?

    For example, user opens a file which is an ln-s (a block containing a
    URL to where the file is remotely stored) but user thinks file is local.

    I think I see what you are getting at - how does this mechanism
    transparently redirect the SR-IOV device request into a network request?

    That link is traditionally established at file open inside the kernel file
    system by cross-linking between a File Control Block (or whatever it's called)
    and a Network Control Block representing that file on the network.
    Each file syscall request is sent to the FCB then forwarded to the NCB
    and out over a network link.

    As I understand it, SR-IOV is a pseudo hardware device control *register*,
    whereas a disk file is a fictional logical device created by the file
    system driver. I don't think one could use SR-IOV to send commands to
    local file system software (maybe it could trap into the OS).

    One could have *disk controller registers* attached by SR-IOV,
    but a disk controller is not a file system.

    So as I understand the SR-IOV mechanism, one would not be reading
    local or remote files over it under any circumstance.
    But my understanding is limited to what the Microsoft driver
    documentation says about it.

  • From MitchAlsup1@21:1/5 to EricP on Sat May 18 17:33:19 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.


    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

    Most of a user's files are on the local system and SR-IOV works fine, but one
    or more of his files exist on a remote machine accessed over the internet;
    and the user still uses the SR-IOV interface to access those files.

    If by "works fine" you mean is slower and has more overhead than
    just pinning the pages first as DMA I/O does now.

    Yes, copy on write has this problem too when most of the address space
    gets copied.

    (Its more work to initiate the I/O, fail when it attempts to DMA,
    interrupt cpu, run ISR which queues a DPC which queues an APC back to
    the thread, which pins the pages, then restarts the I/O,
    than it is to just pin the pages and start and I/O.)

    How does the system provide the "file is local" illusion to a user having
    SR-IOV access to a non-local file.

    For example, user opens a file which is an ln-s (a block containing a
    URL to where the file is remotely stored) but user thinks file is local.

    I think I see what you are getting at - how does this mechanism
    transparently redirect the SR-IOV device request into a network request?

    The other interpretation is that the unprivileged user is never allowed
    direct access to an SR-IOV device--those are restricted to GuestOS (or
    more privileged hypervisor threads).

    That link is traditionally established at file open inside the kernel file
    system by cross-linking between a File Control Block (or whatever it's called)
    and a Network Control Block representing that file on the network.
    Each file syscall request is sent to the FCB then forwarded to the NCB
    and out over a network link.

    As I understand it, SR-IOV is a pseudo hardware device control *register*,
    whereas a disk file is a fictional logical device created by the file
    system driver. I don't think one could use SR-IOV to send commands to
    local file system software (maybe it could trap into the OS).

    One could have *disk controller registers* attached by SR-IOV,
    but a disk controller is not a file system.

    So as I understand the SR-IOV mechanism, one would not be reading
    local or remote files over it under any circumstance.
    But my understanding is limited to what the Microsoft driver
    documentation says about it.

  • From EricP@21:1/5 to Chris M. Thomasson on Sat May 18 21:48:14 2024
    Chris M. Thomasson wrote:
    On 5/18/2024 6:40 AM, EricP wrote:
    Chris M. Thomasson wrote:
    On 5/17/2024 12:26 PM, EricP wrote:

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

    For some reason this made me think about getting a blue screen of
    death due to too much non-paged memory being used by too many
    concurrent overlapped IO's on Windows.

    That shouldn't happen as Windows tracks each process's non-paged pool
    allocations and quotas and it should return an error when exceeded,
    though I've never stress tested it.


    I have, wrt NT 4.0 back in the day. It can get to a point where the
    system is totally unresponsive. Then, sometimes, dies. A shit load of
    concurrent overlapped io ops, malloc tends to return NULL, then the
    non-paged memory gets really bad...

    What I have seen, due to what turned out to be a bug in Windows Defender,
    is that in monitoring my internet packets it would leak non-paged pool,
    which then grows, consuming more and more free pages while the system
    gets slower and slower, until finally it hangs and has to be rebooted.
    The solution was to disable Windows Defender, but I only discovered that
    by chance (random thrashing about).

    I've only seen one blue screen in 30 years of using WinNT.
    That was a "Page fault at raised IRQL" in the Microsoft TCP driver
    (basically, a page fault occurred inside a driver, a big no-no)
    back in the 1990's and the replacement driver was already available
    on the Microsoft website.

  • From Thomas Koenig@21:1/5 to All on Sun May 19 11:17:41 2024
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast; I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

  • From Michael S@21:1/5 to Thomas Koenig on Sun May 19 16:23:33 2024
    On Sun, 19 May 2024 11:17:41 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.


    I.e. your actual running frequency was 3700 MHz?

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.


    There exists a middle ground between non-pipelined and fully pipelined
    multiplier/FMA units. In fact, more than one middle ground.
    Here is the mid-middle ground that I can imagine, not being a real hardware
    guy:
    1 - take a pair of existing VSU multipliers. By now they can do
    53x53=>125bit unsigned multiplication. Enhance them to 57x57=>113bit
    2 - during quad-precision FMA, split the 113x113 multiplication into 4
    pieces and run them through the pair of multipliers, two at once.
    That would produce all parts of the 225-bit product at a rate of 1 product
    per 2 clocks
    3 - build adders just sufficient for the same throughput of 1 result
    per 2 clocks.
    Such a combined multiplier will have 2 clocks higher latency than the DP
    multiplier.
    After that we'll need matching alignment and addition/subtraction
    blocks, but by doing them half-pipelined we can utilize the majority of
    the existing dual-DP hardware and would need very little else, except for
    control signals and probably a new feedback data path on the upper
    side of the adder. All that could cost us another clock of latency over
    DP FMA, but not necessarily so.
    Bottom line: QP FMA with a throughput of 1 result per 2 clocks and
    latency of 8 or 9 clocks.
    For POWER8, which has a less distributed VSU, such a modification would be
    somewhat easier than for POWER9.


    That's what I call a mid-middle ground.
    Low-middle ground would be leaving 53x53=>125bit multipliers
    unmodified. 113x113 multiplication is split into 9 pieces and
    product is delivered every 5 clocks.
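
    To make the piece counts concrete, here is a minimal Python sketch of the
    splitting idea: a 113-bit multiplication done only with narrower
    multiplies whose shifted partial products are summed. The function name
    and the "two multipliers, one piece each per clock" cost model are
    assumptions made for illustration; the piece widths 57 and 53 are the
    ones from the post. It says nothing about how a real VSU would wire this.

```python
import random

def split_multiply(a: int, b: int, width: int, piece: int):
    """Multiply two width-bit integers using only piece x piece multiplies.

    Returns (product, number_of_partial_products).
    """
    mask = (1 << piece) - 1
    n = -(-width // piece)  # ceil(width / piece) chunks per operand
    a_parts = [(a >> (i * piece)) & mask for i in range(n)]
    b_parts = [(b >> (j * piece)) & mask for j in range(n)]
    total = 0
    count = 0
    for i, ap in enumerate(a_parts):
        for j, bp in enumerate(b_parts):
            # Each partial product is shifted to its weight and accumulated.
            total += (ap * bp) << ((i + j) * piece)
            count += 1
    return total, count

def clocks_per_product(pieces: int, multipliers: int = 2) -> int:
    """Assumed cost model: each multiplier accepts one piece per clock."""
    return -(-pieces // multipliers)

random.seed(0)
# Random 113-bit significands with the top bit set (normalized operands).
a = random.getrandbits(113) | (1 << 112)
b = random.getrandbits(113) | (1 << 112)

p57, n57 = split_multiply(a, b, 113, 57)   # enhanced multipliers
p53, n53 = split_multiply(a, b, 113, 53)   # unmodified multipliers

print(n57, clocks_per_product(n57))
print(n53, clocks_per_product(n53))
```

    Under that cost model the piece counts line up with the rates quoted
    above: 4 pieces in 2 clocks for the enhanced multipliers, 9 pieces in
    5 clocks for the unmodified ones.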

    High-middle ground is enhancing both VSU pipes and using them to
    process two QP FMAs simultaneously for combined throughput equivalent
    to fully pipelined.

    Another possible high-middle ground is, again, enhancing both VSU pipes
    and using them together on a single QP FMA. That would be potentially
    best for latency, but does not fit well into philosophy of POWER9
    design that tries to minimize high-speed interaction between various
    pipes.

  • From MitchAlsup1@21:1/5 to Michael S on Sun May 19 16:02:01 2024
    Michael S wrote:

    On Sun, 19 May 2024 11:17:41 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.


    I.e. your actual running frequency was 3700 MHz?

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.


    There exists a middle ground between non-pipelined and fully pipelined
    multiplier/FMA units. In fact, more than one middle ground.
    Here is the mid-middle ground that I can imagine, not being a real hardware
    guy:
    1 - take a pair of existing VSU multipliers. By now they can do
    53x53=>125bit unsigned multiplication. Enhance them to 57x57=>113bit
    2 - during quad-precision FMA, split the 113x113 multiplication into 4
    pieces and run them through the pair of multipliers, two at once.
    That would produce all parts of the 225-bit product at a rate of 1 product
    per 2 clocks
    3 - build adders just sufficient for the same throughput of 1 result
    per 2 clocks.
    Such a combined multiplier will have 2 clocks higher latency than the DP
    multiplier.

    That is the slow middle ground, using the multiplier at ½ rate, and it is
    in fact the design point for my low end machine (the div 2 part, not
    the quad precision part).

    Instead, one can use the multiplier at full speed. If, as you state below,
    that FMAC is 9 cycles, DP FMAC here would be 7 cycles and 10 cycles for
    QP FMAC.

    On the other hand, I worry about throughput after I saw a string of
    42 instructions in a row all using the FMAC function unit in one particular
    benchmark.

    After that we'll need matching alignment and addition/subtraction
    blocks, but by doing them half-pipelined we can utilize the majority of
    the existing dual-DP hardware and would need very little else, except for
    control signals and probably a new feedback data path on the upper
    side of the adder. All that could cost us another clock of latency over
    DP FMA, but not necessarily so.
    Bottom line: QP FMA with a throughput of 1 result per 2 clocks and
    latency of 8 or 9 clocks.
    For POWER8, which has a less distributed VSU, such a modification would be
    somewhat easier than for POWER9.


    That's what I call a mid-middle ground.
    Low-middle ground would be leaving 53x53=>125bit multipliers
    unmodified. 113x113 multiplication is split into 9 pieces and
    product is delivered every 5 clocks.

    High-middle ground is enhancing both VSU pipes and using them to
    process two QP FMAs simultaneously for combined throughput equivalent
    to fully pipelined.

    Another possible high-middle ground is, again, enhancing both VSU pipes
    and using them together on a single QP FMA. That would be potentially
    best for latency, but does not fit well into philosophy of POWER9
    design that tries to minimize high-speed interaction between various
    pipes.

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Sun May 19 18:37:51 2024
    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm guessing
    that this could at least be close to needing an extra cycle on its own
    and/or heroic hardware?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to [email protected] on Sun May 19 16:43:40 2024
    [email protected] (MitchAlsup1) writes:
    EricP wrote:


    I think I see what you are getting at - how does this mechanism
    transparently redirect the SR-IOV device request into a network request?

    The other interpretation is that the unprivileged user is never allowed
    direct access to an SR-IOV device--those are restricted to GuestOS (or
    more privileged hypervisor threads).


    Take a look at DPDK or ODP. Both support accessing PCI functions
    directly from unprivileged processes on Intel and ARM64 systems;
    doesn't matter if they're virtual functions via SR-IOV or physical
    functions on a non-SRIOV device.

    The key insight here is that SR-IOV is pretty much invisible to
    the operating system - a virtual function looks just like a
    physical function, and they're located the same way by scanning
    the PCI configuration space via the ECAM. The host physical
    function driver knows how to configure the SR-IOV capability
    to expose the virtual functions, and the guest physical function
    driver accesses one of those VFs thinking it is a standard PF.

    A filesystem driver in an operating system just passes block
    requests to a device driver (SATA/NVMe/FC/NIC), and for this
    purpose SR-IOV is completely invisible to the filesystem
    driver and the device drivers themselves.

  • From Scott Lurndal@21:1/5 to EricP on Sun May 19 16:38:11 2024
    EricP <[email protected]> writes:
    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    MitchAlsup1 wrote:
    Scott Lurndal wrote:

    I would want an option in this SR-IOV mechanism for the guest app to
    tell the guest OS to tell the HV to pin the buffer before starting IO.


    So, what happens if GuestOS thinks the user file is located on a local
    SATA drive, but it is really across some network ?? This works when
    devices are not virtualized since the request is routed to a different
    system where the file is local, accessed and data returned over the
    network.

    Does this mean the application has lost a level of indirection in order
    to have become virtualized ?????

    I don't understand your question.
    My comment was about the consequences of not pinning buffer pages
    before starting an I/O. If those pages were for a mapped file stored
    on a network device it won't be different.

    Most of a user's files are on the local system and SR-IOV works fine, but one
    or more of his files exist on a remote machine accessed over the internet;
    and the user still uses the SR-IOV interface to access those files.

    If by "works fine" you mean is slower and has more overhead than
    just pinning the pages first as DMA I/O does now.

    (Its more work to initiate the I/O, fail when it attempts to DMA,
    interrupt cpu, run ISR which queues a DPC which queues an APC back to
    the thread, which pins the pages, then restarts the I/O,
    than it is to just pin the pages and start and I/O.)

    How does the system provide the "file is local" illusion to a user having
    SR-IOV access to a non-local file.

    For example, user opens a file which is an ln-s (a block containing a
    URL to where the file is remotely stored) but user thinks file is local.

    I think I see what you are getting at - how does this mechanism
    transparently redirect the SR-IOV device request into a network request?

    That link is traditionally established at file open inside the kernel file
    system by cross-linking between a File Control Block (or whatever it's called)
    and a Network Control Block representing that file on the network.
    Each file syscall request is sent to the FCB then forwarded to the NCB
    and out over a network link.

    As I understand it, SR-IOV is a pseudo hardware device control *register*,

    That's not quite correct. Consider PCI (or PCI Express) without
    SR-IOV. The PCI designation for an assignable device entity is
    called a 'function' aka 'physical function'. A PCI express device
    can have up to 8 functions[*] - each of which is a full independent
    controller instance (e.g. a SATA controller, or Network Interface
    Controller).

    There are three distinct address spaces for each function; a
    configuration address space, a memory address space and an
    I/O address space.

    The configuration address space consists of 4096 bytes,
    where the first 32 bytes describe a set of PCI configuration
    registers first defined in the original PCI specification.

    The remaining configuration space consists of two linked
    lists of optional "capabilities", each of which has a specific
    standard set of defined control and status registers.
    There are dozens of optional capabilities, one of which is
    the SR-IOV capability.

    [*] There is a PCI Express extended capability called alternate
    routing interpretation (ARI) which allows a PCI Express device
    to support up to 256 functions, which consumes an entire
    PCI bus.

    Consider a device that advertises the SR-IOV capability in its
    configuration address space. When the operating system scans the ECAM
    associated with the root complex to which the device is attached, it
    will access the first function's (typically function zero) configuration
    space to read the first four bytes (vendor ID and device ID). Those
    are used to index into a driver table and then load the necessary
    driver based on the device ID. The driver will read the SR-IOV
    capability status registers and update them to indicate how many
    virtual functions should be advertised and what the associated
    BAR registers should advertise for the VF BARs.
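
    As a rough illustration of the scan step, the Python sketch below
    computes where a function's 4096-byte configuration space sits inside an
    ECAM window and decodes the first four bytes. The bit layout (bus in
    bits 27:20, device in 19:15, function in 14:12) is the standard non-ARI
    ECAM mapping, not something stated in the post; the function names and
    the fake configuration bytes are made up for the example, and no real
    hardware is touched.

```python
import struct

def ecam_offset(bus: int, device: int, function: int, register: int = 0) -> int:
    """Byte offset of a config register inside an ECAM window (non-ARI layout)."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= register < 4096:
        raise ValueError("register offset must be within the 4096-byte space")
    return (bus << 20) | (device << 15) | (function << 12) | register

def vendor_and_device_id(config_space: bytes):
    """Decode the first four bytes: vendor ID then device ID, little-endian."""
    vendor_id, device_id = struct.unpack_from("<HH", config_space, 0)
    return vendor_id, device_id

# Fabricated 4096-byte configuration space, for illustration only.
fake_config = struct.pack("<HH", 0x1234, 0xABCD) + bytes(4092)

print(hex(ecam_offset(1, 2, 3)))
print([hex(x) for x in vendor_and_device_id(fake_config)])
```

    An OS would add this offset to the ECAM base address reported by
    firmware and use the decoded ID pair as the key into its driver table.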

    A physical function with SR-IOV can support up to 65535
    virtual functions (consuming 256 256-function PCI buses).


    whereas a disk file is a fictional logical device created by the file
    system driver. I don't think one could use SR-IOV to send commands to
    local file system software (maybe it could trap into the OS).

    All of this is not relevant - the filesystem as a concept is independent
    of the underlying block storage mechanisms, and the beauty of
    SR-IOV is that the VFs look to the guest like PFs, so the guest
    just uses the standard non-SRIOV driver that matches the
    vendor/device id in the configuration space for the VF.

    So, when the filesystem needs a block of storage, the filesystem
    will simply initiate a request to the VF as if it were a PF, whether
    it is an NVMe adapter, Fibre Channel adapter, or NIC with SCSI-over-IP.

  • From Michael S@21:1/5 to Terje Mathisen on Sun May 19 20:34:03 2024
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before
    multiplication, the product of multiplication is 226 bits with two MS
    bits != '00'. I don't see how we would ever need more than 229 bits fed
    into accumulation phase and into following normalizer. Of course, all
    bits that are lower than the LS bit have to be collapsed (by OR) into
    the LS bit. Maybe even fewer than 229 bits will do; by now I am not sure.
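
    The collapse-by-OR step can be sketched on plain integers. This is an
    illustration only, with a hypothetical helper name; it treats the
    significand as a non-negative integer and makes no claim about how a
    real datapath would wire it.

```python
def shift_right_sticky(value, shift):
    """Shift right, OR-ing every dropped bit into the new LS bit."""
    if shift <= 0:
        return value
    dropped = value & ((1 << shift) - 1)   # bits that fall off the end
    sticky = 1 if dropped else 0
    return (value >> shift) | sticky
```

    The LS bit then records "something nonzero was below here", which is
    all the rounding logic needs from those bits.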

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Sun May 19 20:52:03 2024
    Terje Mathisen wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    If you organize the multiplications and accumulations from most
    significance towards least significance, this wide effect is
    pipelined away, because you initialize the accumulation with the
    augend and check for zero as multiplies fall out of the tree.

    Terje

  • From MitchAlsup1@21:1/5 to Michael S on Sun May 19 21:07:49 2024
    Michael S wrote:

    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?

    Consider a 128-bit FP container.
    1-bit sign
    15-bit exponent
    113-bit fraction

    The augend can be larger than what comes out of the multiplier, the
    worst case is 113-bits bigger (any bigger and the result of the tree
    becomes irrelevant.)
    The augend can be smaller than what comes out of the multiplier, the
    worst case lines up below the lowest bit that comes out of the tree
    (otherwise it would not participate in rounding).

    Thus we have:
    113-bit register below the multiplier,
    225-bit product
    113-bit incrementer above the multiplier.
    -------
    450-bits. //this might be 2-bits wider than necessary.

    BTW it's 207-bits for DP.
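
    The width arithmetic can be written as a one-line formula. This
    sketch assumes the product is counted as 2n-1 bits, which is what the
    225 listed for n = 113 implies; the function name is hypothetical.
    Summing the three fields exactly as listed gives 451, so the 450
    reads as a round figure, and the 207 quoted for DP matches the same
    formula with n = 52.

```python
def fma_datapath_width(n):
    """n bits below the product + (2n-1)-bit product + n bits above it."""
    below = n
    product = 2 * n - 1
    above = n
    return below + product + above
```

    With a 53-bit DP significand instead of 52, the same formula gives 211.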

    Assuming that subnormal multiplier inputs are normalized before multiplication,

    Bad assumption for HW, maybe acceptable for SW.

    the product of multiplication is 226 bits with two MS
    bits != '00'. I don't see how we would ever need more than 229 bits fed
    into accumulation phase and into following normalizer.

    Augend can be positioned 113-bits above the tree or right below the
    tree, thus the above arithmetic. Until you know the augend position,
    there must be circuitry to determine where the HoB is, deNormalized
    numbers only perturb this by small amounts of logic.

    OH and BTW, one can build a Find-First circuit that is ¼ the size of
    the leading zero predictor and no slower when selecting the HoB to
    normalize. {Academic papers notwithstanding.}

    Of course, all
    bits that are lower than the LS bit have to be collapsed (by OR) into
    the LS bit. Maybe even fewer than 229 bits will do; by now I am not sure.

  • From MitchAlsup1@21:1/5 to BGB on Sun May 19 21:16:31 2024
    BGB wrote:

    On 5/19/2024 11:37 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA.  Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition.  This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?


    This sort of thing is part of what makes proper FMA hopelessly
    expensive.

    Getting the LoB correctly rounded showed up the generation prior to
    FMAC showing up.

    Granted, full FMA also allows faking higher precision using SIMD
    vector operations, with math that does not work with double-rounded
    FMA instructions.

    It also enabled error free floating point calculations, but no existing
    FP implementation allows exact FP calculations that do not ALSO SET the
    inexact flag !?!? {Whereas My 66000 gets this right}
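
    A sketch of what "error free" means for a product: the rounded result
    plus a second float that carries everything rounding discarded. Python
    does not portably expose an FMA, so this uses Dekker's product with
    Veltkamp splitting as a stand-in technique (with a true FMA the error
    term would be a single fma(a, b, -p)). It assumes binary64 with
    round-to-nearest and no overflow or underflow; the function names are
    illustrative.

```python
from fractions import Fraction

def split(a):
    """Veltkamp split of a binary64 value into two non-overlapping halves."""
    c = 134217729.0 * a          # 2**27 + 1
    hi = c - (c - a)
    lo = a - hi
    return hi, lo

def two_product(a, b):
    """Return (p, e) with p = fl(a*b) and p + e == a*b exactly."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def is_exact(a, b):
    """Check the pair against exact rational arithmetic."""
    p, e = two_product(a, b)
    return Fraction(p) + Fraction(e) == Fraction(a) * Fraction(b)
```

    Delivered as a pair, no bit of the product is lost, which is the
    situation where setting an inexact flag is arguably wrong.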

    Well, and also an issue if one can "just barely" afford to have a
    single double-precision unit.

    This is NOT an architectural issue, but an implementation choice issue.

    Though, the trick of possibly having four 27-bit multiplies which
    combine into a virtual 54 bit multiplier seems like an interesting
    possibility, though not great as DSPs don't natively handle this size
    (and would be too expensive to stretch it out with LUTs). Likely, one
    would need to build it from 34*34->68 bit multipliers (each costing 4
    DSPs).
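
    The four-multiplier combination is the schoolbook decomposition. A
    minimal sketch on Python integers, assuming unsigned 54-bit operands
    split into 27-bit halves; it shows the arithmetic only and says
    nothing about DSP or LUT cost. The names are hypothetical.

```python
MASK27 = (1 << 27) - 1

def mul54_from_27(a, b):
    """54x54 -> 108-bit product built from four 27x27 partial products."""
    assert 0 <= a < (1 << 54) and 0 <= b < (1 << 54)
    a_hi, a_lo = a >> 27, a & MASK27
    b_hi, b_lo = b >> 27, b & MASK27
    hh = a_hi * b_hi          # weight 2**54
    hl = a_hi * b_lo          # weight 2**27
    lh = a_lo * b_hi          # weight 2**27
    ll = a_lo * b_lo          # weight 2**0
    return (hh << 54) + ((hl + lh) << 27) + ll
```

    The two middle terms share a weight, so they can be added before the
    final shift-and-add.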

    This is your implementation choice coloring what you take as
    architectural decisions.

    In terms of DSP cost, it would be higher than the current solution:
    16 vs 6+4 (10).
    But, possibly lower LUT cost (in both the Binary32 and Binary64
    multipliers, the shortfall is made up using smaller LUT-based
    multipliers).

    We can now fit (5nm) hundreds of GBOoO cores on a single die. The
    difference between a 53×53 tree and a 64×64 tree (makes all problems
    vanish) is not visible at this level (100+ cores on a die).

    This is your implementation choice coloring your thoughts.

    Though, with the combiner option, one could make a case for, say, a:
    S.E15.F66.Z46 format (Z=zeroed/ignored).

    Well, and/or accept the wonk of a Binary128 which produces 112 bits of
    mantissa, but only uses the high 66 bits or so, but generally this was
    worse for some things in some tests than one which simply zeroes the
    low-order bits.

    But it allows for exact FP arithmetic, and for FMAC, ..... and lots of
    other good properties.

    What kind of car do you drive ??

    But, OTOH, 66*66->112 would allow for possible trickery to fake a full
    Binary128 FMUL in software as a multi-part process (when combined with
    a Binary128 FADD).

    A 1-bit wide machine can perform 128 × 128 + 128 FMACs -- it just
    takes more time.

    ....

    Terje


  • From MitchAlsup1@21:1/5 to BGB on Mon May 20 00:10:48 2024
    BGB wrote:

    On 5/19/2024 4:16 PM, MitchAlsup1 wrote:
    BGB wrote:

    On 5/19/2024 11:37 AM, Terje Mathisen wrote:
    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA.  Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition.  This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?


    This sort of thing is part of what makes proper FMA hopelessly
    expensive.

    Getting the LoB correctly rounded showed up the generation prior to
    FMAC showing up.


    Well, in this case, I have neither in a proper sense.

    FMAC operators were sorta faked, but mostly exist because they were
    needed for RV64G, but double-rounded (and not able to expose anything
    that exists below the ULP, unlike proper FMA).

    But FMAC can expose the bits below LoB.


    Granted, full FMA also allows faking higher precision using SIMD
    vector operations, with math that does not work with double-rounded
    FMA instructions.

    It also enabled error free floating point calculations, but no existing
    FP implementation allows exact FP calculations that do not ALSO SET the
    inexact flag !?!? {Whereas My 66000 gets this right}


    Dunno.

    It seems like the existence of anything below the ULP justifies
    setting the inexact flag...

    You misunderstand !!

    When one computes 2 Operands that are single wide, and can deliver a
    single result twice as wide or a pair of results each single wide,
    you are delivering all the bits, so there is no inexact. However, if
    you use more than 1 instruction to perform the calculation, then, you
    HAVE to set an inexact bit even though the delivery of the second
    result makes the first setting of the inexact bit in error !!

    My ISA is expressive enough to do this, just like IEEE 754-2019
    requires on augmented addition and augmented subtraction.
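
    A sketch of the "pair of results" idea for addition, using Knuth's
    TwoSum under the default round-to-nearest-even. This is not
    bit-identical to IEEE 754-2019 augmentedAddition, which specifies
    roundTiesTowardZero; it only illustrates that sum plus error term
    together carry every bit. The function names are illustrative and
    overflow is assumed not to occur.

```python
from fractions import Fraction

def two_sum(a, b):
    """Return (s, e) with s = fl(a + b) and s + e == a + b exactly."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    b_err = b - b_virtual
    a_err = a - a_virtual
    return s, a_err + b_err

def sum_is_exact(a, b):
    """Check the pair against exact rational arithmetic."""
    s, e = two_sum(a, b)
    return Fraction(s) + Fraction(e) == Fraction(a) + Fraction(b)
```

    Taken alone, s is inexact whenever e is nonzero; taken as a pair,
    nothing has been lost.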

    Well, and also an issue if one can "just barely" afford to have a
    single double-precision unit.

    This is NOT an architectural issue, but an implementation choice issue.


    Absent things like microcode or traps, architectural and
    implementation choices are closely tied together. Can't have
    instructions for things which one can't afford the hardware cost to
    implement.

    I understand your limitations--the problem I have is that you express
    your limitations AS-IF others should make the same choices you had to
    make. And that is patently FALSE !!

    Defending an indefensible position under the illusion that "That's all
    I got to work with" is an insufficient defense against someone who has
    more.

    Well, and the usefulness of an FPU is dependent on performance.
    Inaccurate FPU can still be useful, but slow FPU is not.

    Kahan has several lectures about this....

    Though, the trick of possibly having four 27-bit multiplies which
    combine into a virtual 54 bit multiplier seems like an interesting
    possibility, though not great as DSPs don't natively handle this size
    (and would be too expensive to stretch it out with LUTs). Likely, one
    would need to build it from 34*34->68 bit multipliers (each costing 4
    DSPs).

    This is your implementation choice coloring what you take as
    architectural decisions.

    In terms of DSP cost, it would be higher than the current solution:
       16 vs 6+4 (10).
    But, possibly lower LUT cost (in both the Binary32 and Binary64
    multipliers, the shortfall is made up using smaller LUT-based
    multipliers).

    We can now fit (5nm) hundreds of GBOoO cores on a single die. The
    difference between a 53×53 tree and a 64×64 tree (makes all problems
    vanish) is not visible at this level (100+ cores on a die).

    This is your implementation choice coloring your thoughts.


    I can afford FPGAs...
    I can't afford to get an ASIC made.

    I am not asking you to spend big money--I am merely asking you to quit
    defending "doing the wrong thing" when others have to follow standards.
    {{If you properly caveated all your defense statements--I would not
    complain.}}

    So, implementation choices here are:
    FPGA;
    Nothing.

    I have been wondering for a while--are the DSP things you build your
    multiplier out of synthesized by Verilog compilation, or hard coded
    into the gates themselves ?? Because if they are synthesized, you could
    create Verilog that builds the multiplier tree of whatever size you
    need without all the DSP overhead.


    What kind of car do you drive ??


    I don't drive a car...
    I tend to fairly rapidly get tired out if trying to drive.

    I was going to ask if your car had hand rolled windows, a manual
    transmission, ... in the early 1980s all of us were similarly
    constrained, computer architecture grew out of the fast-and-dirty
    modus operandi and into the follow-standards Operandi.

  • From Terje Mathisen@21:1/5 to Michael S on Mon May 20 09:24:16 2024
    Michael S wrote:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before

    They are not, this is part of what you do to make subnormal numbers
    exactly the same speed as normal inputs.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Mon May 20 11:30:45 2024
    On Mon, 20 May 2024 09:24:16 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it
    came to around 18 cycles per FMA. Compared to the 13 cycles for
    the FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so
    it needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra
    cycle on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before

    They are not, this is part of what you do to make subnormal numbers
    exactly the same speed as normal inputs.

    Terje


    1. I am not sure that "the same speed" is a worthy goal even for
    binary64 (for binary32 it is).
    2. It certainly does not sound like a worthy goal for binary128,
    where the probability of encountering sub-normal inputs in real user
    code, rather than in test vectors, is lower than for DP by another
    order of magnitude.
    3. Even if, for reason unclear to me, it is considered the goal, it can
    be achieved by introduction of one more pipeline stage everywhere.
    Since we are discussing high-latency design akin to POWER9, the
    relative cost of another stage would be lower. BTW, according to POWER9
    manual, even for SP/DP FMA the latency is not constant. It varies from
    5 to 7.

    So, IMHO, what you do to handle sub-normal inputs should depend on what
    ends up smaller or faster, not on some abstract principles. For less
    important unit, like binary128, 'smaller' would likely take
    relative precedence over 'faster'. It's possible that you'll end up
    with not doing pre-normalization, but the reason for it would be
    different from 'same speed'.

    Besides, pre-normalization vs wider post-normalization are not the
    only available choices. When the multiplier is naturally segmented
    into 57-bit sections, there exists, for example, an option of
    pre-normalization by full section. It looks very simple on the front
    and saves quite a lot of shifter's width on the back.
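
    One way to read "pre-normalization by full section" is a coarse left
    shift in whole 57-bit steps, leaving the fine shift for later. A
    sketch on Python integers, assuming a significand held in two 57-bit
    sections (114 bits); the section count and the helper name are
    assumptions made for illustration.

```python
SECTION = 57   # section width mentioned above

def prenormalize_by_section(sig, sections=2):
    """Shift left by whole sections until the top section is nonzero.

    Returns (shifted significand, number of bits shifted)."""
    width = SECTION * sections
    assert 0 <= sig < (1 << width)
    if sig == 0:
        return 0, 0
    top_mask = ((1 << SECTION) - 1) << (width - SECTION)
    shift = 0
    while (sig & top_mask) == 0:
        sig <<= SECTION        # coarse step: one full section at a time
        shift += SECTION
    return sig, shift
```

    The front-end mux then only selects between a handful of section
    offsets, and the back-end shifter no longer has to cover leading
    zeros that span an entire section.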

    But the best option is probably described in the above post by Mitch.
    If I understood his post correctly, he suggests having two alignment
    stages: one after multiplication and another one after add/sub. The
    shift count for the first stage is calculated from inputs in parallel
    with multiplication. The first alignment stage does not try to achieve
    a perfect normalization, but it does enough to cut the width of the
    following adder from 3N to 2N+eps.

  • From Thomas Koenig@21:1/5 to Michael S on Mon May 20 10:12:45 2024
    Michael S <[email protected]> schrieb:
    On Sun, 19 May 2024 11:17:41 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    So, I did some more measurements on the POWER9 machine, and it came
    to around 18 cycles per FMA. Compared to the 13 cycles for the
    FMA instruction, this actually sounds reasonable.


    I.e. your actual running frequency was 3700 MHz?

    Approximately, yes.

  • From Anton Ertl@21:1/5 to Michael S on Mon May 20 10:56:48 2024
    Michael S <[email protected]> writes:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:
    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before
    multiplication, the product of multiplication is 226 bits

    The product of the mantissa multiplication is at most 226 bits even if
    you don't normalize subnormal numbers. For cancellation to play a
    role the addend has to be close in absolute value and have the
    opposite sign as the product, so at most one additional bit comes into
    play for that case (for something like the product being
    0111111... and the addend being -10000000...).
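
    A toy version of that cancellation case with small integer
    significands. The 8-bit width and the helper name are illustrative
    only; the point is how far the result has to be shifted left once
    nearly all leading bits cancel.

```python
def cancellation_shift(product, addend, width):
    """Left shift needed to bring |product + addend| back to `width` bits."""
    total = abs(product + addend)
    if total == 0:
        return None                      # exact zero: nothing to normalize
    return width - total.bit_length()
```

    For product 0b01111111 and addend -0b10000000 at width 8, the sum is
    -1 and the required shift is 7 positions.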

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Terje Mathisen@21:1/5 to Anton Ertl on Mon May 20 14:28:50 2024
    Anton Ertl wrote:
    Michael S <[email protected]> writes:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:
    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before
    multiplication, the product of multiplication is 226 bits

    The product of the mantissa multiplication is at most 226 bits even if
    you don't normalize subnormal numbers. For cancellation to play a
    role the addend has to be close in absolute value and have the
    opposite sign as the product, so at most one additional bit comes into
    play for that case (for something like the product being
    0111111... and the addend being -10000000...).

    This is the part of Mitch's explanation that I have never been able to
    totally grok, I do think you could get away with less bits, but only if
    you can collapse the extra mantissa bits into sticky while aligning the
    product with the addend. If that takes too long or it turns out to be
    easier/faster in hardware to simply work with a much wider mantissa,
    then I'll accept that.

    I don't think I've ever seen Mitch make a mistake on anything like this!

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Mon May 20 15:36:30 2024
    On Mon, 20 May 2024 14:22:00 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Mon, 20 May 2024 09:24:16 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it
    came to around 18 cycles per FMA. Compared to the 13 cycles for
    the FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would
    be an entirely different beast, I would expect a throughput of
    1 per cycle and a latency of (maybe) one cycle more than 64-bit
    FMA.
    The FMA normalizer has to handle a maximally bad cancellation, so
    it needs to be around 350 bits wide. Mitch knows of course but
    I'm guessing that this could at least be close to needing an
    extra cycle on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before

    They are not, this is part of what you do to make subnormal numbers
    exactly the same speed as normal inputs.

    Terje


    1. I am not sure that "the same speed" is a worthy goal even for
    binary64 (for binary32 it is).
    2. It's certainly does not sound like a worthy goal for binary128,
    where probability of encountering sub-normal inputs in real user
    code, rather than in test vector, is lower than DP by another order
    of magnitude,
    3. Even if, for reason unclear to me, it is considered the goal, it
    can be achieved by introduction of one more pipeline stage
    everywhere. Since we are discussing high-latency design akin to
    POWER9, the relative cost of another stage would be lower. BTW,
    according to POWER9 manual, even for SP/DP FMA the latency is not
    constant. It varies from 5 to 7.

    So, IMHO, what you do to handle sub-normal inputs should depend on
    what ends up smaller or faster, not on some abstract principles.
    For less important unit, like binary128, 'smaller' would likely take
    relative precedence over 'faster'. It's possible that you'll end up
    with not doing pre-normalization, but the reason for it would be
    different from 'same speed'.

    Besides, pre-normalization vs wider post-normalization are not the
    only available choices. When multiplier is naturally segmented into
    57-bit section, there exists, for example, an option of
    pre-normalization by full section. It looks very simple on the
    front and saves quite a lot of shifter's width on the back.

    But the best option is probably described in above post by Mitch.
    If I understood his post correctly, he suggests to have two
    alignment stages: one after multiplication and another one after
    add/sub. The shift count for a first stage is calculated from
    inputs in parallel with multiplication. The first alignment stage
    does not try to achieve a perfect normalizations, but it does
    enough for cutting the width of following adder from 3N to 2N+eps.

    I do agree with Mitch's suggestion: Allow subnormal inputs but do the
    partial muls from the top and move the normalization starting point
    down for each all-zero input block.

    In an extreme case (subnormal x subnormal) this would allow you to
    discard a lot of partial products.

    Terje


    For subnormal x subnormal you don't need result of multiplication at
    all. All you need to know is if it's zero or not and what sign.
    Even that is needed only in non-default rounding modes and for inexact
    flag in default mode.
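
    This can be observed in binary64 from Python: under the default
    rounding mode the product of two subnormals rounds to a signed zero,
    so only "was it nonzero" and the sign carry information. A sketch
    with a hypothetical helper name; it assumes both inputs are finite
    subnormals (or zero).

```python
import math
import sys

def tiny_product_info(a, b):
    """For subnormal inputs: (exact product is nonzero, sign of product)."""
    nonzero = (a != 0.0) and (b != 0.0)
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return nonzero, sign

# Half the smallest normal binary64 value is a subnormal (2**-1023).
SUBNORMAL = sys.float_info.min / 2
```

    Here SUBNORMAL * -SUBNORMAL evaluates to -0.0, while
    tiny_product_info reports a nonzero exact product with negative sign,
    which is what the inexact flag and directed rounding would need.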

  • From Terje Mathisen@21:1/5 to Michael S on Mon May 20 14:22:00 2024
    Michael S wrote:
    On Mon, 20 May 2024 09:24:16 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it
    came to around 18 cycles per FMA. Compared to the 13 cycles for
    the FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done by
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would be
    an entirely different beast, I would expect a throughput of 1 per
    cycle and a latency of (maybe) one cycle more than 64-bit FMA.

    The FMA normalizer has to handle a maximally bad cancellation, so
    it needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra
    cycle on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before

    They are not, this is part of what you do to make subnormal numbers
    exactly the same speed as normal inputs.

    Terje


    1. I am not sure that "the same speed" is a worthy goal even for
    binary64 (for binary32 it is).
    2. It's certainly does not sound like a worthy goal for binary128,
    where probability of encountering sub-normal inputs in real user code,
    rather than in test vector, is lower than DP by another order of
    magnitude,
    3. Even if, for reason unclear to me, it is considered the goal, it can
    be achieved by introduction of one more pipeline stage everywhere.
    Since we are discussing high-latency design akin to POWER9, the
    relative cost of another stage would be lower. BTW, according to POWER9
    manual, even for SP/DP FMA the latency is not constant. It varies from
    5 to 7.

    So, IMHO, what you do to handle sub-normal inputs should depend on what
    ends up smaller or faster, not on some abstract principles. For less
    important unit, like binary128, 'smaller' would likely take
    relative precedence over 'faster'. It's possible that you'll end up
    with not doing pre-normalization, but the reason for it would be
    different from 'same speed'.

    Besides, pre-normalization vs wider post-normalization are not the only
    available choices. When multiplier is naturally segmented into 57-bit
    section, there exists, for example, an option of pre-normalization by
    full section. It looks very simple on the front and saves quite a lot
    of shifter's width on the back.

    But the best option is probably described in above post by Mitch. If I
    understood his post correctly, he suggests to have two alignment stages:
    one after multiplication and another one after add/sub. The shift count
    for a first stage is calculated from inputs in parallel with
    multiplication. The first alignment stage does not try to achieve a
    perfect normalizations, but it does enough for cutting the width of
    following adder from 3N to 2N+eps.

    I do agree with Mitch's suggestion: Allow subnormal inputs but do the
    partial muls from the top and move the normalization starting point down
    for each all-zero input block.

    In an extreme case (subnormal x subnormal) this would allow you to
    discard a lot of partial products.
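
    A rough sketch of the saving, on Python integers: split each operand
    into 57-bit blocks, walk them from the top, and skip every partial
    product that involves an all-zero block. The block count and names
    are assumptions, and this counts skipped work only; it does not model
    the moving normalization point.

```python
SECTION = 57

def split_blocks(x, blocks):
    """Split x into `blocks` sections of SECTION bits, most significant first."""
    mask = (1 << SECTION) - 1
    return [(x >> (SECTION * i)) & mask for i in reversed(range(blocks))]

def multiply_top_down(a, b, blocks=2):
    """Return (product, number of partial products actually computed)."""
    a_blocks = split_blocks(a, blocks)
    b_blocks = split_blocks(b, blocks)
    product = 0
    computed = 0
    for i, a_blk in enumerate(a_blocks):
        if a_blk == 0:
            continue                     # whole row of partials is zero
        for j, b_blk in enumerate(b_blocks):
            if b_blk == 0:
                continue
            weight = SECTION * ((blocks - 1 - i) + (blocks - 1 - j))
            product += (a_blk * b_blk) << weight
            computed += 1
    return product, computed
```

    Two operands whose top blocks are both zero need one partial product
    instead of four in this two-block layout.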

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Chris M. Thomasson on Mon May 20 11:44:19 2024
    Chris M. Thomasson wrote:
    On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:
    On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:
    On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
    [...]
    I remember a little test that Microsoft made wrt 50,000 concurrent
    OVERLAPPED ops in IOCP vs an event driven model actually creating a
    windows event per connection multiplexing WFMO in several threads.
    The event model did not perform as well, but it did not do too bad
    either. I wonder if I can still find that paper. Back in 2002 or
    something. Hard to remember right now.

    I am having trouble finding it. I do remember:

    https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN



    I just found an old post from me back in 2003 with a link to the paper:
    ___________________
    You can get 50,000+ concurrent connections using IOCP, check out the
    following link:

    http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128

    You do have to do some memory management to get there, like posting zero
    byte receives to ensure that pending recvs don't lock their buffers,
    you can also restrict the amount of pending sends the server has all
    together
    [...]
    ___________________

    The way back machine found it, I think!

    https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128


    Nice!

    Thanks. I have never used the thread-per-client model in my servers.
    I have been using async I/O and Asynchronous Procedure Calls (APC)
    for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
    which were added to Windows later, have similar functionality
    (perhaps IOCP might have slightly better scaling with many cores).

  • From MitchAlsup1@21:1/5 to BGB on Mon May 20 18:04:42 2024
    BGB wrote:

    On 5/19/2024 7:10 PM, MitchAlsup1 wrote:

    Kahan has several lectures about this....


    There have been apparently more things killed off by slow performance
    than by lack of FPU accuracy.

    Say, at the time, performance apparently killed off:
    Amiga (killed off by its slow graphics)
    Bit planar graphics rather sucking if one wants fast screen
    redraws;
    M68K, killed off for being too slow vs x86;
    Cyrix, because its Pentium equivalent was slow at running Quake;
    ...

    Mc68K (most of it at least) is living out its days as an automotive
    engine controller.

  • From Terje Mathisen@21:1/5 to Michael S on Mon May 20 21:17:15 2024
    Michael S wrote:
    On Mon, 20 May 2024 14:22:00 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Mon, 20 May 2024 09:24:16 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:

    Thomas Koenig wrote:
    So, I did some more measurements on the POWER9 machine, and it
    came to around 18 cycles per FMA. Compared to the 13 cycles for
    the FMA instruction, this actually sounds reasonable.

    The big problem appears to be that, in this particular
    implementation, multiplication is not pipelined, but done
    piecewise by addition. This can be explained by the fact that
    this is mostly a decimal unit, with the 128-bit QP just added as
    an afterthought, and decimal multiplication does not happen all
    that often.

    A fully pipelined FMA unit capable of 128-bit arithmetic would
    be an entirely different beast, I would expect a throughput of
    1 per cycle and a latency of (maybe) one cycle more than 64-bit
    FMA.
    The FMA normalizer has to handle a maximally bad cancellation, so
    it needs to be around 350 bits wide. Mitch knows of course but
    I'm guessing that this could at least be close to needing an
    extra cycle on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before

    They are not, this is part of what you do to make subnormal numbers
    exactly the same speed as normal inputs.

    Terje


    1. I am not sure that "the same speed" is a worthy goal even for
    binary64 (for binary32 it is).
    2. It certainly does not sound like a worthy goal for binary128,
    where probability of encountering sub-normal inputs in real user
    code, rather than in test vector, is lower than DP by another order
    of magnitude.
    3. Even if, for reason unclear to me, it is considered the goal, it
    can be achieved by introduction of one more pipeline stage
    everywhere. Since we are discussing high-latency design akin to
    POWER9, the relative cost of another stage would be lower. BTW,
    according to POWER9 manual, even for SP/DP FMA the latency is not
    constant. It varies from 5 to 7.

    So, IMHO, what you do to handle sub-normal inputs should depend on
    what ends up smaller or faster, not on some abstract principles.
    For less important unit, like binary128, 'smaller' would likely take
    relative precedence over 'faster'. It's possible that you'll end up
    with not doing pre-normalization, but the reason for it would be
    different from 'same speed'.

    Besides, pre-normalization vs wider post-normalization are not the
    only available choices. When multiplier is naturally segmented into
    57-bit section, there exists, for example, an option of
    pre-normalization by full section. It looks very simple on the
    front and saves quite a lot of shifter's width on the back.

    But the best option is probably described in above post by Mitch.
    If I understood his post correctly, he suggests to have two
    alignment stages: one after multiplication and another one after
    add/sub. The shift count for a first stage is calculated from
    inputs in parallel with multiplication. The first alignment stage
    does not try to achieve perfect normalization, but it does
    enough for cutting the width of following adder from 3N to 2N+eps.

    I do agree with Mitch's suggestion: Allow subnormal inputs but do the
    partial muls from the top and move the normalization starting point
    down for each all-zero input block.

    In an extreme case (subnormal x subnormal) this would allow you to
    discard a lot of partial products.

    Terje


    For subnormal x subnormal you don't need result of multiplication at
    all. All you need to know is if it's zero or not and what sign.
    Even that is needed only in non-default rounding modes and for inexact
    flag in default mode.

    Yeah, Mea Culpa! I did correct that particular brain fart a few minutes
    later in my subsequent post, but it is not possible for the
    multiplication to produce a result far below the subnormal limit.

    As you note, it is only when using RoundToPlus (or Minus) Infinity that
    an arbitrary small product can still produce a non-zero result.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to BGB on Mon May 20 19:58:18 2024
    BGB wrote:

    On 5/20/2024 7:36 AM, Michael S wrote:

    For subnormal x subnormal you don't need result of multiplication at
    all. All you need to know is if it's zero or not and what sign.
    Even that is needed only in non-default rounding modes and for inexact
    flag in default mode.


    For most non-tiny formats, the seeming advantage of subnormal numbers
    seems small, in any case.

    There is, it is called Posit (or UNUM depending).
    No subnormals, wider range than IEEE, more precision than IEEE
    (most of the time). Whether it is better overall is still a
    matter of debate. It is harder to implement than IEEE but
    just barely.

    But, yeah, in any case I would almost prefer if there could be a
    separate/cheaper standard, probably mostly aimed at
    embedded/microcontroller style use-cases (rather than "general
    purpose"), and would likely relax the requirements a fair bit.

    Say, likely target might be, say:
    FADD/FSUB/FMUL;
    Binary16 and Binary32 as high-priority formats;
    Binary64 as optional (but nice to have);
    Probably DAZ/FTZ;
    Potentially allow for truncate-only rounding.

    Assumption being that larger or higher precision cases would fall back
    to software emulation.


    Could optionally have some 8-bit FP formats, but 8-bit FP is a little
    bit too limited for general-purpose use.

    Likely main candidates being:
    S.E4.F3 (Bias=7)
    S.E3.F4 (Bias=7|8, ~ Unit Range)
    More or less A-Law without the XOR.
    Though, A-Law can also be interpreted as a ~ 12 bit integer value.
    Annoyingly, exact bias depends on context for this one
    (eg: 8/7/3/0)...

    I had also used:
    E4.F4
    E4.F3.S
    But, this is wonky (and the possible merit of E4.F3.S is defeated once
    one also needs S.E4.F3 or S.E3.F4, as these are the "actually used in
    the wild" formats, so may have been a mistake).

    I spent some of my youth trying to push against immovable objects
    (i.e., standards). Don't do it: it is a waste of effort and time,
    similar to putting lipstick on a pig.

  • From Thomas Koenig@21:1/5 to [email protected] on Mon May 20 21:56:22 2024
    MitchAlsup1 <[email protected]> schrieb:
    BGB wrote:

    On 5/20/2024 7:36 AM, Michael S wrote:

    For subnormal x subnormal you don't need result of multiplication at
    all. All you need to know is if it's zero or not and what sign.
    Even that is needed only in non-default rounding modes and for inexact
    flag in default mode.


    For most non-tiny formats, the seeming advantage of subnormal numbers
    seems small, in any case.

    There is, it is called Posit (or UNUM depending).
    No subnormals, wider range than IEEE, more precision than IEEE
    (most of the time). Whether it is better overall is still a
    matter of debate. It is harder to implement than IEEE but
    just barely.

    My guess is that it will never catch on. Having accuracy depend
    on the number range is an idea that people who prove things about
    numerical algorithms tend to dislike.

  • From Thomas Koenig@21:1/5 to BGB on Tue May 21 05:49:55 2024
    BGB <[email protected]> schrieb:

    Granted, they are not necessarily the option one would go if they
    wanted "cheapest possible FPU that is still good enough to be usable".

    Though, the point at which an FPU manages to suck badly enough that one
    needs to resort to software emulation to make software work, is probably
    a lower limit.


    Luckily, "uses 754 formats, but with aggressive cost cutting" can be
    "good enough", and so long as they more-or-less deliver a full width
    mantissa, and can exactly compute exact-value calculations, most
    software is generally going to work.

    This will require extensive testing and possibly modification for
    a lot of software ported to such a system. This will drive up
    the total cost, presumably far more than any hardware savings.

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so there
    is a lower limit here.

    An example of a more interesting question is

    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                printf("We should never get here!\n");
                abort();
            }
        }
    }

  • From Michael S@21:1/5 to Terje Mathisen on Tue May 21 10:46:59 2024
    On Mon, 20 May 2024 21:17:15 +0200
    Terje Mathisen <[email protected]> wrote:

    As you note, it is only when using RoundToPlus (or Minus) Infinity
    that an arbitrary small product can still produce a non-zero result.

    Terje



    I think, we were discussing multiplication stage of FMA rather than
    multiplication proper.
    In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
    all standard IEEE rounding modes except default (RNE).

  • From Michael S@21:1/5 to Thomas Koenig on Tue May 21 11:19:03 2024
    On Tue, 21 May 2024 05:49:55 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Granted, they are not necessarily the option one would go if they
    wanted "cheapest possible FPU that is still good enough to be
    usable".

    Though, the point at which an FPU manages to suck badly enough that
    one needs to resort to software emulation to make software work, is
    probably a lower limit.


    Luckily, "uses 754 formats, but with aggressive cost cutting" can
    be "good enough", and so long as they more-or-less deliver a full
    width mantissa, and can exactly compute exact-value calculations,
    most software is generally going to work.

    This will require extensive testing and possibly modification for
    a lot of software ported to such a system. This will drive up
    the total cost, presumably far more than any hardware savings.

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
    there is a lower limit here.

    An example of a more interesting question is

    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                printf("We should never get here!\n");
                abort();
            }
        }
    }

    If I am not mistaken, that should hold on VAX, which has floating-point
    very close to BGB ideal. It looks like it would hold even on less
    robust formats, like IBM's hex float. I wonder where it is not true?

    The biggest difference between IEEE and VAX is that on IEEE when (a > b)
    then (a - b > 0) while on VAX (a - b >= 0).

    Of course, IEEE has non-intuitive cases as well.
    if (!(a < 0)) {
        if (!(b < 0)) {
            if (!(a + b >= a)) {
                printf("It's IEEE 754, baby!\n");
            }
        }
    }

  • From Terje Mathisen@21:1/5 to Michael S on Tue May 21 13:18:32 2024
    Michael S wrote:
    On Tue, 21 May 2024 05:49:55 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Granted, they are not necessarily the option one would go if they
    wanted "cheapest possible FPU that is still good enough to be
    usable".

    Though, the point at which an FPU manages to suck badly enough that
    one needs to resort to software emulation to make software work, is
    probably a lower limit.


    Luckily, "uses 754 formats, but with aggressive cost cutting" can
    be "good enough", and so long as they more-or-less deliver a full
    width mantissa, and can exactly compute exact-value calculations,
    most software is generally going to work.

    This will require extensive testing and possibly modification for
    a lot of software ported to such a system. This will drive up
    the total cost, presumably far more than any hardware savings.

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
    there is a lower limit here.

    An example of a more interesting question is

    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                printf("We should never get here!\n");
                abort();
            }
        }
    }

    If I am not mistaken, that should hold on VAX, which has floating-point
    very close to BGB ideal. It looks like it would hold even on less
    robust formats, like IBM's hex float. I wonder where it is not true?

    The biggest difference between IEEE and VAX is that on IEEE when (a > b)
    then (a - b > 0) while on VAX (a - b >= 0).

    Of course, IEEE has non-intuitive cases as well.
    if (!(a < 0)) {
        if (!(b < 0)) {
            if (!(a + b >= a)) {
                printf("It's IEEE 754, baby!\n");
            }
        }
    }

    What happens when a and/or b is a NaN?

    Comparisons with NaN should return false, so !(NaN < 0) will be true
    (and the same for b), while !(NaN+b >= NaN) will also return true.

    Is that what you were thinking of?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Michael S@21:1/5 to Terje Mathisen on Tue May 21 14:28:01 2024
    On Tue, 21 May 2024 13:18:32 +0200
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Tue, 21 May 2024 05:49:55 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Granted, they are not necessarily the option one would go if they
    wanted "cheapest possible FPU that is still good enough to be
    usable".

    Though, the point at which an FPU manages to suck badly enough
    that one needs to resort to software emulation to make software
    work, is probably a lower limit.


    Luckily, "uses 754 formats, but with aggressive cost cutting" can
    be "good enough", and so long as they more-or-less deliver a full
    width mantissa, and can exactly compute exact-value calculations,
    most software is generally going to work.

    This will require extensive testing and possibly modification for
    a lot of software ported to such a system. This will drive up
    the total cost, presumably far more than any hardware savings.

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
    there is a lower limit here.

    An example of a more interesting question is

    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                printf("We should never get here!\n");
                abort();
            }
        }
    }

    If I am not mistaken, that should hold on VAX, which has
    floating-point very close to BGB ideal. It looks like it would hold
    even on less robust formats, like IBM's hex float. I wonder where
    it is not true?

    The biggest difference between IEEE and VAX is that on IEEE when
    (a > b) then (a - b > 0) while on VAX (a - b >= 0).

    Of course, IEEE has non-intuitive cases as well.
    if (!(a < 0)) {
        if (!(b < 0)) {
            if (!(a + b >= a)) {
                printf("It's IEEE 754, baby!\n");
            }
        }
    }

    What happens when a and/or b is a NaN?

    Comparisons with NaN should return false, so !(NaN < 0) will be true
    (and the same for b), while !(NaN+b >= NaN) will also return true.

    Is that what you were thinking of?

    Yes


    Terje


  • From MitchAlsup1@21:1/5 to BGB on Tue May 21 17:17:10 2024
    BGB wrote:

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so there
    is a lower limit here.

    1.0 has a fraction of 0
    2.0 has a fraction of 0
    1.0+2.0 has a fraction with a single HOB set.
    all 3 above examples have the hidden bit set.

    Any implementation purporting to be IEEE 754 better not give anything
    other than 3.0 !!

    Bad example. Even IBM and CRAY arithmetics, with all their problems,
    were not that bad.

  • From MitchAlsup1@21:1/5 to Michael S on Tue May 21 17:11:22 2024
    Michael S wrote:

    On Mon, 20 May 2024 21:17:15 +0200
    Terje Mathisen <[email protected]> wrote:

    As you note, it is only when using RoundToPlus (or Minus) Infinity
    that an arbitrary small product can still produce a non-zero result.

    Terje



    I think, we were discussing multiplication stage of FMA rather than
    multiplication proper.
    In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
    all standard IEEE rounding modes except default (RNE).


    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.

    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.

    But that ship sailed 50 years ago.

  • From MitchAlsup1@21:1/5 to BGB on Tue May 21 17:19:02 2024
    BGB wrote:


    Errm, I was promoting the idea of cost-cut floating point, not blatantly
    broken floating point...

    Would you promote the idea where the customer could specify whether his
    car had air bags and crash safety cell or not ??

    Same point here.

  • From Thomas Koenig@21:1/5 to Michael S on Tue May 21 17:53:47 2024
    Michael S <[email protected]> schrieb:
    On Tue, 21 May 2024 05:49:55 -0000 (UTC)
    Thomas Koenig <[email protected]> wrote:

    BGB <[email protected]> schrieb:

    Granted, they are not necessarily the option one would go if they
    wanted "cheapest possible FPU that is still good enough to be
    usable".

    Though, the point at which an FPU manages to suck badly enough that
    one needs to resort to software emulation to make software work, is
    probably a lower limit.


    Luckily, "uses 754 formats, but with aggressive cost cutting" can
    be "good enough", and so long as they more-or-less deliver a full
    width mantissa, and can exactly compute exact-value calculations,
    most software is generally going to work.

    This will require extensive testing and possibly modification for
    a lot of software ported to such a system. This will drive up
    the total cost, presumably far more than any hardware savings.

    But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
    there is a lower limit here.

    An example of a more interesting question is

    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                printf("We should never get here!\n");
                abort();
            }
        }
    }

    If I am not mistaken, that should hold on VAX, which has floating-point
    very close to BGB ideal. It looks like it would hold even on less
    robust formats, like IBM's hex float. I wonder where it is not true?

    IIRC, something like that was possible when mixing 80-bit and 64-bit
    quantities on x387.

    But the code I posted probably does not qualify for this.

  • From Stefan Monnier@21:1/5 to All on Tue May 21 15:56:47 2024
    I think, we were discussing multiplication stage of FMA rather than
    multiplication proper.
    In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
    all standard IEEE rounding mode except default (RNE).
    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.
    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.
    But that ship sailed 50 years ago.

    Wouldn't that just kick the problem down the street?
    For example, what should `x < y` return when both `x` and `y` are
    "infinity+epsilon"?


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Tue May 21 22:20:20 2024
    Stefan Monnier wrote:

    I think, we were discussing multiplication stage of FMA rather than
    multiplication proper.
    In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
    all standard IEEE rounding mode except default (RNE).
    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.
    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.
    But that ship sailed 50 years ago.

    Wouldn't that just kick the problem down the street?
    For example, what should `x < y` return when both `x` and `y` are
    "infinity+epsilon"?

    You mean -infinity+epsilon or +infinity-epsilon. +infinity+epsilon
    is +infinity ...

    IEEE 754 has +infinity == +infinity && -infinity == -infinity


    Stefan

  • From MitchAlsup1@21:1/5 to BGB on Tue May 21 22:12:43 2024
    BGB wrote:


    Currently, FCMPEQ will be false on NaN, whereas FCMPGT ignores NaN's.

    I guess one possibility could be to add an FCMPGE instruction, and then
    make both FCMPGT and FCMPGE be false on NaN (where the LT and LE cases
    can be handled by flipping the arguments, so would still be false on
    NaN).

    So, as-is:
    if(!(a==a))
    {
        //NaN
    }

    But, as-is:
    if(a>0)
    {
        //May still potentially get here with NaN
    }

    Not when the compare is true to IEEE 754. When there is a floating
    point compare and one of the operands is NaN, none of the 6 standard
    comparison forms are true and control transfers to the else-clause.

    In your top example the ! (not) causes NaNs to go to the then-clause

    In your bottom example, no NaN is allowed into the then-clause.

  • From Terje Mathisen@21:1/5 to All on Wed May 22 09:12:55 2024
    MitchAlsup1 wrote:
    BGB wrote:


    Errm, I was promoting the idea of cost-cut floating point, not blatantly
    broken floating point...

    Would you promote the idea where the customer could specify whether his
    car had air bags and crash safety cell or not ??

    Same point here.

    We do have that in the form of having optional air bags: When we bought
    a Skoda Octavia many years ago, the head-protecting upper air bags were
    not required by law so we paid to get them as an option.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to All on Wed May 22 09:10:40 2024
    MitchAlsup1 wrote:
    Michael S wrote:

    On Mon, 20 May 2024 21:17:15 +0200
    Terje Mathisen <[email protected]> wrote:

    As you note, it is only when using RoundToPlus (or Minus) Infinity
    that an arbitrary small product can still produce a non-zero result.

    Terje



    I think, we were discussing multiplication stage of FMA rather than
    multiplication proper.
    In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
    all standard IEEE rounding mode except default (RNE).


    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.

    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.

    But that ship sailed 50 years ago.

    Not entirely: I was recently very surprised to learn that in
    non-default rounding modes, you can in fact get behavior close to but
    not quite what you want. I.e. if rounding would cause overflow from
    maximally large normal to infinity, then the rounding up is suppressed.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to All on Wed May 22 10:22:33 2024
    MitchAlsup1 wrote:


    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.

    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.

    But that ship sailed 50 years ago.

    If an overflow occurs and that exception is masked, x86 returns a
    value of either +-largest finite number (LFN) or +-infinity (INF),
    depending on the rounding mode.
    (vol-1 Arch manual, section 4.9.1.4, Numeric Overflow Exception,
    Table 4-10. Masked Responses to Numeric Overflow)

    Rounding_Mode  Sign_of_Result  Result
    -------------  --------------  -------------------------------
    To nearest     +               +∞
                   –               –∞
    Toward –∞      +               Largest finite positive number
                   –               –∞
    Toward +∞      +               +∞
                   –               Largest finite negative number
    Toward zero    +               Largest finite positive number
                   –               Largest finite negative number


    The difference seems to be that INF is a sticky overflow, LFN is not.
    Would this not satisfy everyone?

    The problem is that it requires diddling the control register to change
    the round mode, as opposed to round mode on each float instruction.
    Or maybe even the LFN/INF overflow choice should be a separate option
    independent of round control bits.

  • From MitchAlsup1@21:1/5 to EricP on Wed May 22 16:52:32 2024
    EricP wrote:

    MitchAlsup1 wrote:


    Imagine, instead, if IEEE 754 had defined positive underflow with the
    result of positive tiny, negative underflow with negative tiny,
    positive overflow with positive infinity-epsilon and negative
    overflow with negative infinity+epsilon.

    Here, the fact overflow or underflow happened is recorded in the
    result, and these results remain identifiable from real infinities
    or real zeros.

    But that ship sailed 50 years ago.

    If an overflow occurs and that exception is masked, x86 returns a
    value of either +-largest finite number (LFN) or +-infinity (INF),
    depending on the rounding mode.
    (vol-1 Arch manual, section 4.9.1.4, Numeric Overflow Exception,
    Table 4-10. Masked Responses to Numeric Overflow)

    Rounding_Mode  Sign_of_Result  Result
    -------------  --------------  -------------------------------
    To nearest     +               +∞
                   –               –∞
    Toward –∞      +               Largest finite positive number
                   –               –∞
    Toward +∞      +               +∞
                   –               Largest finite negative number
    Toward zero    +               Largest finite positive number
                   –               Largest finite negative number


    The difference seems to be that INF is a sticky overflow, LFN is not.
    Would this not satisfy everyone?

    The problem is that it requires diddling the control register to change
    the round mode, as opposed to round mode on each float instruction.
    Or maybe even the LFN/INF overflow choice should be a separate option
    independent of round control bits.

    Yes, what we need is a rounding mode that suppresses overflow and
    underflow but is otherwise RNE.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Thu May 23 21:39:12 2024
    BGB-Alt wrote:

    On 5/20/2024 7:28 AM, Terje Mathisen wrote:
    Anton Ertl wrote:
    Michael S <[email protected]> writes:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:
    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before
    multiplication, the product of multiplication is 226 bits

    The product of the mantissa multiplication is at most 226 bits even if
    you don't normalize subnormal numbers.  For cancellation to play a
    role the addend has to be close in absolute value and have the
    opposite sign as the product, so at most one additional bit comes into
    play for that case (for something like the product being
    0111111... and the addend being -10000000...).

    This is the part of Mitch's explanation that I have never been able to
    totally grok. I do think you could get away with fewer bits, but only if
    you can collapse the extra mantissa bits into sticky while aligning the
    product with the addend. If that takes too long or it turns out to be
    easier/faster in hardware to simply work with a much wider mantissa,
    then I'll accept that.
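
    "Collapsing bits into sticky" during alignment can be sketched in C as
    a right shift that ORs every shifted-out bit into the least significant
    bit of the result. This is the generic software idiom on a 64-bit word,
    not the FMA datapath discussed here; the function name is hypothetical.

```c
#include <stdint.h>

/* Shift mant right by n bits; the LSB of the result is set if any
   nonzero bit was shifted out (the "sticky" bit). */
uint64_t shift_right_sticky(uint64_t mant, unsigned n)
{
    if (n == 0)
        return mant;
    if (n >= 64)
        return mant != 0;            /* everything shifted out */
    uint64_t lost = mant & ((UINT64_C(1) << n) - 1);
    return (mant >> n) | (uint64_t)(lost != 0);
}
```

    Rounding then only needs the guard bit and this one sticky bit rather
    than the full tail of the aligned operand.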

    I don't think I've ever seen Mitch make a mistake on anything like
    this!


    It is a mystery, though seems like maybe Binary128 FMA could be done in
    software via an internal 384-bit intermediate?...

    My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
    228 bits. If one adds another 116 bits (for maximal FADD), this comes
    to 344.

    Maximal product with minimal augend::

        pppppppp-pppppppp-aaaaaaaa

    Maximal augend with minimal product::

        aaaaaaaa-pppppppp-pppppppp

    So the way one builds HW is to have the augend shifter cover the whole
    length and place the product in the middle::

           max                        min
        aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
                 pppppppp-pppppppp

    The output of the product is still in carry-save form and the augend is
    in pure binary so the adder is 3-input for 2×-width. This generates a
    carry into the high order incrementor.

    So one has a sticky generator for the right hand side augend, and an
    incrementor for the left hand side augend. When doing high speed
    denormals one cannot count on the left hand side of the product to have
    HoBs set, with the standard ramifications (imagine a denorm product and
    a denorm augend and you want the right answer.)

    Any way you cook it, you have a 4× wide intermediate (minus 2 bits
    IIRC): 4×112 = 448, and 448 - 2 = 446.

    There is a reason these things are not standard at this point of
    technology.

    Could you do it (IEEE accuracy) with less HW--yes, but only if you
    allow certain special cases to take more cycles in calculation. At a
    certain point (a point made by Terje) it is easier to implement with
    wide integer calculations 128+128 and/or 128*128 along with double
    width shifts, inserts, and extracts.

    IEEE did not make these things any easier by having a 2× std width
    fraction have 2×+3 bits of length requiring 8 multiplications with
    minimal HW instead of 4 multiplications. On the other hand IBM did us
    no favors with Hex FP either (keeping the exponent size the same and
    having 2×+8 bits of fraction.)

    In this case, 384 bits would be because my "_BitInt" support code pads
    things to a multiple of 128 bits (for integer types larger than 256
    bits).

    It isn't fast, but I am not against having Binary128 being slower,
    since if one is using Binary128 ("long double" or "__float128" in this
    case), it is likely the case that precision is more a priority than
    speed.

    Though, as of yet, there is no Binary128 FMA operation (in the software
    runtime). Could potentially add this in theory.


    I guess another possibility would be to add the
    FADDX/FMULX/FMACX instructions in a form where they are allowed, but
    are turned into runtime traps (these would likely be routed through the
    TLB Miss ISR, which thus far has ended up as a catch-all for this sort
    of thing...).

    Though, likely more efficient would still be "just use the runtime
    calls".

    Terje



  • From Terje Mathisen@21:1/5 to All on Fri May 24 09:07:35 2024
    MitchAlsup1 wrote:
    BGB-Alt wrote:

    On 5/20/2024 7:28 AM, Terje Mathisen wrote:
    Anton Ertl wrote:
    Michael S <[email protected]> writes:
    On Sun, 19 May 2024 18:37:51 +0200
    Terje Mathisen <[email protected]> wrote:
    The FMA normalizer has to handle a maximally bad cancellation, so it
    needs to be around 350 bits wide. Mitch knows of course but I'm
    guessing that this could at least be close to needing an extra cycle
    on its own and/or heroic hardware?

    Terje


    Why so wide?
    Assuming that subnormal multiplier inputs are normalized before
    multiplication, the product of multiplication is 226 bits

    The product of the mantissa multiplication is at most 226 bits even if
    you don't normalize subnormal numbers.  For cancellation to play a
    role the addend has to be close in absolute value and have the
    opposite sign as the product, so at most one additional bit comes into
    play for that case (for something like the product being
    0111111... and the addend being -10000000...).

    This is the part of Mitch's explanation that I have never been able to
    totally grok. I do think you could get away with fewer bits, but only if
    you can collapse the extra mantissa bits into sticky while aligning the
    product with the addend. If that takes too long or it turns out to be
    easier/faster in hardware to simply work with a much wider mantissa,
    then I'll accept that.

    I don't think I've ever seen Mitch make a mistake on anything like
    this!


    It is a mystery, though seems like maybe Binary128 FMA could be done in
    software via an internal 384-bit intermediate?...

    My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
    228 bits. If one adds another 116 bits (for maximal FADD), this comes
    to 344.

    Maximal product with minimal augend::

        pppppppp-pppppppp-aaaaaaaa

    Maximal augend with minimal product

        aaaaaaaa-pppppppp-pppppppp

    So the way one builds HW is to have the augend shifter cover the whole
    length and place the product in the middle::

           max                        min
        aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
                 pppppppp-pppppppp

    The output of the product is still in carry-save form and the augend is
    in pure binary so the adder is 3-input for 2×-width. This generates a
    carry into the high order incrementor.

    So one has a sticky generator for the right hand side augend, and an
    incrementor for the left hand side augend. When doing high speed
    denormals one cannot count on the left hand side of the product to have
    HoBs set, with the standard ramifications (imagine a denorm product and
    a denorm augend and you want the right answer.)

    Any way you cook it, you have a 4× wide intermediate (minus 2 bits
    IIRC): 4×112 = 448, and 448 - 2 = 446.

    There is a reason these things are not standard at this point of
    technology.

    So this is basically due to the product part still being in carry-save
    format, so it cannot easily be moved/aligned, instead the augend has to
    be able to move to either side of it. OK, that makes sense!



    Could you do it (IEEE accuracy) with less HW--yes, but only if you
    allow certain special cases to take more cycles in calculation. At a
    certain point (a point made by Terje) it is easier to implement with
    wide integer calculations 128+128 and/or 128*128 along with double
    width shifts, inserts, and extracts.

    IEEE did not make these things any easier by having a 2× std width
    fraction have 2×+3 bits of length requiring 8 multiplications with
    minimal HW instead of 4 multiplications. On the other hand IBM did us
    no favors with Hex FP either (keeping the exponent size the same and
    having 2×+8 bits of fraction.)

    This is an intentional feature, not a bug!

    By making sure that all ieee larger formats have a mantissa with at
    least 2n+3 bits compared to the smaller format below, you avoid all
    double rounding issues if you do a calculation in the larger format and
    then immediately store it back to a smaller format container.

    By also having a wider exponent you can do things like sqrt(x^2+y^2)
    and completely avoid spurious overflows during the squaring ops: As
    long as the final result fits in float, it will be the correct result.

    We started out with 1:8:23 and 1:11:52, then we got 1:15:112 at the
    higher end and 1:5:10 for fp16 and 1:3:4 for fp8.

    Do note that the 8 and 16-bit variants do break the 2n+3 rule, also note
    that the AI training people like truncated 32-bit, i.e. 1:8:7 which
    keeps the full float range but with ~1/3 the mantissa resolution.

    Anyway, doing fp128 in SW I would of course do it using u64 unsigned
    integer ops: FMUL128 becomes 4 64x64->128 MUL ops plus the
    adding/merging of the terms and a bunch of book keeping work on the
    signs and exponents.
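
    The four-multiply core of such an FMUL128 might look like the sketch
    below: a 128x128->256 unsigned multiply built from four 64x64->128
    partial products. It assumes the GCC/Clang `unsigned __int128`
    extension on a 64-bit target; the u256 type and the function name are
    hypothetical, and the sign, exponent, normalization and rounding work
    is left out entirely.

```c
#include <stdint.h>

/* 256-bit result as four 64-bit limbs, least significant first. */
typedef struct { uint64_t w[4]; } u256;

/* (a_hi:a_lo) * (b_hi:b_lo) using four 64x64->128 multiplies. */
u256 mul_128x128(uint64_t a_lo, uint64_t a_hi, uint64_t b_lo, uint64_t b_hi)
{
    typedef unsigned __int128 u128;   /* compiler extension, assumed available */
    u128 p00 = (u128)a_lo * b_lo;     /* weight 2^0   */
    u128 p01 = (u128)a_lo * b_hi;     /* weight 2^64  */
    u128 p10 = (u128)a_hi * b_lo;     /* weight 2^64  */
    u128 p11 = (u128)a_hi * b_hi;     /* weight 2^128 */

    u256 r;
    r.w[0] = (uint64_t)p00;
    /* Column sums stay far below 2^128, so u128 accumulators cannot wrap. */
    u128 mid = (p00 >> 64) + (uint64_t)p01 + (uint64_t)p10;
    r.w[1] = (uint64_t)mid;
    u128 hi = (mid >> 64) + (p01 >> 64) + (p10 >> 64) + (uint64_t)p11;
    r.w[2] = (uint64_t)hi;
    r.w[3] = (uint64_t)(p11 >> 64) + (uint64_t)(hi >> 64);
    return r;
}
```

    For binary128 the two 113-bit significands give a product of at most
    226 bits, so the top limb always has headroom.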

    With a single fully pipelined integer multiplier taking 4 cycles, this
    would be 7 cycles for the MULs, with the last three cycles overlapped
    with the initial ADD/ADC operations. Seems like it could be doable in
    sub-20 cycles?

    I'm assuming the CPU to be wide enough that the special cases can be
    handled in parallel with the default/normal inputs case, also assuming
    reg-reg MOVes to be zero cycles, handled in the renamer, in order to
    overcome the dedicated register (RDX) issue which we have retained even
    using MULX.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From EricP@21:1/5 to Chris M. Thomasson on Fri May 24 09:43:19 2024
    Chris M. Thomasson wrote:
    On 5/20/2024 8:44 AM, EricP wrote:
    Chris M. Thomasson wrote:
    On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:
    On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:
    On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
    [...]
    I remember a little test that Microsoft made wrt 50,000 concurrent
    OVERLAPPED ops in IOCP vs an event driven model actually creating
    a windows event per connection multiplexing WFMO in several
    threads. The event model did not perform as well, but it did not
    do too bad either. I wonder if I can still find that paper. Back
    in 2002 or something. Hard to remember right now.

    I am having trouble finding it. I do remember:

    https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN



    I just found an old post from me back in 2003 with a link to the paper:
    ___________________
    You can get 50,000+ concurrent connections using IOCP, check out the
    following link:

    http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128

    You do have to do some memory management to get there, like posting
    zero byte receives to ensure that pending recvs don't lock their
    buffers, you can also restrict the amount of pending sends the server
    has all together
    [...]
    ___________________

    The way back machine found it, I think!

    https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128


    Nice!

    Thanks. I have never used the thread-per-client model in my servers.
    I have been using async I/O and Asynchronous Procedure Calls (APC)
    for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
    which were added to Windows later, have similar functionality
    (perhaps IOCP might have slightly better scaling with many cores).

    I never tried APC wrt IO and a socket server on windows before! I have
    created several types wrt the book I finally found on the wayback
    machine. IOCP was the thing to use.

    APC's are I/O completion callback subroutines with 1 to 3 arguments.
    I use them to build callback-driven state machines for each network
    I/O channel, similar to device drivers. Each network channel requires
    only a small amount of user mode memory, and all server network connections
    can be serviced by a single thread or a small fixed pool of comms threads.
    This keeps the cost for each new connection to mostly just the kernel's
    network memory.

    WinNT originally only had APC's. It inherited the concept from
    VMS's Asynchronous System Trap (AST), which inherited the concept
    from RSX-11 on PDP-11.

    The difference between Windows APC and RSX/VMS AST is how they are
    delivered. Despite the name, Windows user mode APC's are NOT delivered
    to the thread asynchronously as interrupts but only at specified
    delivery points, which means user mode APC's are essentially a
    synchronous polled delivery, whereas VMS AST's are delivered at any
    time using interrupt semantics. (Windows does have real
    asynchronous-delivery APC's but inside the kernel where they are used
    to interrupt or wake up a thread for I/O completion and various other
    things.)

    User mode APC's are simpler from a user mode programming point of view
    than VMS's AST's, but because APC's have a polled delivery you can't
    use user mode APC's to interrupt and force a thread to do something, as
    AST's can.

  • From MitchAlsup1@21:1/5 to BGB-Alt on Fri May 24 23:29:13 2024
    BGB-Alt wrote:

    Don't you EVER snip useless crap off the top of your posts ??

    In my case, BGBCC supports __int128 operations whether or not the ALUX instructions are enabled (along with _BitInt, *1).

    I am guessing that your CPU does not do 128, 256, 384 bit calculations natively, but the compiler supports those by emitting sequences of instructions.

    *1:
      _BitInt(56)  x0;  //maps to 64-bit
      _BitInt(64)  x1;  //maps to 64-bit
      _BitInt(80)  x2;  //maps to 128-bit
      _BitInt(128) x3;  //maps to 128-bit
      _BitInt(160) x4;  //maps to 256-bit
      _BitInt(256) x5;  //maps to 256-bit
      _BitInt(272) x6;  //maps to 384-bit
      ...
    All sizes beyond 256 bit mapping to the next integer multiple of 128
    bits. The 256-bit type is special, in that it has its own dedicated
    logic, but exists via the _BitInt type. For 384 and beyond, generic
    logic is used that deals with any size value, but is slower.
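
    The container mapping listed above can be expressed as a small C
    function. This is a sketch inferred only from the sizes in that list;
    the function name is hypothetical, and treating every width up to 64 as
    a 64-bit container is an assumption (the list only shows 56 and 64).

```c
/* Container width in bits for a _BitInt(n), following the listed mapping. */
unsigned bitint_container_bits(unsigned n)
{
    if (n <= 64)
        return 64;               /* assumption: all small widths use 64 */
    if (n <= 128)
        return 128;
    if (n <= 256)
        return 256;              /* the 256-bit type has dedicated logic */
    return (n + 127) / 128 * 128; /* beyond 256: next multiple of 128 */
}
```

    For example, bitint_container_bits(272) gives 384, matching the
    384-bit intermediate mentioned for a Binary128 FMA.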

    Can note that in my implementation, BitInt does not enforce modulo
    behavior in the case of overflow (it is modulo only to the size of the container; enforcing odd-bit modulo behavior would add a fair bit of
    cost to using them).

    The multi-precision arithmetic in My 66000 supports rather arbitrary
    width calculations, although only those 256-bits and smaller can be
    considered fast and/or efficient. In addition, both signed and
    unsigned multi-precision arithmetic is available.


    Terje


  • From George Neuner@21:1/5 to [email protected] on Sun May 26 16:51:31 2024
    On Fri, 24 May 2024 09:43:19 -0400, EricP
    <[email protected]> wrote:

    Chris M. Thomasson wrote:
    On 5/20/2024 8:44 AM, EricP wrote:

    Thanks. I have never used the thread-per-client model in my servers.
    I have been using async I/O and Asynchronous Procedure Calls (APC)
    for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
    which were added to Windows later, have similar functionality
    (perhaps IOCP might have slightly better scaling with many cores).

    I never tried APC wrt IO and a socket server on windows before! I have
    created several types wrt the book I finally found on the wayback
    machine. IOCP was the thing to use.

    APC's are I/O completion callback subroutines with 1 to 3 arguments.
    I use them to build callback-driven state machines for each network
    I/O channel, similar to device drivers. Each network channel requires
    only a small amount of user mode memory, and all server network connections
    can be serviced by a single thread or a small fixed pool of comms threads.
    This keeps the cost for each new connection to mostly just the kernel's
    network memory.

    WinNT originally only had APC's. It inherited the concept from
    VMS's Asynchronous System Trap (AST), which inherited the concept
    from RSX-11 on PDP-11.

    I can't speak to "originally" as I never used NT3.x, but NT4.x allowed asynchronous I/O calls to signal events on completion (or failure). I
    used events with WaitForMultipleObjects [*] to mix file and socket
    operations in single-thread servers.

    [*] like select() or poll() in Unix. For a long time the Windows
    "WaitFor..." calls could NOT directly monitor sockets (sockets were
    not files), but they could monitor user events, and both the
    file and socket APIs supported using completion events.


    APCs might have been more efficient, but I only ever used them in
    conjunction with threads - I never tried to write a single-thread
    server that performed operations on multiple files or sockets using
    only APC.


    The difference between Windows APC and RSX/VMS AST is how they are
    delivered. Despite the name, Windows user mode APC's are NOT delivered
    to the thread asynchronously as interrupts but only at specified
    delivery points, which means user mode APC's are essentially a
    synchronous polled delivery, whereas VMS AST's are delivered at any
    time using interrupt semantics. (Windows does have real
    asynchronous-delivery APC's but inside the kernel where they are used
    to interrupt or wake up a thread for I/O completion and various other
    things.)

    User mode APC's are simpler from a user mode programming point of view
    than VMS's AST's, but because APC's have a polled delivery you can't
    use user mode APC's to interrupt and force a thread to do something, as
    AST's can.

  • From EricP@21:1/5 to George Neuner on Mon May 27 10:46:30 2024
    George Neuner wrote:
    On Fri, 24 May 2024 09:43:19 -0400, EricP
    <[email protected]> wrote:

    Chris M. Thomasson wrote:
    On 5/20/2024 8:44 AM, EricP wrote:

    Thanks. I have never used the thread-per-client model in my servers.
    I have been using async I/O and Asynchronous Procedure Calls (APC)
    for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
    which were added to Windows later, have similar functionality
    (perhaps IOCP might have slightly better scaling with many cores).
    I never tried APC wrt IO and a socket server on windows before! I have
    created several types wrt the book I finally found on the wayback
    machine. IOCP was the thing to use.
    APC's are I/O completion callback subroutines with 1 to 3 arguments.
    I use them to build callback-driven state machines for each network
    I/O channel, similar to device drivers. Each network channel requires
    only a small amount of user mode memory, and all server network connections
    can be serviced by a single thread or a small fixed pool of comms threads.
    This keeps the cost for each new connection to mostly just the kernel's
    network memory.

    WinNT originally only had APC's. It inherited the concept from
    VMS's Asynchronous System Trap (AST), which inherited the concept
    from RSX-11 on PDP-11.

    I can't speak to "originally" as I never used NT3.x, but NT4.x allowed asynchronous I/O calls to signal events on completion (or failure). I
    used events with WaitForMultipleObjects [*] to mix file and socket
    operations in single-thread servers.

    Since NT 3.1 IO completion was indicated by either an event flag set or APC. Internally it uses a kernel mode APC to wake up the thread,
    which then cleans up after the IO and the last thing that APC
    does is either repost itself to the thread as a user mode APC
    or it sets the requested event flag and deletes itself.
    Later they added IO Completion Ports.

    But often layered software packages didn't support this which is
    why I make direct Windows OS calls whenever possible.

    [*] like select() or poll() in Unix. For a long time the Windows
    "WaitFor..." calls could NOT directly monitor sockets (sockets were
    not files), but they could monitor user events, and both the
    file and socket APIs supported using completion events.

    The problem with the event approach is that WaitForXxx only allows up to
    64 wait objects, and you would need 2 events per network connection,
    one for send and one for receive. That WaitFor limit in turn forces the
    thread-per-client model.

    Whereas I want a server to wait for arbitrary numbers of clients IO's, hundreds, thousands, tens of thousands..., as many as kernel memory allows.

    That's not to say that completion routines don't have their idiosyncrasies. Like everything in Windows land, you have to discover these.

    I originally used named pipes between Windows machines.
    But if I was using sockets I would have used direct calls to WSA
    like WSARecv and WSASend which do support completion routines,
    and not used the standard socket interface libraries.

    https://learn.microsoft.com/en-us/windows/win32/winsock/winsock-functions
    https://learn.microsoft.com/en-us/windows/win32/api/Winsock2/nf-winsock2-wsarecv

    None of my Windows code will ever be ported to another platform so compatibility with *nix is irrelevant and the Linux AIO functions
    are completely different anyway.

    APCs might have been more efficient, but I only ever used them in
    conjunction with threads - I never tried to write a single-thread
    server that performed operations on multiple files or sockets using
    only APC.

    I don't use them for execution speed efficiency, I use them for the
    internal code structure they allow and to minimize kernel resource usage.

    One still needs a pool of worker threads to deal with things like
    CreateFile, which does not support async file open and blocks the
    calling thread until finished, which would be a disaster for a server
    thread and can hang a client GUI interface too.

    Also there are functions like WSAAccept or closesocket which don't support completion routines and look like they can potentially block/hang so you
    can't do everything through completion callbacks.

  • From EricP@21:1/5 to Chris M. Thomasson on Sun Jun 2 10:49:23 2024
    Chris M. Thomasson wrote:

    It's a bit difficult for me to remember right now, but did you use
    Kernel APC's? Iirc, the pthread-win32 lib used kernel APC's to help
    emulate breaking into a thread at any time. Think of PTHREAD_CANCEL_ASYNCHRONOUS. Here is the lib:

    https://sourceware.org/pthreads-win32/

    Way back, I used this lib all the time. However, I never used pthread cancellation. Just never liked it.

    [...]

    No, I used the normal alertable (synchronous) user mode APC's.

    At some point MS added real asynchronous user mode APC's to the kernel,
    but for some bizarre reason they hobbled it so only they could use them.

    I just noticed now that Windows 11 has added a function to allow
    general posting of asynchronous user mode APC's.

    In Win11, see Special user-mode APCs in QueueUserAPC2 function
    https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc2

    and it seems that it is used to perform pthread_cancel.

    https://repnz.github.io/posts/apc/user-apc/#ntqueueapcthreadex2-some-new-friends-in-the-fast-ring

    When I got the first beta about 30 years ago, I noticed this missing
    functionality so I made my own using SuspendThread, GetThreadContext,
    SetThreadContext, and ResumeThread, which edits the thread context
    and its stack to force a subroutine call onto one of my threads' stacks.
    Very dangerous because pretty much none of Win32 code was fully reentrant.

    Looking at pThreads thread_cancel you can see them doing something similar
    in ptw32_RegisterCancelation if QueueUserAPCEx routine is not available.

    ftp://sourceware.org/pub/pthreads-win32/sources/pthreads-w32-2-9-1-release/pthread_cancel.c

  • From Lawrence D'Oliveiro@21:1/5 to All on Mon Jun 10 02:46:16 2024
    On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

    Emulation is slow when trap overhead is large and not-slow when trap
    overhead is small.

    I think it was a particular version of the old Mac OS, from around 1990 or
    so, that implemented a really amazing hack. Some 32-bit machines had
    hardware floating-point, others didn’t. So developers of numerics-
    intensive apps had to build two versions of their code, one with the floating-point instructions, the other with calls to Apple’s SANE library.

    The hack involved running code built to use hardware floating-point instructions, on hardware that didn’t have them. The instructions were of course trapped and emulated. But more than that, the system would patch
    the instruction that caused the trap, turning it into a direct call into
    the emulation routine. So after the first execution, each such instruction would run much faster. Until the code got unloaded from RAM and the patch
    was lost, of course.

  • From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Mon Jun 10 08:07:28 2024
    Lawrence D'Oliveiro wrote:
    On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

    Emulation is slow when trap overhead is large and not-slow when trap
    overhead is small.

    I think it was a particular version of the old Mac OS, from around 1990 or
    so, that implemented a really amazing hack. Some 32-bit machines had
    hardware floating-point, others didn’t. So developers of
    numerics-intensive apps had to build two versions of their code, one with
    the floating-point instructions, the other with calls to Apple’s SANE
    library.

    The hack involved running code built to use hardware floating-point instructions, on hardware that didn’t have them. The instructions were of
    course trapped and emulated. But more than that, the system would patch
    the instruction that caused the trap, turning it into a direct call into
    the emulation routine. So after the first execution, each such instruction would run much faster. Until the code got unloaded from RAM and the patch
    was lost, of course.

    This only works when each FP instruction is at least as long as a
    function call. This particular approach was standard on PCs more or less
    from the very beginning (i.e. 1981++):

    You could build applications with direct 8087 instructions, with pure sw
    emulation via CALL FDIV_emulation etc, or in a mode where each emitted
    hw fp instruction was followed by enough NOPs to make the total length
    at least 5 bytes: This way the missing HW trap handler could patch them
    into CALLs (possibly followed by one or more NOPs if the HW opcode was
    very long) instead.

    Since all those 8087 instructions were _very_ slow (30-300 clock
    cycles?), executing an extra NOP or two made no discernible difference.
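
    The trap-and-patch idea can be illustrated with a toy interpreter in C.
    This is a simulation only, not 8087 or 68k code: the opcodes, the
    Machine struct and the function names are all hypothetical. A "hardware"
    FP opcode traps on a machine without an FPU, and the trap handler
    rewrites it in place into a direct call opcode.

```c
#include <stddef.h>

/* Hypothetical one-byte opcodes for the toy machine. */
enum { OP_FADD_HW, OP_CALL_FADD_EMU, OP_HALT };

typedef struct {
    int    has_fpu;   /* 0: hardware FP opcodes trap */
    long   traps;     /* number of traps taken so far */
    double acc;       /* accumulator */
} Machine;

/* Software emulation routine that patched code calls directly. */
double fadd_emulated(double a, double b) { return a + b; }

/* Run until OP_HALT; operands[pc] is the operand of the opcode at pc. */
void run(Machine *m, unsigned char *code, const double *operands)
{
    m->acc = 0.0;
    for (size_t pc = 0; code[pc] != OP_HALT; pc++) {
        if (code[pc] == OP_FADD_HW && m->has_fpu) {
            m->acc += operands[pc];            /* native execution */
            continue;
        }
        if (code[pc] == OP_FADD_HW) {
            m->traps++;                        /* slow path: trap taken */
            code[pc] = OP_CALL_FADD_EMU;       /* patch for next time */
        }
        m->acc = fadd_emulated(m->acc, operands[pc]);
    }
}
```

    Running the same code twice on a machine with has_fpu == 0 takes one
    trap per FP opcode on the first pass and none on the second, since
    every opcode has by then been patched into a call.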

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Niklas Holsti@21:1/5 to Terje Mathisen on Mon Jun 10 11:20:14 2024
    On 2024-06-10 9:07, Terje Mathisen wrote:
    Lawrence D'Oliveiro wrote:
    On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

    Emulation is slow when trap overhead is large and not-slow when trap
    overhead is small.

    I think it was a particular version of the old Mac OS, from around
    1990 or so, that implemented a really amazing hack. Some 32-bit
    machines had hardware floating-point, others didn’t. So developers of
    numerics-intensive apps had to build two versions of their code, one
    with the floating-point instructions, the other with calls to Apple’s
    SANE library.

    The hack involved running code built to use hardware floating-point
    instructions, on hardware that didn’t have them. The instructions
    were of course trapped and emulated. But more than that, the system
    would patch the instruction that caused the trap, turning it into a
    direct call into the emulation routine. So after the first execution,
    each such instruction would run much faster. Until the code got
    unloaded from RAM and the patch was lost, of course.

    This only works when each FP instruction is at least as long as a
    function call. This particular approach was standard on PCs more or less
    from the very beginning (i.e. 1981++):


    I believe that the same approach (trap and patch) was used in the HP
    2100 computers that I used in the early 1980's, which were designed in
    the 1960's. I don't think any NOPs were needed to match instruction
    lengths for these machines.

    You can also do it the other way around: always compile a function call,
    but on a machine that has an FPU use a dummy emulation library that back-patches the call to become an FPU instruction, so that each
    emulation function is called at most once.

    To be honest, I'm not sure which way around the HP 2100 used.

  • From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Tue Jun 11 05:58:21 2024
    On Mon, 10 Jun 2024 11:20:14 +0300, Niklas Holsti wrote:

    You can also do it the other way around: always compile a function call,
    but on a machine that has an FPU use a dummy emulation library that back-patches the call to become an FPU instruction, so that each
    emulation function is called at most once.

    To be honest, I'm not sure which way around the HP 2100 used.

    Hmm, maybe that was the way round it was done on the Mac as well, and I am misremembering.
