Forum: >>> Magnum BBS <<<

Making Lemonade (Floating-point format changes)

From John Savard@21:1/5 to All on Sat May 11 21:44:45 2024

I've made another long-overdue change in the Concertina II
architecture on the page about 17-bit instructions.

Since I describe the individual instructions there, with their opcodes
and what they do, I've illustrated the floating-point formats of the architecture on that page.

The good people in charge of the IEEE 754 standard had seen fit to
define a standard 128-bit floating-point format which included a
hidden first bit.

This annoyed me greatly, because I was going to take the 8087's
temporary real format, and extend the mantissa for my 128-bit format.

I've decided that it's necessary to fully accept the 128-bit standard
and support it in a consistent manner.

Therefore, I have taken the following actions:

I have dropped the option of supporting 80-bit temporary reals
entirely, as they are now incompatible as an internal format.

I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.

And in addition, just as the IBM 704 used two single-precision floats
to make a double-precision float, and the IBM System/360 Model 85
started using two double-precision floats to make an extended
precision float... I've defined how the 256-bit internal format floats
can be doubled up to make a 512-bit float.

I'm not really sure such floating-point precision is useful, but I do
remember some people telling me that higher float precision is indeed
something to be desired. Well, the IEEE 754 standard has forced my
hand.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Savard on Sun May 12 13:46:28 2024

John Savard <[email protected]d> wrote:

I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.

Why not the IEEE binary256 (interchange) format?

https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

[...]

I've defined how the 256-bit internal format floats
can be doubled up to make a 512-bit float.

Such floating point formats have very strange properties.
For example, try defining epsilon so that 1.0+epsilon is the
smallest number larger than 1.0...

IBM just spent a lot of effort to move away from that for POWER.
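
The epsilon problem can be illustrated with a minimal Python sketch. It assumes a doubled-up number is stored as the unevaluated sum of two binary64 values; `dd_value` is a hypothetical helper, not part of any format discussed in this thread.

```python
import math
import sys
from fractions import Fraction

def dd_value(hi: float, lo: float) -> Fraction:
    """Exact value of a doubled-up number stored as the unevaluated sum hi + lo."""
    return Fraction(hi) + Fraction(lo)

# Plain binary64: the gap above 1.0 is fixed by the significand width.
eps64 = sys.float_info.epsilon            # 2**-52
assert 1.0 + eps64 > 1.0
assert 1.0 + eps64 / 2 == 1.0             # halfway case rounds back to 1.0

# Doubled-up pair: the low half may sit far below the high half, so the
# smallest value above 1.0 is limited by the exponent range instead.
tiny = math.ldexp(1.0, -1074)             # smallest positive subnormal binary64
gap = dd_value(1.0, tiny) - 1             # distance from 1.0 to its successor
```

Here `gap` is 2**-1074 rather than anything tied to the nominal precision, which is one way to see why epsilon is hard to define for such formats.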

I'm not really sure such floating-point precision is useful, but I do
remember some people telling me that higher float precision is indeed
something to be desired. Well, the IEEE 754 standard has forced my
hand.

How so?

From wolfgang kern@21:1/5 to John Savard on Sun May 12 15:30:40 2024

On 12/05/2024 05:44, John Savard wrote:

I've made another long-overdue change in the Concertina II
architecture on the page about 17-bit instructions.

Since I describe the individual instructions there, with their opcodes
and what they do, I've illustrated the floating-point formats of the architecture on that page.

The good people in charge of the IEEE 754 standard had seen fit to
define a standard 128-bit floating-point format which included a
hidden first bit.

This annoyed me greatly, because I was going to take the 8087's
temporary real format, and extend the mantissa for my 128-bit format.

I've decided that it's necessary to fully accept the 128-bit standard
and support it in a consistent manner.

Therefore, I have taken the following actions:

I have dropped the option of supporting 80-bit temporary reals
entirely, as they are now incompatible as an internal format.

I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.

And in addition, just as the IBM 704 used two single-precision floats
to make a double-precision float, and the IBM System/360 Model 85
started using two double-precision floats to make an extended
precision float... I've defined how the 256-bit internal format floats
can be doubled up to make a 512-bit float.

I'm not really sure such floating-point precision is useful, but I do remember some people telling me that higher float precision is indeed something to be desired. Well, the IEEE 754 standard has forced my
hand.

YES, I'd use something similar:
I never cared about or supported any odd 10-byte formats, and I don't give
a fart about all these weird IEEE standards.

My OS and its calculator support only 2^N data, starting with 4-bit
nibbles up to a 512-bit mantissa, both signed and unsigned, and optionally
extended by a 2^N or an x^10 valued exponent (nibble-based sizes).

I finally (1998) made all numeric variable types QUAD-based (4,8,12,...).
This made all calculation and input/output routines short/fast and gives
my clients the option to define their own formats (i.e. 12+4, 28+4,...).

I can also use BCD variables, but they belong to text here and are
converted to binary on the fly when entered in calculations.
__
wolfgang

From John Savard@21:1/5 to [email protected] on Sun May 12 11:35:15 2024

On Sun, 12 May 2024 13:46:28 -0000 (UTC), Thomas Koenig
<[email protected]> wrote:

John Savard <[email protected]d> wrote:

I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.

Why not the IEEE binary256 (interchange) format?

https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

Oh, drat. I had not realized that they had also defined this.

Now that means I need to make the exponent field in my internal
format larger, define a 512-bit floating point number which is in the
internal format, so that it can be unnormalized, and a 1024-bit
doubled-up float... instead of what I just did!

The enlarged exponent field won't make the internal form of the
128-bit float go over 160 bits, so register allocation for that won't
change... but now I will have to figure out a scheme of register
allocation applicable to the 256-bit floats!

I am not amused.

John Savard

From John Savard@21:1/5 to [email protected] on Sun May 12 12:15:10 2024

On Sun, 12 May 2024 11:35:15 -0600, John Savard
<[email protected]d> wrote:

Now that means I need to make the exponent field in my internal
format larger, define a 512-bit floating point number which is in the
internal format, so that it can be unnormalized, and a 1024-bit
doubled-up float... instead of what I just did!

The update has been made.

John Savard

From John Dallman@21:1/5 to John Savard on Sun May 12 20:34:00 2024

In article <[email protected]>, [email protected]d (John Savard) wrote:

I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.

I would be in favour of 128-bit being available. I'm not sure my field
has need for 256- or 512-bit, but that doesn't mean that nobody has.

John

From MitchAlsup1@21:1/5 to John Savard on Sun May 12 20:12:22 2024

John Savard wrote:

On Sun, 12 May 2024 13:46:28 -0000 (UTC), Thomas Koenig <[email protected]> wrote:

John Savard <[email protected]d> wrote:

I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.

Why not the IEEE binary256 (interchange) format?

https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format

Oh, drat. I had not realized that they had also defined this.

Now that means I need to make the exponent field in my internal
format larger, define a 512-bit floating point number which is in the internal format, so that it can be unnormalized, and a 1024-bit
doubled-up float... instead of what I just did!

The enlarged exponent field won't make the internal form of the
128-bit float go over 160 bits, so register allocation for that won't
change... but now I will have to figure out a scheme of register
allocation applicable to the 256-bit floats!

I am not amused.

Question:: why are you all so gung-ho on having a format without a hidden bit?
It is trivially easy to reconstruct::

h = expon != 0;

This takes little time, or even gates, and is something you HAVE to do anyway.
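
As an illustration of that one-line reconstruction, here is a minimal Python sketch that unpacks an IEEE binary64 value and restores the hidden bit. The function name `decode_binary64` is hypothetical, and binary64 is used only because Python exposes it directly; the same test applies to wider formats with a hidden bit.

```python
import struct

def decode_binary64(x: float):
    """Split a binary64 value into sign, biased exponent and 53-bit significand."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    expon = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    # The hidden bit is 1 for normal numbers and 0 for zero and subnormals.
    # (For expon == 0x7FF the value is an infinity or NaN and the
    # significand is not a magnitude at all.)
    h = int(expon != 0)
    significand = (h << 52) | fraction
    return sign, expon, significand
```

For example, `decode_binary64(1.0)` yields a significand with only bit 52 set, while the smallest subnormal yields a significand of 1 with the hidden bit clear.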

From Thomas Koenig@21:1/5 to John Dallman on Sun May 12 20:55:03 2024

John Dallman <[email protected]> wrote:

In article <[email protected]>, [email protected]d (John Savard) wrote:

I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.

I would be in favour of 128-bit being available.

Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of precision
bits, and double precision can sometimes run into trouble.
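
A small Python sketch of how numerical differentiation consumes precision. It assumes a plain forward difference on `math.exp`; the helper name and step sizes are illustrative choices, not taken from the thread.

```python
import math

def forward_difference(f, x, h):
    """First-order forward-difference estimate of f'(x)."""
    return (f(x + h) - f(x)) / h

exact = math.exp(1.0)   # d/dx exp(x) at x = 1 is exp(1)

# A step near sqrt(machine epsilon) balances truncation and rounding error,
# yet still leaves only about half of the 16 decimal digits.
err_moderate_h = abs(forward_difference(math.exp, 1.0, 1e-8) - exact)

# A much smaller step lets cancellation in f(x + h) - f(x) dominate.
err_tiny_h = abs(forward_difference(math.exp, 1.0, 1e-15) - exact)
```

With binary64 the best case keeps roughly eight good digits, which is why a wider format helps when Jacobians are built this way.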

At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.

I'm not sure my field
has need for 256- or 512-bit, but that doesn't mean that nobody has.

I've finally found the time to play around with Julia in the last
few weeks. One of the nice things it does is that you can just use
the same packages with different numerical types, for example for
ODE integration. Just set up the problem as you would normally
and supply a starting vector with a different precision.

So, for doing some experiments on numerical data types, Julia
is quite nice.
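
The same idea can be sketched in Python rather than Julia (a deliberate substitution for illustration): one explicit Euler routine whose arithmetic follows the numeric type of the starting value. The function `euler` and the test problem dy/dt = -y are assumptions of this sketch, not any package mentioned above.

```python
from decimal import Decimal, getcontext
from fractions import Fraction

def euler(f, y0, h, steps):
    """Explicit Euler; the arithmetic follows the numeric type of y0 and h."""
    y = y0
    for _ in range(steps):
        y = y + h * f(y)
    return y

def decay(y):
    """Right-hand side of dy/dt = -y."""
    return -y

# Same problem, three numeric types, chosen only by the starting value.
as_float = euler(decay, 1.0, 0.1, 10)
as_exact = euler(decay, Fraction(1), Fraction(1, 10), 10)

getcontext().prec = 50                     # 50 significant decimal digits
as_decimal = euler(decay, Decimal(1), Decimal("0.1"), 10)
```

Each call computes (9/10)^10; only the rounding behaviour differs between the three results.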

From John Savard@21:1/5 to All on Sun May 12 18:00:39 2024

On Sun, 12 May 2024 20:12:22 +0000, [email protected] (MitchAlsup1)
wrote:

Question:: why are you all so gung-ho on having a format without a hidden bit?
It is trivially easy to reconstruct::

h = expon != 0;

This takes little time, or even gates, and is something you HAVE to do anyway.

The explanation here is, I am afraid, probably once again ignorance on
my part. I knew that processing the hidden bit would take _some_ time
and effort, since, after all, in the early days computers didn't use
formats that had one.

So I assumed that converting to an internal format without a hidden
bit - even the 8087 did that - would yield a significant speed up.
(And, as noted, I'm following Seymour Cray in sacrificing everything
for speed, so even if the speedup is small, I am inclined to chase
it.)

John Savard

From Michael S@21:1/5 to Thomas Koenig on Mon May 13 15:16:47 2024

On Sun, 12 May 2024 20:55:03 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

John Dallman <[email protected]> wrote:

In article <[email protected]>, [email protected]d (John Savard) wrote:

I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.

I would be in favour of 128-bit being available.

Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of precision
bits, and double precision can sometimes run into trouble.

At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.

Much slower?
I think, at least for matrix multiplication, my emulation on modern x86
was within a factor of 1.5x from your measurements on POWER9. And that
despite a rather poorly chosen ABI for support routines. With a better ABI
(pure integer, with no copies from/to XMM slowing things down, esp. on
Zen3) I would expect it to be a wash.
With a slightly higher-level API (qaxpy instead of individual mul/add),
software can actually pull ahead.

I'm not sure my field
has need for 256- or 512-bit, but that doesn't mean that nobody
has.

I've finally found the time to play around with Julia in the last
few weeks. One of the nice things it does is that you can just use
the same packages with different numerical types, for example for
ODE integration. Just set up the problem as you would normally
and supply a starting vector with a different precision.

So, for doing some experiments on numerical data types, Julia
is quite nice.

It's a pity that something like that is not available in GNU Octave.

From Thomas Koenig@21:1/5 to Michael S on Mon May 13 19:01:37 2024

Michael S <[email protected]> wrote:

On Sun, 12 May 2024 20:55:03 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

John Dallman <[email protected]> wrote:

In article <[email protected]>,
[email protected]d (John Savard) wrote:

I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.

I would be in favour of 128-bit being available.

Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of precision
bits, and double precision can sometimes run into trouble.

At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.

Much slower?
I think, at least for matrix multiplication, my emulation on modern x86
was within a factor of 1.5x from your measurements on POWER9.

I don't remember the exact timing, and it might be interesting to
revisit that (also considering that the gfortran code for matmul is
not optimized for 128-bit float and might have blown cache sizes,
plus it would be fair to compare compiler vs. compiler and assembler
vs. assembler).

I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.

What can your code get on x86_64?

From MitchAlsup1@21:1/5 to BGB on Mon May 13 21:16:48 2024

BGB wrote:

Emulation via traps is very slow, but typical for many ISAs is to just
quietly turn the soft-float operations into runtime calls.

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.

On an x86 this would be at least 200 cycles just getting there and back.

So, to revisit your statement::

Emulation is slow when trap overhead is large and not-slow when trap overhead is small.

From MitchAlsup1@21:1/5 to BGB on Mon May 13 23:25:25 2024

BGB wrote:

On 5/13/2024 4:16 PM, MitchAlsup1 wrote:

BGB wrote:

Emulation via traps is very slow, but typical for many ISAs is to
just quietly turn the soft-float operations into runtime calls.

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.

On an x86 this would be at least 200 cycles just getting there and back.

I guess there are different possibilities here...

Trap cost can be reduced, say, by having banked registers.
But, not so good with explicit save/restore and a large register file.

For example, I can note that a MSP430 at 16MHz can service a 32kHz
timer... (with a budget of 488 cycles per interrupt).

But, my BJX2 core (at 50MHz) would have a harder time here, with around
a 1.5k cycle budget...
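
The cycle budgets quoted above follow from dividing the clock rate by the interrupt rate. A minimal Python sketch, assuming "32kHz" means a 32768 Hz timer (an assumption that reproduces the 488 figure):

```python
def cycle_budget(cpu_hz: float, irq_hz: float) -> float:
    """CPU cycles available between two consecutive timer interrupts."""
    return cpu_hz / irq_hz

TIMER_HZ = 32768                        # assumed meaning of "32kHz"

msp430 = cycle_budget(16e6, TIMER_HZ)   # MSP430 at 16 MHz
bjx2 = cycle_budget(50e6, TIMER_HZ)     # BJX2 core at 50 MHz
```

This gives about 488 cycles per interrupt for the MSP430 and about 1.5k for the 50 MHz core, matching the numbers in the post.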

Then again, it is possible the per-interrupt overhead would go down
slightly, since most likely the ISR stack will still be in the L1 cache
between interrupts (and save/restore overhead should drop to ~ 100
cycles in the absence of L1 misses).

MSP430 had a slight advantage here (besides fewer registers) in that L1
misses are not a thing (so, memory access has constant latency).

So, to revisit your statement::

Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.

Possible, but I would not expect trap overhead to be lower than runtime
call overhead...

Yes, of course, trapping can never be quite as inexpensive as a CALL/RET sequence.

But it does not have to be much larger--just a little bit larger.

Also (in my case):
Debugging is rather annoying when dealing with bugs that
appear/disappear/move around at random or with the slightest
perturbation...

You need better verification--Oh Wait ...

But, given for the most part behavior is consistently buggy (and
manifesting in seemingly the same ways) between both the emulator and
Verilog implementation, this implies the causal factors are in software.

I guess in this case, either I figure it out, or will need to again go
back to cooperative scheduling. Seemingly, using preemptive scheduling
and virtual memory at the same time is particularly unstable (programs
tend to crash on startup or soon after).

Also I may need to rework how page-in/page-out is handled (and/or how IO
is handled in general) since if a page swap needs to happen while IO is
already in progress (such as a page-miss in the system-call process), at
present, the OS is dead in the water (one can't access the SDcard in the
middle of a different access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

From MitchAlsup1@21:1/5 to BGB on Tue May 14 00:43:22 2024

BGB wrote:

On 5/13/2024 6:25 PM, MitchAlsup1 wrote:

Also (in my case):
Debugging is rather annoying when dealing with bugs that
appear/disappear/move around at random or with the slightest
perturbation...

You need better verification--Oh Wait ...

Not sure I understand what you mean by this.

Some of these bugs are behaving very similarly to some bugs I was battling
against a while ago (but never properly debugged; the bug just sort of
seemingly disappeared).

What you need is 1M-1B instructions that test every known corner case
so if a fix breaks something else you will find it almost instantaneously.
{At 50 MHz 1B instructions is 50 seconds of run time.}
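
For a rough feel of the run time, here is a minimal Python sketch. The instructions-per-cycle (IPC) parameter is an assumption of this sketch; the thread does not state one.

```python
def run_time_seconds(instructions: float, clock_hz: float, ipc: float) -> float:
    """Wall-clock time to retire `instructions` at a given clock and IPC."""
    return instructions / (clock_hz * ipc)

# One billion instructions on a 50 MHz core.
at_one_ipc = run_time_seconds(1e9, 50e6, 1.0)    # one instruction per cycle
at_low_ipc = run_time_seconds(1e9, 50e6, 0.4)    # 0.4 instructions per cycle
```

At one instruction per cycle the suite would take 20 seconds; the quoted 50 seconds corresponds to an average of about 0.4 instructions per cycle. Either way it is well under a minute per regression run.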

In the semiconductor industry, we use this test suite in the simulators, emulators, on the test head, and on parts returned from the field. Also,
it is under constant evolution simply so we don't let corner cases let
bugs into sellable parts.

Also I may need to rework how page-in/page-out is handled (and/or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page fault;
it is just handled and goes away.

From Anton Ertl@21:1/5 to [email protected] on Tue May 14 05:35:53 2024

[email protected] (MitchAlsup1) writes:

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.

Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?

On an x86 this would be at least 200 cycles just getting there and back.

Which x86? 8086? 80186? 80286? These (maybe the 8088 and V20, too)
are the only implementations that deserve to be called x86. If you
mean some IA-32 or AMD64 implementations, which ones?

Anyway, let's see how this works for the U74 (a RISC-V implementation
which apparently uses trapping for unaligned loads); here we have a
10M iteration loop with a payload that performs one load per
iteration:

[fedora-starfive:~/nfstmp/gforth-riscv:104544] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye'

Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye':

223805151 instructions:u # 0.70 insn per cycle
318131306 cycles:u

0.352533487 seconds time elapsed

0.257061000 seconds user
0.064265000 seconds sys

[fedora-starfive:~/nfstmp/gforth-riscv:104545] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye'

Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye':

5329494415 instructions:u # 0.75 insn per cycle
7149481783 cycles:u

7.183239751 seconds time elapsed

7.082298000 seconds user
0.070121000 seconds sys

So the unaligned access handling results in 511 additional instructions
per load compared to an aligned access (so it obviously does the
handling using some kind of trapping). Each unaligned access results
in 683 additional cycles.
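
The per-load figures come straight from the two perf runs above. A short Python sketch of the arithmetic, using only the counter values shown:

```python
ITERATIONS = 10_000_000   # loop count in both gforth runs

aligned = {"instructions": 223_805_151, "cycles": 318_131_306}
unaligned = {"instructions": 5_329_494_415, "cycles": 7_149_481_783}

# Extra work per unaligned load = difference between the runs / iterations.
extra_insns_per_load = (unaligned["instructions"] - aligned["instructions"]) / ITERATIONS
extra_cycles_per_load = (unaligned["cycles"] - aligned["cycles"]) / ITERATIONS
```

Rounded, this gives 511 extra instructions and 683 extra cycles per unaligned load.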

So better use the unspecified MIPS, right? However, if the
unspecified MIPS is an R2000, 19 cycles on a 12.5MHz R2000 cost
1.52us, whereas 683 cycles on a 1000MHz U74 cost 0.683us (and I have
heard that in the Visionfive V2 the U74 runs at 1500MHz).
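
The wall-clock comparison is cycles divided by clock frequency. A minimal Python sketch using the clock rates named above:

```python
def cycles_to_microseconds(cycles: float, clock_mhz: float) -> float:
    """Convert a cycle count to microseconds at the given clock rate in MHz."""
    return cycles / clock_mhz

r2000_us = cycles_to_microseconds(19, 12.5)        # 19 cycles on a 12.5 MHz R2000
u74_1000_us = cycles_to_microseconds(683, 1000)    # 683 cycles on a 1000 MHz U74
u74_1500_us = cycles_to_microseconds(683, 1500)    # same trap at 1500 MHz
```

So the 19-cycle handler costs 1.52 us, while the 683-cycle trap costs 0.683 us at 1000 MHz and about 0.455 us at 1500 MHz.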

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

From Scott Lurndal@21:1/5 to [email protected] on Tue May 14 13:49:09 2024

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it also
abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page fault;
it is just handled and goes away.

There are two levels of page faults - at the guest level, the
guest handles everything. When the hypervisor supports
multiplexing multiple guests on a core, it will only handle second
level translation table faults.

From Michael S@21:1/5 to Scott Lurndal on Tue May 14 17:51:15 2024

On Tue, 14 May 2024 13:49:09 GMT
[email protected] (Scott Lurndal) wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to
happen while IO is already in progress (such as a page-miss in
the system-call process), at present, the OS is dead in the
water (one can't access the SDcard in the middle of a different
access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page
faults of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions
to-or-from the guest OS directly. With the PCIe PRI (Page Request Interface), the guest DMA target pages don't need to be pinned by the hypervisor; the I/O MMU will interrupt the hypervisor to make the
page present and pin it and the hardware will then do the DMA.

Sounds like it could be problematic from a real-time perspective.
When I design PCIe devices, I sometimes have a device-side FIFO
sufficient for 2-5 times the expected worst-case PCIe latency, i.e.
for 7-10 usec or so. In the scenario you describe, it could easily overflow
for an acquisition-type device or underflow for a player-type device.
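
The sizing logic can be sketched in Python. The data rate and the page-in stall below are hypothetical values chosen for illustration; only the 10 usec FIFO depth comes from the post.

```python
def fifo_bytes(rate_bytes_per_s: int, depth_us: int) -> int:
    """Bytes a FIFO must hold to cover depth_us microseconds of traffic."""
    return rate_bytes_per_s * depth_us // 1_000_000

def fifo_survives(depth_us: int, stall_us: int) -> bool:
    """True if a transfer stall of stall_us fits within the FIFO depth."""
    return stall_us <= depth_us

RATE = 1_000_000_000        # hypothetical 1 GB/s acquisition stream
DEPTH_US = 10               # upper end of the 7-10 usec range above
PAGE_IN_STALL_US = 1_000    # hypothetical stall while a page is made present

size = fifo_bytes(RATE, DEPTH_US)
ok_normal = fifo_survives(DEPTH_US, 3)                  # ordinary PCIe latency
ok_page_in = fifo_survives(DEPTH_US, PAGE_IN_STALL_US)  # stall during page-in
```

Under these assumptions a 10 KB FIFO rides out ordinary latency but not a stall two orders of magnitude longer.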

Now, my devices are not intended to be plugged into a virtualized server,
but I'd think that I am not the only designer that chooses the size of FIFOs
by that sort of logic.

So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page
fault; it is just handled and goes away.

There are two levels of page faults - at the guest level, the
guest handles everything. When the hypervisor supports
multiplexing multiple guests on a core, it will only handle second
level translation table faults.

From Scott Lurndal@21:1/5 to Michael S on Tue May 14 15:57:59 2024

Michael S <[email protected]> writes:

On Tue, 14 May 2024 13:49:09 GMT
[email protected] (Scott Lurndal) wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to
happen while IO is already in progress (such as a page-miss in
the system-call process), at present, the OS is dead in the
water (one can't access the SDcard in the middle of a different
access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page
faults of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions
to-or-from the guest OS directly. With the PCIe PRI (Page Request
Interface), the guest DMA target pages don't need to be pinned by the
hypervisor; the I/O MMU will interrupt the hypervisor to make the
page present and pin it and the hardware will then do the DMA.

Sounds like it could be problematic from a real-time perspective.

Not really. The device presents 'virtual functions' to the
guest. The physical function (owned by the hypervisor) can
assign adapter resources (rings, queues, CAMs, interrupt vectors)
to the virtual function which are then 'owned' by the guest operating
system.

When I design PCIe devices, I sometimes have a device-side FIFO
sufficient for 2-5 times the expected worst-case PCIe latency, i.e.
for 7-10 usec or so. In the scenario you describe, it could easily overflow
for an acquisition-type device or underflow for a player-type device.

We have SR-IOV devices that support hundreds of virtual functions
each of which can handle packet traffic at line rate across multiple
100Gb network ports.

https://docs.kernel.org/networking/device_drivers/ethernet/marvell/octeontx2.html

Now, my devices are not intended to be plugged into a virtualized server,
but I'd think that I am not the only designer that chooses the size of FIFOs
by that sort of logic.

If you're building an SR-IOV device, you obviously need to build
it to support the required workloads.

From Scott Lurndal@21:1/5 to [email protected] on Tue May 14 16:00:33 2024

[email protected] (MitchAlsup1) writes:

Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.

Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?

R3000 and it was a hash table ~1MB in size.

Which would have been a significant fraction (25%?) of the
total memory available on an R3k based system in 1990.

From MitchAlsup1@21:1/5 to Anton Ertl on Tue May 14 15:19:34 2024

Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.

Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?

R3000 and it was a hash table ~1MB in size.

On an x86 this would be at least 200 cycles just getting there and back.

Which x86? 8086? 80186? 80286? These (maybe the 8088 and V20, too)
are the only implementations that deserve to be called x86. If you
mean some IA-32 or AMD64 implementations, which ones?

Anyway, let's see how this works for the U74 (a RISC-V implementation
which apparently uses trapping for unaligned loads); here we have a
10M iteration loop with a payload that performs one load per
iteration:

[fedora-starfive:~/nfstmp/gforth-riscv:104544] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye'

Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye':

223805151 instructions:u # 0.70 insn per cycle
318131306 cycles:u

0.352533487 seconds time elapsed

0.257061000 seconds user
0.064265000 seconds sys

[fedora-starfive:~/nfstmp/gforth-riscv:104545] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye'

Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye':

5329494415 instructions:u # 0.75 insn per cycle
7149481783 cycles:u

7.183239751 seconds time elapsed

7.082298000 seconds user
0.070121000 seconds sys

So the unaligned access handling results in 511 additional instructions
per load compared to an aligned access (so it obviously does the
handling using some kind of trapping). Each unaligned access results
in 683 additional cycles.

Yes, but notice sys time hardly changes, so, RISC-V is performing the misaligned LD in user mode (2 context switches -- likely somewhat light weight).

So better use the unspecified MIPS, right? However, if the
unspecified MIPS is an R2000, 19 cycles on a 12.5MHz R2000 cost
1.52us, whereas 683 cycles on a 1000MHz U74 cost 0.683us (and I have
heard that in the Visionfive V2 the U74 runs at 1500MHz).

Given at least the same cache footprint a 2GHz R3000 would still be
in the 20-cycle range. {{That 19 cycle TLB reload is dependent on
the handler and its table having a footprint in the cache(s).}}

- anton

From MitchAlsup1@21:1/5 to Scott Lurndal on Tue May 14 17:48:49 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

Anton Ertl wrote:

[email protected] (MitchAlsup1) writes:

I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
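
The sequence described here (hash table access, check, TLB install) can be sketched roughly as below. This is only an illustration of the idea, not the actual MIPS handler: the page size, table size, hash function and all names are assumptions of mine.

```python
PAGE_SHIFT = 12           # assumed 4 KiB pages
HASH_BUCKETS = 1 << 16    # assumed table size, for illustration only

# Hypothetical hashed page table: bucket index -> (virtual page, physical frame)
hashed_page_table = {}
tlb = {}                  # stand-in for the hardware TLB


def bucket_of(vpn):
    """Trivial hash: low bits of the virtual page number."""
    return vpn % HASH_BUCKETS


def map_page(vpn, pfn):
    """Record a translation in the hashed table (a collision overwrites)."""
    hashed_page_table[bucket_of(vpn)] = (vpn, pfn)


def tlb_miss_handler(vaddr):
    """Refill the TLB for vaddr; return False if a slower path is needed."""
    vpn = vaddr >> PAGE_SHIFT
    entry = hashed_page_table.get(bucket_of(vpn))    # hash table access
    if entry is None or entry[0] != vpn:             # the check
        return False                                 # slow path, not modelled
    tlb[vpn] = entry[1]                              # TLB install
    return True


def translate(vaddr):
    """Translate a virtual address, refilling the TLB on a miss."""
    vpn = vaddr >> PAGE_SHIFT
    if vpn not in tlb and not tlb_miss_handler(vaddr):
        raise LookupError("page fault")
    return (tlb[vpn] << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))


map_page(0x12345, 0xABC)
physical = translate(0x12345678)
print(hex(physical))
```

The fast path is one table access and one compare, which is why the cost is dominated by getting into and out of the handler and by whether the table entry is already in the cache.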

Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?

R3000 and it was a hash table ~1MB in size.

Which would have been a significant fraction (25%?) of the
total memory available on an R3k-based system in 1990.

I heard numbers in the 10% range, so we are within a factor of 2 in
our memory. The smaller main memory was, the less can be co-resident
and the smaller the effective table size.

{{But my MIPS info is mostly 3rd hand}}

From Michael S@21:1/5 to Thomas Koenig on Tue May 14 22:19:25 2024

On Mon, 13 May 2024 19:01:37 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

On Sun, 12 May 2024 20:55:03 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

John Dallman <[email protected]> schrieb:

In article <[email protected]>,
[email protected]d (John Savard) wrote:

I'm not really sure such floating-point precision is useful, but
I do remember some people telling me that higher float
precision is indeed something to be desired.

I would be in favour of 128-bit being available.

Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of
precision bits, and double precision can sometimes run into
trouble.

At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.

Much slower?
I think, at least for matrix multiplication, my emulation on modern
x86 was within a factor of 1.5 of your measurements on POWER9.

I don't remember the exact timing, and it might be interesting to
revisit that (also considering that the

IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix multiplication benchmark running on a single POWER9 core.

I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3 with the
level of support for FP exceptions and rounding modes that, according
to you, is sufficient for Fortran, but according to other gnu
maintainers is insufficient for C. For matrix multiplication
implemented with vector APIs ('multiply vector by scalar' and 'add
vectors') on the same EPYC3 I got approximately 200 MFLOPS.

gfortran code for matmul is
not optimized for 128-bit float and might have blown cache sizes,

That's possible, but unlikely to make a major impact.
At 200 MFLOPS even L3 cache is not a bottleneck. And it's actually hard
to code matrix multiplication so poorly that at least half of data
wouldn't come from L1D/L2. I took a look at GFortran sources for matmul
- they are not that bad.

plus it would be fair to compare compiler vs. compiler and assembler
vs. assembler).

My routines are implemented in 'C' and compiled with gcc

I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.

So, a bottleneck is somewhere else. Maybe multiplication?

What can your code get on x86_64?

See above.

From Michael S@21:1/5 to Scott Lurndal on Wed May 15 12:04:54 2024

On Tue, 14 May 2024 15:57:59 GMT
[email protected] (Scott Lurndal) wrote:

If you're building an SR-IOV device, you obviously need to build
it to support the required workloads.

I am building a PCIe device+driver that is unaware of SR-IOV.
I think that Mitch was operating under the same assumption in the post
to which you responded.
When both my device and my driver are aware of the presence of additional
layers of hardware, and especially of software, between them, then it
could be made to work. But in that case virtualization is no longer
transparent, although the non-transparency is of a different variety
from that of paravirtualization.

From Michael S@21:1/5 to wolfgang kern on Wed May 15 12:07:13 2024

On Sun, 12 May 2024 15:30:40 +0200
wolfgang kern <[email protected]> wrote:

On 12/05/2024 05:44, John Savard wrote:

I've made another long-overdue change in the Concertina II
architecture on the page about 17-bit instructions.

Since I describe the individual instructions there, with their
opcodes and what they do, I've illustrated the floating-point
formats of the architecture on that page.

The good people in charge of the IEEE 754 standard had seen fit to
define a standard 128-bit floating-point format which included a
hidden first bit.

This annoyed me greatly, because I was going to take the 8087's
temporary real format, and extend the mantissa for my 128-bit
format.

I've decided that it's necessary to fully accept the 128-bit
standard and support it in a consistent manner.

Therefore, I have taken the following actions:

I have dropped the option of supporting 80-bit temporary reals
entirely, as they are now incompatible as an internal format.

I have instead defined a 256-bit format for floats which does not
have a hidden first bit, which looks like the old temporary reals,
except that the exponent field is one bit wider.

And in addition, just as the IBM 704 used two single-precision
floats to make a double-precision float, and the IBM System/360
Model 85 started using two double-precision floats to make an
extended precision float... I've defined how the 256-bit internal
format floats can be doubled up to make a 512-bit float.

I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired. Well, the IEEE 754 standard has
forced my hand.

YES, I'd use something similar:
I never cared nor supported any odd 10 byte formats and I give a fart
to all these weird IEEE standards.

I suppose, it's mutual.

From Scott Lurndal@21:1/5 to Michael S on Wed May 15 13:29:48 2024

Michael S <[email protected]> writes:

On Tue, 14 May 2024 15:57:59 GMT
[email protected] (Scott Lurndal) wrote:

If you're building an SR-IOV device, you obviously need to build
it to support the required workloads.

I am building a PCIe device+driver that is unaware of SR-IOV.

If you expect your device to be used by virtualized (guest)
operating systems, then I would strongly recommend that you
support the SR-IOV capability.

I think that Mitch was operating under the same assumption in the post
to which you responded.

I'll allow Mitch to speak for himself.

Note that pretty much every server-grade network
controller (NIC) supports SR-IOV.

From Thomas Koenig@21:1/5 to Michael S on Wed May 15 20:08:27 2024

Michael S <[email protected]> schrieb:

IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix multiplication benchmark running on a single POWER9 core.

Just reran the tests, it gave me somewhere around 405-410 MFlops
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.

I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3

Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four. Not too bad, actually.

[..]

I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.

So, a bottleneck is somewhere else. Maybe multiplication?

I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle, a big, fat instruction,
obviously. On POWER10, this actually got worse, with performance
dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
apparently somebody didn't think it was all that important :-(

From Michael S@21:1/5 to Thomas Koenig on Thu May 16 00:16:28 2024

On Wed, 15 May 2024 20:08:27 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

Michael S <[email protected]> schrieb:

IIRC, you reported something like 200 (or 300?) MFLOPS for your
matrix multiplication benchmark running on a single POWER9 core.

Not too bad. Not too good, either.

Just reran the tests, it gave me somewhere around 405-410 MFlops
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.

I don't think that nowadays /proc/cpuinfo has any relationship to
actual frequency. Most likely with a single core active even the
cheapest POWER9 SKU runs at 3.8 GHz.
If there is no ready-made utility, you can measure it yourself with a latency-bound loop. Just don't forget that on POWER9 all simple integer
opcodes have latency=2.
If there are any difficulties, I can help.

I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3

Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four. Not too bad, actually.

If my guess about frequency is correct, then more like a factor of 2.6.
Of which, factor of approximately 1.3 has to be attributed to bad
libgcc ABI.
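
Both ratios are just MFLOPS per GHz on each side. A sketch of the calculation, taking the midpoint of the reported 405-410 range (the midpoint and the names are my choices; 3.8 GHz is the guess above, not a measurement):

```python
def mflops_per_ghz(mflops, ghz):
    """Throughput normalised to clock frequency."""
    return mflops / ghz


power9_mflops = (405 + 410) / 2          # midpoint of the reported range
epyc3_per_ghz = mflops_per_ghz(150, 3.6)

# Using the frequency /proc/cpuinfo reports
ratio_cpuinfo = mflops_per_ghz(power9_mflops, 2.2) / epyc3_per_ghz
# Using the guessed single-core frequency
ratio_guess = mflops_per_ghz(power9_mflops, 3.8) / epyc3_per_ghz

print(f"POWER9 at 2.2 GHz: {ratio_cpuinfo:.1f}x per clock")
print(f"POWER9 at 3.8 GHz: {ratio_guess:.1f}x per clock")
```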
[O.T.]
BTW, on ARM64 the libgcc ABI for __multf3/__addtf3 is similarly bad. The
only decent ABI for __multf3/__addtf3 that I encountered experimenting
on godbolt was for RV64. But that is little consolation considering the
huge performance gap between the best RV64 and not even the best, but
just a competent AMD64 or ARM64.
[/O.T.]

Anyway, performance per clock is of limited interest. What matters is
absolute performance (sometimes throughput, sometimes latency) and
performance per watt.
I would guess that, using SMT4, POWER9 can get over 80% of theoretical
throughput, but getting there would take multiplying either a really big
matrix or lots of medium ones.
On EPYC3, on the other hand, I don't expect measurable SMT gain. But
relative to POWER9, EPYC3 has more cores and much lower power
consumption per core.
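
For a rough idea of what "theoretical throughput" means here, one can start from the xsmaddqp throughput quoted in this thread (1/13 per cycle on POWER9). The sketch below rests on assumptions of mine: one fused multiply-add counts as 2 flops, the matmul inner loop is one such instruction per element, and nothing else limits the core. The names are made up.

```python
FLOPS_PER_FMA = 2            # assumption: multiply + add counted separately
POWER9_FMA_PER_CYCLE = 1 / 13


def peak_mflops(ghz, fma_per_cycle=POWER9_FMA_PER_CYCLE):
    """Upper bound in MFLOPS for a loop limited only by FMA throughput."""
    return ghz * 1000 * fma_per_cycle * FLOPS_PER_FMA


measured = (405 + 410) / 2   # midpoint of the reported range

peak_at_2_2 = peak_mflops(2.2)
peak_at_3_8 = peak_mflops(3.8)
fraction_at_3_8 = measured / peak_at_3_8

print(f"peak at 2.2 GHz: {peak_at_2_2:.0f} MFLOPS")
print(f"peak at 3.8 GHz: {peak_at_3_8:.0f} MFLOPS")
print(f"measured / peak at 3.8 GHz: {fraction_at_3_8:.0%}")
```

Under these assumptions the measured figure exceeds the bound for 2.2 GHz, which would fit the suspicion that the core was actually running faster than /proc/cpuinfo says.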

[..]

I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.

So, a bottleneck is somewhere else. Maybe multiplication?

I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle, a big, fat instruction,
obviously. On POWER10, this actually got worse, with performance
dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
apparently somebody didn't think it was all that important :-(

Sounds like that.
Hopefully it's compensated by better power efficiency. And
unfortunately it's aggravated by lower cost-effectiveness. Or, at least,
that's what was claimed by a poster (luke.l ?) here.

From MitchAlsup1@21:1/5 to Scott Lurndal on Thu May 16 21:22:39 2024

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it also
abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.

I guess I should have anticipated this. Sorry !!

So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page fault;
it is just handled and goes away.
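
A toy model of that flow, as I understand it from the description above: the guest launches the DMA without checking residency, and the page is made present and pinned on demand. This deliberately ignores the real PCIe PRI/ATS message formats; every class and method name here is hypothetical.

```python
class Hypervisor:
    """Owns the second-level mapping; the guest never sees these faults."""

    def __init__(self):
        self.present = set()   # guest pages currently backed by memory
        self.pinned = set()    # pages that must stay resident during DMA
        self.faults = 0

    def page_request(self, page):
        """Make the page present and pin it (the step PRI enables)."""
        self.faults += 1
        self.present.add(page)
        self.pinned.add(page)


class IOMMU:
    def __init__(self, hypervisor):
        self.hv = hypervisor

    def dma_write(self, page, memory, data):
        if page not in self.hv.present:
            self.hv.page_request(page)   # interrupt the hypervisor
        memory[page] = data              # the hardware then does the DMA


hv = Hypervisor()
iommu = IOMMU(hv)
guest_memory = {}

# GuestOS just launches the requests; it assumes its pages are present.
iommu.dma_write(7, guest_memory, b"block")
iommu.dma_write(7, guest_memory, b"again")

print(hv.faults, sorted(hv.pinned))
```

Only the first access to page 7 costs a hypervisor fault; the second finds the page already present and pinned.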

There are two levels of page faults - at the guest level, the
guest handles everything. When the hypervisor supports
multiplexing multiple guests on a core, it will only handle second
level translation table faults.

From Scott Lurndal@21:1/5 to [email protected] on Thu May 16 22:07:41 2024

[email protected] (MitchAlsup1) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it also
abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the I/O device
hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.

I guess I should have anticipated this. Sorry !!

Add in MR-IOV and CXL and both I/O and memory can be shared
cluster-wide.

From Scott Lurndal@21:1/5 to EricP on Thu May 16 23:10:30 2024

EricP <[email protected]> writes:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to happen
while IO is already in progress (such as a page-miss in the
system-call process), at present, the OS is dead in the water (one
can't access the SDcard in the middle of a different access to the
SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.

I guess I should have anticipated this. Sorry !!

The reason OSes pin the pages before the IO starts is so there is no
latency reading in from a device, which then has to buffer the input.
An HDD seek averages about 9 ms; add 3 ms for the page fault code.
A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, i.e. 120 kB in 12 ms.

100Gb/s Ethernet is 10GB/s (20GB/s full duplex). A modern SoC may support multiple 100Gb controllers.

Granted, for low latency data, the OS and hypervisor will
cooperate to ensure that the pages are marked present in the IOMMU
translation mapping tables before the I/O is initiated; PRI is there for
use cases where the latency to make a page present isn't critical.
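
To put the two link speeds side by side, here is the same back-of-envelope calculation for both, using the 9 ms + 3 ms stall and the 10-bits-per-byte rule of thumb from the figures above (the function name and its parameters are made up for this sketch):

```python
STALL_MS = 9 + 3   # HDD seek plus page fault code, as quoted above


def bytes_during_stall(link_bits_per_s, stall_ms=STALL_MS, bits_per_byte=10):
    """Bytes that arrive while the target page is being faulted in."""
    bytes_per_s = link_bits_per_s / bits_per_byte
    return bytes_per_s * stall_ms / 1000


fast_ethernet = bytes_during_stall(100e6)    # 100 Mb/s
hundred_gig = bytes_during_stall(100e9)      # 100 Gb/s

print(f"100 Mb/s: {fast_ethernet / 1e3:.0f} kB arrive in {STALL_MS} ms")
print(f"100 Gb/s: {hundred_gig / 1e6:.0f} MB arrive in {STALL_MS} ms")
```

So a stall that costs 120 kB of buffering at 100 Mb/s costs on the order of 120 MB per controller at 100 Gb/s.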

From EricP@21:1/5 to All on Thu May 16 18:30:36 2024

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to happen
while IO is already in progress (such as a page-miss in the
system-call process), at present, the OS is dead in the water (one
can't access the SDcard in the middle of a different access to the
SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.

I guess I should have anticipated this. Sorry !!

The reason OSes pin the pages before the IO starts is so there is no
latency reading in from a device, which then has to buffer the input.
An HDD seek averages about 9 ms; add 3 ms for the page fault code.
A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, i.e. 120 kB in 12 ms.

What would likely happen is the Ethernet card buffer would fill up
then it starts tossing packets, while it waits for HV to page fault
the receive buffer in from its page file. Later when the guest OS
buffer has faulted in and the card's buffer is emptied, the network
software will eventually NAK all the tossed packets and they get resent.
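
A crude simulation of that overflow, stepping one millisecond at a time. The 10 kB/ms rate and 12 ms stall are the figures above; the 64 kB card buffer is purely an assumed number for illustration, and the function name is hypothetical.

```python
def simulate_stall(rate_bytes_per_ms, buffer_bytes, stall_ms):
    """Return (buffered, dropped) bytes while the host cannot drain the card."""
    buffered = 0
    dropped = 0
    for _ in range(stall_ms):
        room = buffer_bytes - buffered
        accepted = min(room, rate_bytes_per_ms)
        buffered += accepted
        dropped += rate_bytes_per_ms - accepted   # tossed once the buffer is full
    return buffered, dropped


# 10 kB/ms line rate, assumed 64 kB on-card buffer, 12 ms page-fault stall
buffered, dropped = simulate_stall(10_000, 64_000, 12)
print(f"buffered: {buffered} bytes, dropped: {dropped} bytes")
```

With these numbers a bit under half of the 120 kB that arrives during the stall has to be resent by the sender.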

So there is a stutter every time the HV recycles that guest OS memory
that requires retransmissions to fix. And this is basically using the
sender's memory to buffer the transmission while this HV page faults.

Note there are devices, like A to D converters which cannot fix the
tossed data by asking for a retransmission. Or devices like tape drives
which can rewind and reread but are verrry slow about it.

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

From MitchAlsup1@21:1/5 to EricP on Thu May 16 23:10:48 2024

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to happen
while IO is already in progress (such as a page-miss in the
system-call process), at present, the OS is dead in the water (one
can't access the SDcard in the middle of a different access to the
SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.

I guess I should have anticipated this. Sorry !!

The reason OSes pin the pages before the IO starts is so there is no
latency reading in from a device, which then has to buffer the input.
An HDD seek averages about 9 ms; add 3 ms for the page fault code.
A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, i.e. 120 kB in 12 ms.

What would likely happen is the Ethernet card buffer would fill up
then it starts tossing packets, while it waits for HV to page fault
the receive buffer in from its page file. Later when the guest OS
buffer has faulted in and the card's buffer is emptied, the network
software will eventually NAK all the tossed packets and they get resent.

So there is a stutter every time the HV recycles that guest OS memory
that requires retransmissions to fix. And this is basically using the
sender's memory to buffer the transmission while this HV page faults.

Note there are devices, like A to D converters which cannot fix the
tossed data by asking for a retransmission. Or devices like tape drives
which can rewind and reread but are verrry slow about it.

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

From EricP@21:1/5 to All on Fri May 17 15:26:20 2024

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup1) writes:

BGB wrote:

Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to happen
while IO is already in progress (such as a page-miss in the
system-call process), at present, the OS is dead in the water (one
can't access the SDcard in the middle of a different access to the
SDcard).

Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.

Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.

With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.

Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.

This was something I was not aware of but probably should have
anticipated.

GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.

This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page
virtualization.

I guess I should have anticipated this. Sorry !!

The reason OSes pin the pages before the IO starts is so there is no
latency reading in from a device, which then has to buffer the input.
An HDD seek averages about 9 ms; add 3 ms for the page fault code.
A 100 Mb/s Ethernet can receive 10 MB/s or 10 kB/ms, i.e. 120 kB in 12 ms.

What would likely happen is the Ethernet card buffer would fill up
then it starts tossing packets, while it waits for HV to page fault
the receive buffer in from its page file. Later when the guest OS
buffer has faulted in and the card's buffer is emptied, the network
software will eventually NAK all the tossed packets and they get resent.

So there is a stutter every time the HV recycles that guest OS memory
that requires retransmissions to fix. And this is basically using the
sender's memory to buffer the transmission while this HV page faults.

Note there are devices, like A to D converters which cannot fix the
tossed data by asking for a retransmission. Or devices like tape drives
which can rewind and reread but are verrry slow about it.

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

From MitchAlsup1@21:1/5 to EricP on Fri May 17 21:16:33 2024

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

I don't understand your question.

Most of the user's files are on the local system and SR-IOV works fine, but one
or more of his files exist on a remote machine accessed over the internet;
and the user still uses the SR-IOV interface to access those files.

How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file?

For example, the user opens a file which is an ln -s (a block containing a URL
to where the file is remotely stored) but the user thinks the file is local.

From EricP@21:1/5 to Chris M. Thomasson on Sat May 18 09:40:07 2024

Chris M. Thomasson wrote:

On 5/17/2024 12:26 PM, EricP wrote:

MitchAlsup1 wrote:

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

For some reason this made me think about getting a blue screen of death
due to too much non-paged memory being used by too many concurrent
overlapped IO's on Windows.

That shouldn't happen as Windows tracks each process's non-paged pool allocations and quotas and it should return an error when exceeded,
though I've never stress tested it.


From Scott Lurndal@21:1/5 to [email protected] on Sat May 18 14:25:16 2024

[email protected] (MitchAlsup1) writes:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

I don't understand your question.

Most of the user's files are on the local system and SR-IOV works fine, but one
or more of his files exist on a remote machine accessed over the internet,
and the user still uses the SR-IOV interface to access those files.

SR-IOV is a feature of PCI Express devices.

How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file.

So which PCI Express device is being used to access the device(s) that
contain the file system that manages the file, which contains
the data blocks?

If it is a NVMe device, then the virtual functions are providing
access to portions of the NVram on the device (or behind the
device) as if it were a physical device. If it is a SATA
device, then the virtual function may be providing access to
a complete unit, or a partition on a shared unit.

For example, user opens a file which is an ln-s (a block containing a URL
to where the file is remotely stored) but user thinks file is local.

That's all handled by the file system code in the operating system. It's
only when a particular data block is required that an SR-IOV virtual
function may be used. The file system could easily be combining multiple
underlying devices (VFs) into a single filesystem (e.g. using the volume
management facilities of the operating system) such that some blocks
in the file system are on a SAN (Fibre Channel), some are on a LAN
(CIFS/NFS) and some are on a locally hosted SATA or NVMe device.
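
As a rough sketch of the volume-management idea above (not any particular operating system's implementation; `Extent`, `ConcatVolume` and the device names are made up for illustration), mapping a logical block number to a backing device could look like this:

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    device: str   # hypothetical device name, e.g. an NVMe VF or a SAN LUN
    blocks: int   # number of blocks this device contributes


class ConcatVolume:
    """Concatenate several block devices into one logical block space."""

    def __init__(self, extents):
        self.extents = list(extents)
        # starts[i] is the first logical block served by extents[i].
        self.starts = []
        total = 0
        for ext in self.extents:
            self.starts.append(total)
            total += ext.blocks
        self.total_blocks = total

    def locate(self, logical_block):
        """Return (device, device-local block) for a logical block number."""
        if not 0 <= logical_block < self.total_blocks:
            raise IndexError("logical block out of range")
        i = bisect_right(self.starts, logical_block) - 1
        return self.extents[i].device, logical_block - self.starts[i]


# Assumed layout: three backing devices behind one filesystem.
volume = ConcatVolume([
    Extent("nvme-vf0", 1000),   # local NVMe virtual function
    Extent("fc-san0", 5000),    # Fibre Channel SAN LUN
    Extent("nfs-lan0", 2000),   # LAN file server
])
```

The filesystem only ever sees logical block numbers; which device (and whether it is a VF or a PF) serves a block is decided below it.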

For the SAN case, the Fibre Channel adapter will provide VFs that
can be used by the guest OS. Likewise for NVMe. I don't believe
that the AHCI standard for SATA (which is rather obsolete now) supports
SR-IOV, but it would be pretty straightforward to add the SR-IOV
capability to a SATA controller.


From EricP@21:1/5 to All on Sat May 18 11:56:19 2024

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

Most of the user's files are on the local system and SR-IOV works fine, but one
or more of his files exist on a remote machine accessed over the internet,
and the user still uses the SR-IOV interface to access those files.

If by "works fine" you mean is slower and has more overhead than
just pinning the pages first as DMA I/O does now.

(It's more work to initiate the I/O, fail when it attempts to DMA,
interrupt the CPU, run an ISR which queues a DPC which queues an APC back
to the thread, which pins the pages, then restarts the I/O,
than it is to just pin the pages and start an I/O.)

How does the system provide the "file is local" illusion to a user having SR-IOV access to a non-local file.

For example, user opens a file which is an ln-s (a block containing a
URL to where the file is remotely stored) but user thinks file is local.

I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?

That link is traditionally established at file open inside the kernel file
system by cross-linking between a File Control Block (or whatever it's called)
and a Network Control Block representing that file on the network.
Each file syscall request is sent to the FCB then forwarded to the NCB
and out over a network link.

As I understand it, SR-IOV is a pseudo hardware device control *register*, whereas a disk file is a fictional logical device created by the file
system driver. I don't think one could use SR-IOV to send commands to
local file system software (maybe it could trap into the OS).

One could have *disk controller registers* attached by SR-IOV,
but a disk controller is not a file system.

So as I understand the SR-IOV mechanism, one would not be reading
local or remote files over it under any circumstance.
But my understanding is limited to what the Microsoft driver
documentation says about it.


From MitchAlsup1@21:1/5 to EricP on Sat May 18 17:33:19 2024

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

Most of the user's files are on the local system and SR-IOV works fine, but one
or more of his files exist on a remote machine accessed over the internet,
and the user still uses the SR-IOV interface to access those files.

If by "works fine" you mean is slower and has more overhead than
just pinning the pages first as DMA I/O does now.

Yes, copy on write has this problem too when most of the address space
gets copied.

(It's more work to initiate the I/O, fail when it attempts to DMA,
interrupt the CPU, run an ISR which queues a DPC which queues an APC back
to the thread, which pins the pages, then restarts the I/O,
than it is to just pin the pages and start an I/O.)

How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file.

For example, user opens a file which is an ln-s (a block containing a
URL to where the file is remotely stored) but user thinks file is local.

I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?

The other interpretation is that the unprivileged user is never allowed
direct access to an SR-IOV device--those are restricted to GuestOS (or
more privileged hypervisor threads).

That link is traditionally established at file open inside the kernel file
system by cross-linking between a File Control Block (or whatever it's called)
and a Network Control Block representing that file on the network.
Each file syscall request is sent to the FCB then forwarded to the NCB
and out over a network link.

As I understand it, SR-IOV is a pseudo hardware device control *register*, whereas a disk file is a fictional logical device created by the file
system driver. I don't think one could use SR-IOV to send commands to
local file system software (maybe it could trap into the OS).

One could have *disk controller registers* attached by SR-IOV,
but a disk controller is not a file system.

So as I understand the SR-IOV mechanism, one would not be reading
local or remote files over it under any circumstance.
But my understanding is limited to what the Microsoft driver
documentation says about it.


From EricP@21:1/5 to Chris M. Thomasson on Sat May 18 21:48:14 2024

Chris M. Thomasson wrote:

On 5/18/2024 6:40 AM, EricP wrote:

Chris M. Thomasson wrote:

On 5/17/2024 12:26 PM, EricP wrote:

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

For some reason this made me think about getting a blue screen of
death due to too much non-paged memory being used by too many
concurrent overlapped IO's on Windows.

That shouldn't happen as Windows tracks each process's non-paged pool
allocations and quotas and it should return an error when exceeded,
though I've never stress tested it.

I have, wrt NT 4.0 back in the day. It can get to a point where the
system is totally unresponsive. Then, sometimes, dies. A shit load of concurrent overlapped io ops, malloc tends to return NULL, then the
non-paged memory gets really bad...

What I have seen, due to what turned out to be a bug in Windows Defender,
is that while monitoring my internet packets it would leak non-paged pool,
which then grows, consuming more and more free pages while the system
gets slower and slower, until finally it hangs and has to be rebooted.
The solution was to disable Windows Defender, but I only discovered that
by chance (random thrashing about).

I've only seen one blue screen in 30 years of using WinNT.
That was a "Page fault at raised IRQL" in the Microsoft TCP driver
(basically, a page fault occurred inside a driver, a big no-no)
back in the 1990's and the replacement driver was already available
on the Microsoft website.


From Thomas Koenig@21:1/5 to All on Sun May 19 11:17:41 2024

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.


From Michael S@21:1/5 to Thomas Koenig on Sun May 19 16:23:33 2024

On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

I.e. your actual running frequency was 3700 MHz?

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

There exists a middle ground between non-pipelined and fully pipelined
multiplier/FMA units. In fact, more than one middle ground.
Here is the mid-middle ground that I can imagine, not being a real hardware
guy:
1 - take a pair of existing VSU multipliers. By now they can do
53x53=>125bit unsigned multiplication. Enhance them to 57x57=>113bit
2 - during quad-precision FMA split 113x113 multiplication into 4
pieces and run them through the pair of multipliers, two at a time.
That would produce all parts of the 225-bit product at a rate of 1 product
per 2 clocks
3 - build adders just sufficient for the same throughput of 1 result
per 2 clocks.
Such a combined multiplier will have 2 clocks higher latency than a DP
multiplier.
After that we'll need matching alignment and addition/subtraction
blocks, but by doing them half-pipelined we can utilize the majority of
the existing dual-DP hardware and would need very little else, except for
control signals and probably a new feedback data path on the upper
side of the adder. All that could cost us another clock of latency over
DP FMA, but not necessarily so.
Bottom line: QP FMA with throughput of 1 result per 2 clocks and
latency of 8 or 9 clocks.
For POWER8, which has a less distributed VSU, such a modification would be
somewhat easier than for POWER9.

That's what I call a mid-middle ground.
Low-middle ground would be leaving 53x53=>125bit multipliers
unmodified. 113x113 multiplication is split into 9 pieces and
product is delivered every 5 clocks.
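
To make the piece counts concrete, here is a small software model of splitting a wide multiplication across narrower multipliers. It is only an arithmetic sketch (the function name and parameters are mine, and it says nothing about hardware timing): with 57-bit limbs a 113-bit operand needs 2 limbs, hence 4 partial products, and with 53-bit limbs it needs 3 limbs, hence 9.

```python
import random


def split_multiply(a, b, width, limb_bits):
    """Multiply two `width`-bit unsigned ints with limb_bits x limb_bits multiplies.

    Returns (product, number_of_partial_products).
    """
    limbs = -(-width // limb_bits)          # ceil(width / limb_bits)
    mask = (1 << limb_bits) - 1
    product = 0
    count = 0
    for i in range(limbs):
        a_i = (a >> (i * limb_bits)) & mask
        for j in range(limbs):
            b_j = (b >> (j * limb_bits)) & mask
            # Each a_i * b_j fits one limb_bits x limb_bits multiplier;
            # the shift places it at its weight in the full product.
            product += (a_i * b_j) << ((i + j) * limb_bits)
            count += 1
    return product, count


# Two random 113-bit significands with the top bit set.
rng = random.Random(1)
a = rng.getrandbits(112) | (1 << 112)
b = rng.getrandbits(112) | (1 << 112)
product_57, pieces_57 = split_multiply(a, b, 113, 57)   # enhanced multipliers
product_53, pieces_53 = split_multiply(a, b, 113, 53)   # unmodified multipliers
```

Both splits give the same full product; only the number of narrow multiplies differs.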

High-middle ground is enhancing both VSU pipes and using them to
process two QP FMAs simultaneously for combined throughput equivalent
to fully pipelined.

Another possible high-middle ground is, again, enhancing both VSU pipes
and using them together on a single QP FMA. That would be potentially
best for latency, but does not fit well into philosophy of POWER9
design that tries to minimize high-speed interaction between various
pipes.


From MitchAlsup1@21:1/5 to Michael S on Sun May 19 16:02:01 2024

Michael S wrote:

On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

I.e. your actual running frequency was 3700 MHz?

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

There exists a middle ground between non-pipelined and fully pipelined
multiplier/FMA units. In fact, more than one middle ground.
Here is the mid-middle ground that I can imagine, not being a real hardware
guy:
1 - take a pair of existing VSU multipliers. By now they can do
53x53=>125bit unsigned multiplication. Enhance them to 57x57=>113bit
2 - during quad-precision FMA split 113x113 multiplication into 4
pieces and run them through the pair of multipliers, two at a time.
That would produce all parts of the 225-bit product at a rate of 1 product
per 2 clocks
3 - build adders just sufficient for the same throughput of 1 result
per 2 clocks.
Such a combined multiplier will have 2 clocks higher latency than a DP
multiplier.

That is the slow middle ground using the multiplier at ½ rate. AND is
in fact the design point for my low end machine (the div 2 part, not
the quad precision part).

Instead, one can use the multiplier at full speed. If as you state
below, that FMAC is 9 cycles, DP FMAC here would be 7 cycles and
10 cycles for QP FMAC.

On the other hand, I worry about throughput after I saw a string of
42 instructions in a row all using FMAC function unit in one particular benchmark.

After that we'll need matching alignment and addition/subtraction
blocks, but by doing them half-pipelined we can utilize the majority of
the existing dual-DP hardware and would need very little else, except for
control signals and probably a new feedback data path on the upper
side of the adder. All that could cost us another clock of latency over
DP FMA, but not necessarily so.
Bottom line: QP FMA with throughput of 1 result per 2 clocks and
latency of 8 or 9 clocks.
For POWER8, which has a less distributed VSU, such a modification would be
somewhat easier than for POWER9.

That's what I call a mid-middle ground.
Low-middle ground would be leaving 53x53=>125bit multipliers
unmodified. 113x113 multiplication is split into 9 pieces and
product is delivered every 5 clocks.

High-middle ground is enhancing both VSU pipes and using them to
process two QP FMAs simultaneously for combined throughput equivalent
to fully pipelined.

Another possible high-middle ground is, again, enhancing both VSU pipes
and using them together on a single QP FMA. That would be potentially
best for latency, but does not fit well into philosophy of POWER9
design that tries to minimize high-speed interaction between various
pipes.


From Terje Mathisen@21:1/5 to Thomas Koenig on Sun May 19 18:37:51 2024

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm guessing
that this could at least be close to needing an extra cycle on its own
and/or heroic hardware?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Scott Lurndal@21:1/5 to [email protected] on Sun May 19 16:43:40 2024

[email protected] (MitchAlsup1) writes:

EricP wrote:

I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?

The other interpretation is that the unprivileged user is never allowed
direct access to an SR-IOV device--those are restricted to GuestOS (or
more privileged hypervisor threads).

Take a look at DPDK or ODP. Both support accessing PCI functions
directly from unprivileged processes on Intel and ARM64 systems;
doesn't matter if they're virtual functions via SR-IOV or physical
functions on a non-SRIOV device.

The key insight here is that SR-IOV is pretty much invisible to
the operating system - a virtual function looks just like a
physical function, and they're located the same way by scanning
the PCI configuration space via the ECAM. The host physical
function driver knows how to configure the SR-IOV capability
to expose the virtual functions, and the guest physical function
driver accesses one of those VFs thinking it is a standard PF.

A filesystem driver in an operating system just passes block
requests to a device driver (SATA/NVMe/FC/NIC), and for this
purpose SR-IOV is completely invisible to the filesystem
driver and the device drivers themselves.


From Scott Lurndal@21:1/5 to EricP on Sun May 19 16:38:11 2024

EricP <[email protected]> writes:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

MitchAlsup1 wrote:

Scott Lurndal wrote:

I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.

So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.

Does this mean the application has lost a level of indirection in order
to have become virtualized ?????

I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.

Most of the user's files are on the local system and SR-IOV works fine, but one
or more of his files exist on a remote machine accessed over the internet,
and the user still uses the SR-IOV interface to access those files.

If by "works fine" you mean is slower and has more overhead than
just pinning the pages first as DMA I/O does now.

(It's more work to initiate the I/O, fail when it attempts to DMA,
interrupt the CPU, run an ISR which queues a DPC which queues an APC back
to the thread, which pins the pages, then restarts the I/O,
than it is to just pin the pages and start an I/O.)

How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file.

For example, user opens a file which is an ln-s (a block containing a
URL to where the file is remotely stored) but user thinks file is local.

I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?

That link is traditionally established at file open inside the kernel file
system by cross-linking between a File Control Block (or whatever it's called)
and a Network Control Block representing that file on the network.
Each file syscall request is sent to the FCB then forwarded to the NCB
and out over a network link.

As I understand it, SR-IOV is a pseudo hardware device control *register*,

That's not quite correct. Consider PCI (or PCI Express) without
SR-IOV. The PCI designation for an assignable device entity is
called a 'function' aka 'physical function'. A PCI express device
can have up to 8 functions[*] - each of which is a full independent
controller instance (e.g. a SATA controller, or Network Interface
Controller).

There are three distinct address spaces for each function; a
configuration address space, a memory address space and an
I/O address space.

The configuration address space consists of 4096 bytes,
where the first 32 bytes describe a set of PCI configuration
registers first defined in the original PCI specification.

The remaining configuration space consists of two linked
lists of optional "capabilities", each of which has a specific
standard set of defined control and status registers.
There are dozens of optional capabilities, one of which is
the SR-IOV capability.
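
A minimal sketch of walking the extended capability list to find SR-IOV, assuming a 4096-byte snapshot of one function's configuration space held in a byte buffer. The header layout (capability ID in bits 15:0, next pointer in bits 31:20, list starting at offset 0x100) and the SR-IOV and ARI IDs follow the PCI Express specification as I recall it; `find_ext_capability` and the fake image are made up for illustration.

```python
import struct

EXT_CAP_START = 0x100      # extended capabilities begin after the first 256 bytes
EXT_CAP_ID_ARI = 0x000E    # alternate routing-ID interpretation
EXT_CAP_ID_SRIOV = 0x0010  # single root I/O virtualization


def find_ext_capability(cfg, wanted_id):
    """Return the offset of an extended capability in a config image, or None."""
    offset = EXT_CAP_START
    seen = set()   # guards against a malformed, looping list
    while offset and offset not in seen and offset + 4 <= len(cfg):
        seen.add(offset)
        (header,) = struct.unpack_from("<I", cfg, offset)
        if header in (0, 0xFFFFFFFF):      # empty list, or nothing responding
            return None
        cap_id = header & 0xFFFF
        if cap_id == wanted_id:
            return offset
        offset = (header >> 20) & 0xFFC    # next pointer, dword aligned
    return None


# Fake image for illustration: ARI at 0x100 linking to SR-IOV at 0x140.
cfg = bytearray(4096)
struct.pack_into("<I", cfg, 0x100, EXT_CAP_ID_ARI | (1 << 16) | (0x140 << 20))
struct.pack_into("<I", cfg, 0x140, EXT_CAP_ID_SRIOV | (1 << 16))
sriov_offset = find_ext_capability(cfg, EXT_CAP_ID_SRIOV)
```

A real driver would read these registers through the ECAM mapping rather than from a snapshot.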

[*] There is a PCI Express extended capability called alternate
routing interpretation (ARI) which allows a PCI Express device
to support up to 256 functions, which consumes an entire
PCI bus.

When the operating system scans the ECAM associated with the root
complex to which the device is attached, it will access the
configuration space of the first function (typically zero) to read
the first four bytes (vendor ID and device ID). Those are used to
index into a driver table and then load the necessary driver based
on the device ID. If the device advertises the SR-IOV capability in
its configuration address space, the driver will read the SR-IOV
capability status registers and update them to indicate how many
virtual functions should be advertised and what the associated
BAR registers should advertise for the VF BARs.
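
For illustration, the scan step can be sketched as follows, assuming the ECAM region is available as a byte buffer. The address layout (8-bit bus, 5-bit device, 3-bit function, 4096 bytes per function) is the standard ECAM one; `ecam_offset`, `read_ids` and the fake device ID are names and values I made up.

```python
import struct


def ecam_offset(bus, device, function, register=0):
    """Byte offset of a configuration register within an ECAM region."""
    if not (0 <= bus < 256 and 0 <= device < 32 and 0 <= function < 8):
        raise ValueError("bus/device/function out of range")
    if not 0 <= register < 4096:
        raise ValueError("register offset out of range")
    return (bus << 20) | (device << 15) | (function << 12) | register


def read_ids(ecam, bus, device, function):
    """Return (vendor_id, device_id), or None if no function responds."""
    offset = ecam_offset(bus, device, function)
    vendor_id, device_id = struct.unpack_from("<HH", ecam, offset)
    if vendor_id == 0xFFFF:     # all-ones read: nothing at this address
        return None
    return vendor_id, device_id


# Fake 1 MiB snapshot covering bus 0, with one function at 00:03.1.
ecam = bytearray(b"\xff" * (1 << 20))
struct.pack_into("<HH", ecam, ecam_offset(0, 3, 1), 0x8086, 0x1234)
```

The (vendor ID, device ID) pair is what gets matched against the driver table; a VF and a PF are located in exactly the same way.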

A physical function with SR-IOV can support up to 65535
virtual functions (consuming 256 256-function PCI buses)

whereas a disk file is a fictional logical device created by the file
system driver. I don't think one could use SR-IOV to send commands to
local file system software (maybe it could trap into the OS).

All of this is not relevant - filesystem as a concept is independent
of the underlying block storage mechanisms, and the beauty of
SR-IOV is that the VFs look to the guest like PFs, so the guest
just uses the standard non-SRIOV driver that matches the
vendor/device id in the configuration space for the VF.

So, when the filesystem needs a block of storage, the filesystem
will simply initiate a request to the VF as if it were a PF, whether
it is an NVMe adapter, Fibre Channel adapter, or NIC with SCSI-over-IP.


From Michael S@21:1/5 to Terje Mathisen on Sun May 19 20:34:03 2024

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits with two MS
bits != '00'. I don't see how we would ever need more than 229 bits fed
into the accumulation phase and into the following normalizer. Of course,
all bits that are lower than the LS bit have to be collapsed (by OR) into
the LS bit. Maybe even fewer than 229 bits will do; right now I am not sure.
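
The "collapse by OR" step is the usual sticky-bit trick. A minimal sketch on Python integers (the function name is mine):

```python
def shift_right_sticky(value, shift):
    """Shift an unsigned integer right, ORing all dropped bits into the new LS bit."""
    if shift <= 0:
        return value
    dropped = value & ((1 << shift) - 1)
    sticky = 1 if dropped else 0
    return (value >> shift) | sticky


exact = shift_right_sticky(0b10100, 2)     # dropped bits are 00, nothing lost
inexact = shift_right_sticky(0b11001, 2)   # dropped bits are 01, sticky bit set
```

The sticky bit keeps just enough information for the final rounding to know that something non-zero was shifted out.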


From MitchAlsup1@21:1/5 to Terje Mathisen on Sun May 19 20:52:03 2024

Terje Mathisen wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm guessing
that this could at least be close to needing an extra cycle on its own
and/or heroic hardware?

If you organize the multiplications and accumulations from most
significance towards least significance, this wide effect is pipelined
away, because you initialize the accumulation with the augend and check
for zero as multiplies fall out of the tree.

Terje


From MitchAlsup1@21:1/5 to Michael S on Sun May 19 21:07:49 2024

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?

Consider a 128-bit FP container.
1-bit sign
15-bit exponent
113-bit fraction

The augend can be larger than what comes out of the multiplier, the
worst case is 113-bits bigger (any bigger and the result of the tree
becomes irrelevant.)
The augend can be smaller than what comes out of the multiplier, the
worst case lines up below the lowest bit that comes out of the tree
(otherwise it would not participate in rounding).

Thus we have:
113-bit register below the multiplier,
225-bit product
113-bit incremented above the multiplier.
-------
450-bits. //this might be 2-bits wider than necessary.

BTW it's 207-bits for DP.

Assuming that subnormal multiplier inputs are normalized before multiplication,

Bad assumption for HW, maybe acceptable for SW.

the product of multiplication is 226 bits with two MS
bits != '00'. I don't see how we would ever need more than 229 bits fed
into accumulation phase and into following normalizer.

Augend can be positioned 113-bits above the tree or right below the
tree, thus the above arithmetic. Until you know the augend position,
there must be circuitry to determine where the HoB is, deNormalized
numbers only perturb this by small amounts of logic.

OH and BTW, one can build a Find-First circuit that is ¼ the size of
the leading zero predictor and no slower when selecting the HoB to
normalize. {Academic papers notwithstanding.}

Of course, all
bits that are lower than the LS bit have to be collapsed (by OR) into
the LS bit. Maybe even fewer than 229 bits will do; right now I am not sure.


From MitchAlsup1@21:1/5 to BGB on Sun May 19 21:16:31 2024

BGB wrote:

On 5/19/2024 11:37 AM, Terje Mathisen wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm guessing
that this could at least be close to needing an extra cycle on its own
and/or heroic hardware?

This sort of thing is part of what makes proper FMA hopelessly
expensive.

Getting the LoB correctly rounded showed up the generation prior to
FMAC showing up.

Granted, full FMA also allows faking higher precision using
SIMD vector operations, with math that does not work with double-rounded
FMA instructions.

It also enabled error free floating point calculations, but no existing
FP implementation allows exact FP calculations that do not ALSO SET the
inexact flag !?!? {Whereas My 66000 gets this right}
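
To make the point about single-rounded versus double-rounded FMA concrete, here is a sketch in Python (not code from any poster). It emulates a true FMA with exact rational arithmetic and compares it with a "faked" FMA that rounds the product first. The helper names are hypothetical; the example assumes Python floats are IEEE binary64.

```python
from fractions import Fraction


def fma_single_rounding(a: float, b: float, c: float) -> float:
    """Emulate a true FMA: compute a*b + c exactly, then round once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def fma_double_rounding(a: float, b: float, c: float) -> float:
    """Emulate a faked FMA: round the product, then round the sum."""
    return a * b + c


def two_product(a: float, b: float, fma) -> tuple[float, float]:
    """Return (p, e): p = fl(a*b), e = the given fma's value of a*b - p."""
    p = a * b
    e = fma(a, b, -p)
    return p, e


a = b = 1.0 + 2.0 ** -30            # exact product needs bits below the ULP
exact = Fraction(a) * Fraction(b)   # 1 + 2**-29 + 2**-60

p, e_true = two_product(a, b, fma_single_rounding)
_, e_fake = two_product(a, b, fma_double_rounding)
```

With the single-rounded FMA, `p + e_true` reproduces the exact product as a pair of doubles. With the double-rounded version the low part has already been thrown away, so `e_fake` comes out as zero.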

Well, and also an issue if one can "just barely" afford to have a single
double-precision unit.

This is NOT an architectural issue, but an implementation choice issue.

Though, the trick of possibly having four 27-bit multiplies which
combine into a virtual 54 bit multiplier seems like an interesting
possibility, though not great as DSP's don't natively handle this size
(and would be too expensive to stretch it out with LUTs). Likely, one
would need to build it from 34*34->68 bit multipliers (each costing 4
DSPs).
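
A rough sketch of the "four 27-bit multiplies combine into a virtual 54 bit multiplier" idea, written with Python integers. This only shows the arithmetic decomposition; it says nothing about how the partial products would map onto DSP blocks or LUTs, and the names are made up for the example.

```python
HALF = 27
MASK = (1 << HALF) - 1


def mul54_from_27(a: int, b: int) -> int:
    """Multiply two unsigned 54-bit values using four 27x27-bit products."""
    a_hi, a_lo = a >> HALF, a & MASK
    b_hi, b_lo = b >> HALF, b & MASK
    hh = a_hi * b_hi   # weight 2**54
    hl = a_hi * b_lo   # weight 2**27
    lh = a_lo * b_hi   # weight 2**27
    ll = a_lo * b_lo   # weight 2**0
    return (hh << (2 * HALF)) + ((hl + lh) << HALF) + ll


x = (1 << 54) - 1
y = (1 << 53) + 12345
product = mul54_from_27(x, y)
```

The same split works for any even width; the cost question raised in the thread is about how wide the native hardware multiplier is, not about the algebra.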

This is your implementation choice coloring what you take as
architectural decisions.

In terms of DSP cost, it would be higher than the current solution:
16 vs 6+4 (10).
But, possibly lower LUT cost (in both the Binary32 and Binary64
multipliers, the shortfall is made up using smaller LUT-based
multipliers).

We can now fit (5nm) hundreds of GBOoO cores on a single die. The
difference between a 53×53 tree and a 64×64 tree (makes all problems
vanish) is not visible at this level (100+ cores on a die).

This is your implementation choice coloring your thoughts.

Though, with the combiner option, one could make a case for, say, a:
S.E15.F66.Z46 format (Z=zeroed/ignored).

Well, and/or accept the wonk of a Binary128 which produces 112 bits of
mantissa, but only uses the high 66 bits or so, but generally this was
worse for some things in some tests than one which simply zeroes the
low-order bits.

But it allows for exact FP arithmetic, and for FMAC, ..... and lots of
other good properties.

What kind of car do you drive ??

But, OTOH, 66*66->112 would allow for possible trickery to fake a full
Binary128 FMUL in software as a multi-part process (when combined with a
Binary128 FADD).

A 1-bit wide machine can perform 128 × 128 + 128 FMACs -- it just takes
more time.

....

Terje


From MitchAlsup1@21:1/5 to BGB on Mon May 20 00:10:48 2024

BGB wrote:

On 5/19/2024 4:16 PM, MitchAlsup1 wrote:

BGB wrote:

On 5/19/2024 11:37 AM, Terje Mathisen wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

This sort of thing is part of what makes proper FMA hopelessly
expensive.

Getting the LoB correctly rounded showed up the generation prior to
FMAC showing up.

Well, in this case, I have neither in a proper sense.

FMAC operators were sorta faked, but mostly exist because they were
needed for RV64G, but double-rounded (and not able to expose anything
that exists below the ULP, unlike proper FMA).

But FMAC can expose the bits below LoB.

Granted, full FMA also allows faking higher precision using
SIMD vector operations, with math that does not work with
double-rounded FMA instructions.

It also enabled error free floating point calculations, but no existing
FP implementation allows exact FP calculations that do not ALSO SET the
inexact flag !?!? {Whereas My 66000 gets this right}

Dunno.

It seems like the existence of anything below the ULP justifies setting
the inexact flag...

You misunderstand !!

When one computes 2 Operands that are single wide, and can deliver a
single result twice as wide or a pair of results each single wide,
you are delivering all the bits, so there is no inexact. However, if
you use more than 1 instruction to perform the calculation, then, you
HAVE to set an inexact bit even though the delivery of the second
result makes the first setting of the inexact bit in error !!

My ISA is expressive enough to do this, just like IEEE 754-2019
requires on augmented addition and augmented subtraction.
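
For readers who have not met the "pair of results each single wide" idea, the classic software analogue is Knuth's TwoSum, sketched below in Python. This is only an illustration: IEEE 754-2019 augmentedAddition specifies its own rounding of the head (roundTiesTowardZero, as far as I recall), so this is not a bit-exact emulation of that operation, and it is not My 66000 code.

```python
from fractions import Fraction


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Return (s, err) with s = fl(a + b) and s + err == a + b exactly,
    barring overflow. Uses six ordinary round-to-nearest operations."""
    s = a + b
    b_virtual = s - a
    a_virtual = s - b_virtual
    err = (a - a_virtual) + (b - b_virtual)
    return s, err


head, tail = two_sum(1.0, 2.0 ** -60)
exact_sum = Fraction(1.0) + Fraction(2.0 ** -60)
```

Every one of the six operations above is an ordinary FP instruction that may raise inexact on its own, even though the pair `(head, tail)` taken together has lost nothing. That is the mismatch being complained about.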

Well, and also an issue if one can "just barely" afford to have a
single double-precision unit.

This is NOT an architectural issue, but an implementation choice issue.

Absent things like microcode or traps, architectural and implementation
choices are closely tied together. Can't have instructions for things
which one can't afford the hardware cost to implement.

I understand your limitations--the problem I have is that you express
your limitations AS-IF others should make the same choices you had to
make. And that is patently FALSE !!

Defending an indefensible position under the illusion that "That's all I
got to work with" is an insufficient defense against someone who has
more.

Well, and the usefulness of an FPU is dependent on performance.
Inaccurate FPU can still be useful, but slow FPU is not.

Kahan has several lectures about this....

Though, the trick of possibly having four 27-bit multiplies which
combine into a virtual 54 bit multiplier seems like an interesting
possibility, though not great as DSP's don't natively handle this size
(and would be too expensive to stretch it out with LUTs). Likely, one
would need to build it from 34*34->68 bit multipliers (each costing 4
DSPs).

This is your implementation choice coloring what you take as
architectural decisions.

In terms of DSP cost, it would be higher than the current solution:
16 vs 6+4 (10).
But, possibly lower LUT cost (in both the Binary32 and Binary64
multipliers, the shortfall is made up using smaller LUT-based
multipliers).

We can now fit (5nm) hundreds of GBOoO cores on a single die. The
difference between a 53×53 tree and a 64×64 tree (makes all problems
vanish) is not visible at this level (100+ cores on a die).

This is your implementation choice coloring your thoughts.

I can afford FPGAs...
I can't afford to get an ASIC made.

I am not asking you to spend big money--I am merely asking you to quit
defending "doing the wrong thing" when others have to follow standards.
{{If you properly caveated all your defense statements--I would not
complain.}}

So, implementation choices here are:
FPGA;
Nothing.

I have been wondering for a while--are the DSP things you build your
multiplier out of synthesized by Verilog compilation, or hard coded
into the gates themselves ?? Because if they are synthesized, you could
create Verilog that builds the multiplier tree of whatever size you
need without all the DSP overhead.

What kind of car do you drive ??

I don't drive a car...
I tend to fairly rapidly get tired out if trying to drive.

I was going to ask if your car had hand rolled windows, a manual
transmission, ... In the early 1980s all of us were similarly
constrained; computer architecture grew out of the fast-and-dirty
modus operandi and into the follow-standards modus operandi.


From Terje Mathisen@21:1/5 to Michael S on Mon May 20 09:24:16 2024

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to Terje Mathisen on Mon May 20 11:30:45 2024

On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra
cycle on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where the probability of encountering sub-normal inputs in real user
code, rather than in test vectors, is lower than for DP by another order
of magnitude.
3. Even if, for reasons unclear to me, it is considered the goal, it can
be achieved by introducing one more pipeline stage everywhere.
Since we are discussing a high-latency design akin to POWER9, the
relative cost of another stage would be lower. BTW, according to the
POWER9 manual, even for SP/DP FMA the latency is not constant. It varies
from 5 to 7.

So, IMHO, what you do to handle sub-normal inputs should depend on what
ends up smaller or faster, not on some abstract principles. For less
important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.

Besides, pre-normalization vs wider post-normalization are not the only
available choices. When the multiplier is naturally segmented into 57-bit
sections, there exists, for example, an option of pre-normalization by
full section. It looks very simple on the front and saves quite a lot
of shifter width on the back.
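
One possible reading of "pre-normalization by full section", sketched in Python. The two-section layout (2 x 57 = 114 bits holding a 113-bit significand) and all names are assumptions made for the illustration; the post does not spell out the details.

```python
SECTION = 57
SECTIONS = 2  # assumed layout: 2 * 57 = 114 bits for a 113-bit significand


def prenormalize_by_section(sig: int, exp: int) -> tuple[int, int]:
    """Shift sig left by whole sections while its top section is all zero,
    adjusting exp so that sig * 2**exp is unchanged."""
    width = SECTION * SECTIONS
    assert 0 <= sig < (1 << width)
    for _ in range(SECTIONS - 1):
        top = sig >> (width - SECTION)
        if top != 0 or sig == 0:
            break
        sig <<= SECTION   # a coarse shift: only a section-wide mux is needed
        exp -= SECTION
    return sig, exp


tiny = prenormalize_by_section(1, 0)            # deeply subnormal-looking input
normal = prenormalize_by_section(1 << 112, 5)   # hidden bit set, left untouched
```

The input is not fully normalized afterwards, but the leading one is guaranteed to sit in the top section, which is what bounds the width the back-end shifter has to cover.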

But the best option is probably the one described in the above post by
Mitch. If I understood his post correctly, he suggests having two
alignment stages: one after multiplication and another one after add/sub.
The shift count for the first stage is calculated from the inputs in
parallel with multiplication. The first alignment stage does not try to
achieve a perfect normalization, but it does enough to cut the width of
the following adder from 3N to 2N+eps.


From Thomas Koenig@21:1/5 to Michael S on Mon May 20 10:12:45 2024

Michael S <[email protected]> schrieb:

On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.

I.e. your actual running frequency was 3700 MHz?

Approximately, yes.


From Anton Ertl@21:1/5 to Michael S on Mon May 20 10:56:48 2024

Michael S <[email protected]> writes:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits

The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).
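
The 226-bit figure is easy to check with Python's arbitrary-precision integers. A small sketch, assuming the binary128 significand is 113 bits including the hidden bit:

```python
P = 113  # binary128 significand width, including the hidden bit

max_sig = (1 << P) - 1        # all-ones significand
min_norm_sig = 1 << (P - 1)   # smallest normalized significand (hidden bit only)
max_subnormal_sig = (1 << (P - 1)) - 1  # hidden bit clear

max_product_bits = (max_sig * max_sig).bit_length()
min_norm_product_bits = (min_norm_sig * min_norm_sig).bit_length()
subnormal_product_bits = (max_subnormal_sig * max_sig).bit_length()
```

The largest product fills all 226 bits, a product of two normalized significands always has its leading one in one of the top two positions, and a subnormal operand can only make the product narrower, never wider.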

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Terje Mathisen@21:1/5 to Anton Ertl on Mon May 20 14:28:50 2024

Anton Ertl wrote:

Michael S <[email protected]> writes:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits

The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).

This is the part of Mitch's explanation that I have never been able to
totally grok. I do think you could get away with fewer bits, but only if
you can collapse the extra mantissa bits into sticky while aligning the
product with the addend. If that takes too long or it turns out to be
easier/faster in hardware to simply work with a much wider mantissa,
then I'll accept that.

I don't think I've ever seen Mitch make a mistake on anything like this!

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Michael S@21:1/5 to Terje Mathisen on Mon May 20 15:36:30 2024

On Mon, 20 May 2024 14:22:00 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would
be an entirely different beast, I would expect a throughput of
1 per cycle and a latency of (maybe) one cycle more than 64-bit
FMA.

The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but
I'm guessing that this could at least be close to needing an
extra cycle on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user
code, rather than in test vector, is lower than DP by another order
of magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it
can be achieved by introduction of one more pipeline stage
everywhere. Since we are discussing high-latency design akin to
POWER9, the relative cost of another stage would be lower. BTW,
according to POWER9 manual, even for SP/DP FMA the latency is not
constant. It varies from 5 to 7.

So, IMHO, what you do to handle sub-normal inputs should depend on
what ends up smaller or faster, not on some abstract principles.
For less important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.

Besides, pre-normalization vs wider post-normalization are not the
only available choices. When multiplier is naturally segmented into
57-bit section, there exists, for example, an option of
pre-normalization by full section. It looks very simple on the
front and saves quite a lot of shifter's width on the back.

But the best option is probably described in above post by Mitch.
If I understood his post correctly, he suggests to have two
alignment stages: one after multiplication and another one after
add/sub. The shift count for a first stage is calculated from
inputs in parallel with multiplication. The first alignment stage
does not try to achieve a perfect normalizations, but it does
enough for cutting the width of following adder from 3N to 2N+eps.

I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block.

In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.

Terje

For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.
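
A quick numeric check of the subnormal x subnormal claim, done for binary64 because that is what Python floats are (assumed IEEE binary64 here); the same reasoning carries over to binary128. Exact rationals are used to avoid any rounding in the check itself.

```python
import math
from fractions import Fraction

MIN_SUBNORMAL = Fraction(1, 2 ** 1074)            # smallest positive binary64 value
MAX_SUBNORMAL = Fraction(2 ** 52 - 1, 2 ** 1074)  # largest binary64 subnormal

# Even the largest possible subnormal x subnormal product...
exact_product = MAX_SUBNORMAL * MAX_SUBNORMAL
# ...lies below half of the smallest representable value, so in
# round-to-nearest it must round to zero.
rounds_to_zero = exact_product < MIN_SUBNORMAL / 2

largest_subnormal = math.ldexp(float(2 ** 52 - 1), -1074)
float_product = largest_subnormal * largest_subnormal
```

So the multiplier array contributes nothing in this case beyond "non-zero, with this sign", which matters only for the directed rounding modes and for raising inexact/underflow.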


From Terje Mathisen@21:1/5 to Michael S on Mon May 20 14:22:00 2024

Michael S wrote:

On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.

The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra
cycle on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user code,
rather than in test vector, is lower than DP by another order of
magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it can
be achieved by introduction of one more pipeline stage everywhere.
Since we are discussing high-latency design akin to POWER9, the
relative cost of another stage would be lower. BTW, according to POWER9
manual, even for SP/DP FMA the latency is not constant. It varies from
5 to 7.

So, IMHO, what you do to handle sub-normal inputs should depend on what
ends up smaller or faster, not on some abstract principles. For less
important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.

Besides, pre-normalization vs wider post-normalization are not the only
available choices. When multiplier is naturally segmented into 57-bit
section, there exists, for example, an option of pre-normalization by
full section. It looks very simple on the front and saves quite a lot
of shifter's width on the back.

But the best option is probably described in above post by Mitch. If I
understood his post correctly, he suggests to have two alignment stages:
one after multiplication and another one after add/sub. The shift count
for a first stage is calculated from inputs in parallel with
multiplication. The first alignment stage does not try to achieve a
perfect normalizations, but it does enough for cutting the width of
following adder from 3N to 2N+eps.

I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point down
for each all-zero input block.

In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From EricP@21:1/5 to Chris M. Thomasson on Mon May 20 11:44:19 2024

Chris M. Thomasson wrote:

On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:

On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:

On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
[...]

I remember a little test that Microsoft made wrt 50,000 concurrent
OVERLAPPED ops in IOCP vs an event driven model actually creating a
windows event per connection multiplexing WFMO in several threads.
The event model did not perform as well, but it did not do too bad
either. I wonder if I can still find that paper. Back in 2002 or
something. Hard to remember right now.

I am having trouble finding it. I do remember:

https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN

I just found an old post from me back in 2003 with a link to the paper:
___________________
You can get 50,000+ concurrent connections using IOCP, check out the
following link:

http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128

You do have to do some memory management to get there, like posting zero
byte receives to ensure that pending recvs don't lock their buffers,
you can also restrict the amount of pending sends the server has all
together
[...]
___________________

The way back machine found it, I think!

https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128

Nice!

Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).


From MitchAlsup1@21:1/5 to BGB on Mon May 20 18:04:42 2024

BGB wrote:

On 5/19/2024 7:10 PM, MitchAlsup1 wrote:

Kahan has several lectures about this....

There have been apparently more things killed off by slow performance
than by lack of FPU accuracy.

Say, at the time, performance apparently killed off:
Amiga (killed off by its slow graphics)
Bit planar graphics rather sucking if one wants fast screen
redraws;
M68K, killed off for being too slow vs x86;
Cyrix, because its Pentium equivalent was slow at running Quake;
...

Mc68K (most of it at least) is living out its days as an automotive
engine controller.


From Terje Mathisen@21:1/5 to Michael S on Mon May 20 21:17:15 2024

Michael S wrote:

On Mon, 20 May 2024 14:22:00 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

Thomas Koenig wrote:

So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would
be an entirely different beast, I would expect a throughput of
1 per cycle and a latency of (maybe) one cycle more than 64-bit
FMA.

The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but
I'm guessing that this could at least be close to needing an
extra cycle on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user
code, rather than in test vector, is lower than DP by another order
of magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it
can be achieved by introduction of one more pipeline stage
everywhere. Since we are discussing high-latency design akin to
POWER9, the relative cost of another stage would be lower. BTW,
according to POWER9 manual, even for SP/DP FMA the latency is not
constant. It varies from 5 to 7.

So, IMHO, what you do to handle sub-normal inputs should depend on
what ends up smaller or faster, not on some abstract principles.
For less important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.

Besides, pre-normalization vs wider post-normalization are not the
only available choices. When multiplier is naturally segmented into
57-bit section, there exists, for example, an option of
pre-normalization by full section. It looks very simple on the
front and saves quite a lot of shifter's width on the back.

But the best option is probably described in above post by Mitch.
If I understood his post correctly, he suggests to have two
alignment stages: one after multiplication and another one after
add/sub. The shift count for a first stage is calculated from
inputs in parallel with multiplication. The first alignment stage
does not try to achieve a perfect normalizations, but it does
enough for cutting the width of following adder from 3N to 2N+eps.

I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block.

In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.

Terje

For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.

Yeah, Mea Culpa! I did correct that particular brain fart a few minutes
later in my subsequent post, but it is not possible for the
multiplication to produce a result far below the subnormal limit.

As you note, it is only when using RoundToPlus (or Minus) Infinity that
an arbitrary small product can still produce a non-zero result.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From MitchAlsup1@21:1/5 to BGB on Mon May 20 19:58:18 2024

BGB wrote:

On 5/20/2024 7:36 AM, Michael S wrote:

For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.

For most non-tiny formats, the seeming advantage of subnormal numbers
seems small, in any case.

There is, it is called Posit (or UNUM depending).
No subnormals, wider range than IEEE, more precision than IEEE
(most of the time). Whether it is better overall is still a
matter of debate. It is harder to implement than IEEE but
just barely.

But, yeah, in any case I would almost prefer if there could be a
separate/cheaper standard, probably mostly aimed at
embedded/microcontroller style use-cases (rather than "general
purpose"), and would likely relax the requirements a fair bit.

Say, likely target might be, say:
FADD/FSUB/FMUL;
Binary16 and Binary32 as high-priority formats;
Binary64 as optional (but nice to have);
Probably DAZ/FTZ;
Potentially allow for truncate-only rounding.

Assumption being that larger or higher precision cases would fall back
to software emulation.

Could optionally have some 8-bit FP formats, but 8-bit FP is a little
bit too limited for general-purpose use.

Likely main candidates being:
S.E4.F3 (Bias=7)
S.E3.F4 (Bias=7|8, ~ Unit Range)
More or less A-Law without the XOR.
Though, A-Law can also be interpreted as a ~ 12 bit integer value.
Annoyingly, exact bias depends on context for this one
(eg: 8/7/3/0)...
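
For concreteness, a sketch of how an S.E4.F3 byte with Bias=7 might be decoded. The post does not say how exponent 0 or the all-ones exponent are treated, so this assumes IEEE-style subnormals at exponent 0 and no Inf/NaN encodings; both are guesses, and the function name is made up.

```python
def decode_s_e4_f3(byte: int, bias: int = 7) -> float:
    """Decode one S.E4.F3 byte: 1 sign bit, 4 exponent bits, 3 fraction bits."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exponent = (byte >> 3) & 0xF
    fraction = byte & 0x7
    if exponent == 0:
        # Assumed: IEEE-style subnormal, i.e. no hidden bit.
        return sign * (fraction / 8.0) * 2.0 ** (1 - bias)
    # Assumed: every other exponent is a normal number (no Inf/NaN).
    return sign * (1.0 + fraction / 8.0) * 2.0 ** (exponent - bias)


one = decode_s_e4_f3(0x38)             # 0 0111 000
one_and_a_half = decode_s_e4_f3(0x3C)  # 0 0111 100
minus_one = decode_s_e4_f3(0xB8)       # 1 0111 000
```

A DAZ/FTZ variant, which seems closer to the spirit of the proposal, would simply return a signed zero in the `exponent == 0` branch.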

I had also used:
E4.F4
E4.F3.S
But, this is wonky (and the possible merit of E4.F3.S is defeated once
one also needs S.E4.F3 or S.E3.F4, as these are the "actually used in
the wild" formats, so may have been a mistake).
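
As an illustration of the S.E4.F3 (Bias=7) layout listed above, here is a minimal C sketch of a decoder to Binary32. The function name and the handling of the exponent field are assumptions, not taken from the post: exponent field 0 is flushed to zero (in the DAZ/FTZ spirit mentioned above), exponent field 15 is treated as an ordinary finite value rather than Inf/NaN, and float is assumed to be IEEE 754 Binary32.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode an S.E4.F3 (bias 7) byte to a float.
 * Assumptions (not from the post): exponent field 0 means zero (no
 * subnormals), and exponent field 15 is an ordinary finite value. */
float fp8_e4f3_to_float(uint8_t v)
{
    uint32_t sign = (uint32_t)(v >> 7) & 1u;
    uint32_t exp  = (uint32_t)(v >> 3) & 0xFu;
    uint32_t frac = (uint32_t)v & 0x7u;
    uint32_t bits = sign << 31;

    if (exp != 0) {
        /* Rebias from 7 to 127 and left-align the 3 fraction bits
         * in the 23-bit Binary32 fraction field. */
        bits |= (exp + (127u - 7u)) << 23;
        bits |= frac << 20;
    }

    float out;
    memcpy(&out, &bits, sizeof out);  /* bit copy, avoids aliasing issues */
    return out;
}
```

Under these assumptions 0x38 decodes to 1.0, 0x3C to 1.5 and 0xC0 to -2.0; a format with Inf/NaN or subnormals would need extra cases for exponent fields 15 and 0.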

I spent some of my youth trying to push against immovable objects
(i.e., standards). Don't do it: it is a waste of effort and time,
similar to putting lipstick on a pig.

From Thomas Koenig@21:1/5 to [email protected] on Mon May 20 21:56:22 2024

MitchAlsup1 <[email protected]> schrieb:

BGB wrote:

On 5/20/2024 7:36 AM, Michael S wrote:

For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.

For most non-tiny formats, the seeming advantage of subnormal numbers
seems small, in any case.

There is, it is called Posit (or UNUM depending).
No subnormals, wider range than IEEE, more precision than IEEE
(most of the time). Whether it is better overall is still a
matter of debate. It is harder to implement than IEEE but
just barely.

My guess is that it will never catch on. Having accuracy depend
on the number range is an idea that people who prove things about
numerical algorithms tend to dislike.

From Thomas Koenig@21:1/5 to BGB on Tue May 21 05:49:55 2024

BGB <[email protected]> schrieb:

Granted, they are not necessarily the option one would go if they wanted "cheapest possible FPU that is still good enough to be usable".

Though, the point at which an FPU manages to suck badly enough that one
needs to resort to software emulation to make software work, is probably
a lower limit.

Luckily, "uses 754 formats, but with aggressive cost cutting" can be
"good enough", and so long as they more-or-less deliver a full width mantissa, and can exactly compute exact-value calculations, most
software is generally going to work.

This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.

But OTOH, if 1.0+2.0 gives 2.999999, that is not good enough, so there
is a lower limit here.

An example of a more interesting question is

if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}

From Michael S@21:1/5 to Terje Mathisen on Tue May 21 10:46:59 2024

On Mon, 20 May 2024 21:17:15 +0200
Terje Mathisen <[email protected]> wrote:

As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.

Terje

I think we were discussing the multiplication stage of FMA rather than
multiplication proper.
In the case of FMA, the zeroness (zeroity?) and sign of a tiny product
matter in all standard IEEE rounding modes except the default (RNE).

From Michael S@21:1/5 to Thomas Koenig on Tue May 21 11:19:03 2024

On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

BGB <[email protected]> schrieb:

Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".

Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is probably a lower limit.

Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.

This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.

But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.

An example of a more interesting question is

if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}

If I am not mistaken, that should hold on VAX, which has floating-point
very close to BGB ideal. It looks like it would hold even on less
robust formats, like IBM's hex float. I wonder where it is not true?

The biggest difference between IEEE and VAX is that on IEEE when (a > b)
then (a - b > 0) while on VAX (a - b >= 0).
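
The IEEE half of that statement can be checked with a short C sketch. Gradual underflow keeps a - b nonzero whenever a > b. The flush_to_zero helper below is a hypothetical emulation of a format without subnormals; it is only an assumption standing in for VAX-style behavior, not real VAX arithmetic. The names sub_ieee and sub_ftz are made up for this example, and the host is assumed to run with subnormals enabled (no FTZ/DAZ mode).

```c
#include <float.h>

/* Hypothetical emulation of flush-to-zero: results smaller in magnitude
 * than the smallest normal number become zero. This mimics a format
 * without subnormals; it is not real VAX arithmetic. */
double flush_to_zero(double x)
{
    if (x > -DBL_MIN && x < DBL_MIN)
        return 0.0;
    return x;
}

/* IEEE 754 subtraction, relying on the host's gradual underflow. */
double sub_ieee(double a, double b)
{
    return a - b;
}

/* The same subtraction with the result flushed to zero. */
double sub_ftz(double a, double b)
{
    return flush_to_zero(a - b);
}
```

With a = 1.5 * DBL_MIN and b = DBL_MIN, a > b holds, sub_ieee(a, b) is the subnormal 0.5 * DBL_MIN, and sub_ftz(a, b) is 0.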

Of course, IEEE has non-intuitive cases as well.
if (!(a < 0)) {
    if (!(b < 0)) {
        if (!(a + b >= a)) {
            printf("It's IEEE 754, baby!\n");
        }
    }
}

From Terje Mathisen@21:1/5 to Michael S on Tue May 21 13:18:32 2024

Michael S wrote:

On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

BGB <[email protected]> schrieb:

Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".

Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is
probably a lower limit.

Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.

This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.

But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.

An example of a more interesting question is

if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}

If I am not mistaken, that should hold on VAX, which has floating-point
very close to BGB ideal. It looks like it would hold even on less
robust formats, like IBM's hex float. I wonder where it is not true?

The biggest difference between IEEE and VAX is that on IEEE when (a > b)
then (a - b > 0) while on VAX (a - b >= 0).

Of course, IEEE has non-intuitive cases as well.
if (!(a < 0)) {
    if (!(b < 0)) {
        if (!(a + b >= a)) {
            printf("It's IEEE 754, baby!\n");
        }
    }
}

What happens when a and/or b is a NaN?

Comparisons with NaN should return false, so !(NaN < 0) will be true
(and the same for b), while !(NaN+b >= NaN) will also return true.

Is that what you were thinking of?
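
The NaN case can be made concrete with a small C sketch that wraps both snippets as functions returning 1 when the innermost branch is reached. The function names are invented for this illustration, and the code assumes the compiler follows IEEE 754 comparison semantics (for example, no -ffast-math).

```c
#include <math.h>

/* Returns 1 when the innermost branch of the negated-comparison version
 * is reached, i.e. when the "It's IEEE 754, baby!" message would print. */
int reaches_negated_branch(double a, double b)
{
    if (!(a < 0)) {
        if (!(b < 0)) {
            if (!(a + b >= a)) {
                return 1;
            }
        }
    }
    return 0;
}

/* Returns 1 when the innermost branch of the direct-comparison version
 * is reached, i.e. when abort() would be called. */
int reaches_direct_branch(double a, double b)
{
    if (a >= 0.) {
        if (b >= 0) {
            if (a + b < a) {
                return 1;
            }
        }
    }
    return 0;
}
```

Calling reaches_negated_branch(NAN, 1.0) or reaches_negated_branch(1.0, NAN) returns 1, because every ordered comparison involving a NaN is false. The direct version returns 0 for the same inputs, since its outer tests already fail.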

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Michael S@21:1/5 to Terje Mathisen on Tue May 21 14:28:01 2024

On Tue, 21 May 2024 13:18:32 +0200
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

BGB <[email protected]> schrieb:

Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".

Though, the point at which an FPU manages to suck badly enough
that one needs to resort to software emulation to make software
work, is probably a lower limit.

Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.

This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.

But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.

An example of a more interesting question is

if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}

If I am not mistaken, that should hold on VAX, which has
floating-point very close to BGB ideal. It looks like it would hold
even on less robust formats, like IBM's hex float. I wonder where
it is not true?

The biggest difference between IEEE and VAX is that on IEEE when
(a > b) then (a - b > 0) while on VAX (a - b >= 0).

Of course, IEEE has non-intuitive cases as well.
if (!(a < 0)) {
    if (!(b < 0)) {
        if (!(a + b >= a)) {
            printf("It's IEEE 754, baby!\n");
        }
    }
}

What happens when a and/or b is a NaN?

Comparisons with NaN should return false, so !(NaN < 0) will be true
(and the same for b), while !(NaN+b >= NaN) will also return true.

Is that what you were thinking of?

Yes

Terje

From MitchAlsup1@21:1/5 to BGB on Tue May 21 17:17:10 2024

BGB wrote:

But OTOH, if 1.0+2.0 gives 2.999999, that is not good enough, so there
is a lower limit here.

1.0 has a fraction of 0.
2.0 has a fraction of 0.
1.0+2.0 has a fraction with only the high-order bit (HOB) set.
All 3 examples above have the hidden bit set.

Any implementation purporting to be IEEE 754 had better not give
anything other than 3.0 !!

Bad example. Even IBM and CRAY arithmetics, with all their problems,
were not that bad.
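
The fraction fields mentioned above can be inspected directly. The following is a minimal C sketch, assuming float is IEEE 754 Binary32; the helper names are made up for this example.

```c
#include <stdint.h>
#include <string.h>

/* Return the 23 stored fraction bits of a Binary32 value (the hidden
 * bit is not stored). Assumes float is IEEE 754 Binary32. */
uint32_t binary32_fraction(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits & 0x007FFFFFu;
}

/* Return the 8-bit biased exponent field of a Binary32 value. */
uint32_t binary32_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    return (bits >> 23) & 0xFFu;
}
```

binary32_fraction() returns 0 for both 1.0f and 2.0f, and 0x400000 (only the top fraction bit) for 1.0f + 2.0f. The sum is exactly representable, so no rounding is involved.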

From MitchAlsup1@21:1/5 to Michael S on Tue May 21 17:11:22 2024

Michael S wrote:

On Mon, 20 May 2024 21:17:15 +0200
Terje Mathisen <[email protected]> wrote:

As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.

Terje

I think, we were discussing multiplication stage of FMA rather than multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.

Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.

But that ship sailed 50 years ago.

From MitchAlsup1@21:1/5 to BGB on Tue May 21 17:19:02 2024

BGB wrote:

Errm, I was promoting the idea of cost-cut floating point, not blatantly
broken floating point...

Would you promote the idea where the customer could specify whether his
car had air bags and crash safety cell or not ??

Same point here.

From Thomas Koenig@21:1/5 to Michael S on Tue May 21 17:53:47 2024

Michael S <[email protected]> schrieb:

On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:

BGB <[email protected]> schrieb:

Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".

Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is
probably a lower limit.

Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.

This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.

But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.

An example of a more interesting question is

if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}

If I am not mistaken, that should hold on VAX, which has floating-point
very close to BGB ideal. It looks like it would hold even on less
robust formats, like IBM's hex float. I wonder where it is not true?

IIRC, something like that was possible when mixing 80-bit and 64-bit
quantities on x387.

But the code I posted probably does not qualify for this.

From Stefan Monnier@21:1/5 to All on Tue May 21 15:56:47 2024

I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.
Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.
But that ship sailed 50 years ago.

Wouldn't that just kick the problem down the street?
For example, what should `x < y` return when both `x` and `y` are "infinity+epsilon"?

Stefan

From MitchAlsup1@21:1/5 to Stefan Monnier on Tue May 21 22:20:20 2024

Stefan Monnier wrote:

I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.
Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.
But that ship sailed 50 years ago.

Wouldn't that just kick the problem down the street?
For example, what should `x < y` return when both `x` and `y` are "infinity+epsilon"?

You mean -infinity+epsilon or +infinity-epsilon. +infinity+epsilon
is +infinity ...

IEEE 754 has +infinity == +infinity && -infinity == -infinity

Stefan

From MitchAlsup1@21:1/5 to BGB on Tue May 21 22:12:43 2024

BGB wrote:

Currently, FCMPEQ will be false on NaN, whereas FCMPGT ignores NaN's.

I guess one possibility would be to add an FCMPGE instruction, and then
make both FCMPGT and FCMPGE be false on NaN (where the LT and LE cases
can be handled by flipping the arguments, so would still be false on NaN).

So, as-is:
    if(!(a==a))
    {
        //NaN
    }

But, as-is:
    if(a>0)
    {
        //May still potentially get here with NaN
    }

Not when the compare is true to IEEE 754. When there is a floating point
compare and one of the operands is NaN, none of the 6 standard comparison
forms are true and control transfers to the else-clause.

In your top example the ! (not) causes NaNs to go to the then-clause

In your bottom example, no NaN is allowed into the then-clause.

From Terje Mathisen@21:1/5 to All on Wed May 22 09:12:55 2024

MitchAlsup1 wrote:

BGB wrote:

Errm, I was promoting the idea of cost-cut floating point, not blatantly
broken floating point...

Would you promote the idea where the customer could specify whether his
car had air bags and crash safety cell or not ??

Same point here.

We do have that in the form of having optional air bags: When we bought
a Skoda Octavia many years ago, the head-protecting upper air bags were
not required by law so we paid to get them as an option.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to All on Wed May 22 09:10:40 2024

MitchAlsup1 wrote:

Michael S wrote:

On Mon, 20 May 2024 21:17:15 +0200
Terje Mathisen <[email protected]> wrote:

As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.

Terje

I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.

Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.

But that ship sailed 50 years ago.

Not entirely: I was recently very surprised to learn that in non-default rounding modes, you can in fact get behavior close to but not quite what
you want. I.e. if rounding would cause overflow from maximally large
normal to infinity, then the rounding up is suppressed.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From EricP@21:1/5 to All on Wed May 22 10:22:33 2024

MitchAlsup1 wrote:

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.

Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.

But that ship sailed 50 years ago.

If an overflow occurs and that exception is masked, x86 returns a
value of either +-largest finite number (LFN) or +-infinity (INF),
depending on the rounding mode.
(vol-1 Arch manual, section 4.9.1.4, Numeric Overflow Exception,
Table 4-10. Masked Responses to Numeric Overflow)

Rounding_Mode  Sign_of_Result  Result
-------------  --------------  -------------------------------
To nearest     +               +∞
               –               –∞
Toward –∞      +               Largest finite positive number
               –               –∞
Toward +∞      +               +∞
               –               Largest finite negative number
Toward zero    +               Largest finite positive number
               –               Largest finite negative number

The difference seems to be that INF is a sticky overflow, LFN is not.
Would this not satisfy everyone?
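
For a software emulation, the table above can be written as a small lookup. This is a minimal C sketch for Binary64; the enum and function names are hypothetical, and the code only encodes the table rather than touching the hardware control register.

```c
#include <float.h>
#include <math.h>

/* Hypothetical names for the four rounding modes in the table. */
enum rounding_mode {
    ROUND_TO_NEAREST,
    ROUND_TOWARD_NEG_INF,
    ROUND_TOWARD_POS_INF,
    ROUND_TOWARD_ZERO
};

/* Masked-overflow result for Binary64, following the table above:
 * infinity when the mode rounds away from zero for this sign,
 * otherwise the largest finite number (LFN) of that sign.
 * 'negative' is nonzero for a negative result. */
double masked_overflow_result(enum rounding_mode mode, int negative)
{
    int to_infinity;

    switch (mode) {
    case ROUND_TO_NEAREST:     to_infinity = 1;         break;
    case ROUND_TOWARD_NEG_INF: to_infinity = negative;  break;
    case ROUND_TOWARD_POS_INF: to_infinity = !negative; break;
    default:                   to_infinity = 0;         break; /* toward zero */
    }

    double magnitude = to_infinity ? INFINITY : DBL_MAX;
    return negative ? -magnitude : magnitude;
}
```

Making the LFN/INF choice a separate option, as suggested above, would amount to passing to_infinity in directly instead of deriving it from the rounding mode.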

The problem is that it requires diddling the control register to change
the round mode, as opposed to round mode on each float instruction.
Or maybe even the LFN/INF overflow choice should be a separate option independent of round control bits.

From MitchAlsup1@21:1/5 to EricP on Wed May 22 16:52:32 2024

EricP wrote:

MitchAlsup1 wrote:

Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.

Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.

But that ship sailed 50 years ago.

If an overflow occurs and that exception is masked, x86 returns a
value of either +-largest finite number (LFN) or +-infinity (INF),
depending on the rounding mode.
(vol-1 Arch manual, section 4.9.1.4, Numeric Overflow Exception,
Table 4-10. Masked Responses to Numeric Overflow)

Rounding_Mode  Sign_of_Result  Result
-------------  --------------  -------------------------------
To nearest     +               +∞
               –               –∞
Toward –∞      +               Largest finite positive number
               –               –∞
Toward +∞      +               +∞
               –               Largest finite negative number
Toward zero    +               Largest finite positive number
               –               Largest finite negative number

The difference seems to be that INF is a sticky overflow, LFN is not.
Would this not satisfy everyone?

The problem is that it requires diddling the control register to change
the round mode, as opposed to round mode on each float instruction.
Or maybe even the LFN/INF overflow choice should be a separate option independent of round control bits.

Yes, what we need is a rounding mode that suppresses overflow and
underflow but is otherwise RNE.

From MitchAlsup1@21:1/5 to BGB-Alt on Thu May 23 21:39:12 2024

BGB-Alt wrote:

On 5/20/2024 7:28 AM, Terje Mathisen wrote:

Anton Ertl wrote:

Michael S <[email protected]> writes:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits

The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).

This is the part of Mitch's explanation that I have never been able to
totally grok. I do think you could get away with fewer bits, but only if
you can collapse the extra mantissa bits into sticky while aligning the
product with the addend. If that takes too long or it turns out to be
easier/faster in hardware to simply work with a much wider mantissa,
then I'll accept that.

I don't think I've ever seen Mitch make a mistake on anything like
this!

It is a mystery, though seems like maybe Binary128 FMA could be done in
software via an internal 384-bit intermediate?...

My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
228 bits. If one adds another 116 bits (for maximal FADD), this comes
to 344.

Maximal product with minimal augend::

    pppppppp-pppppppp-aaaaaaaa

Maximal augend with minimal product::

    aaaaaaaa-pppppppp-pppppppp

So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::

       max                        min
    aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
             pppppppp-pppppppp

The output of the product is still in carry-save form and the augend is
in pure binary so the adder is 3-input for 2×-width. This generates a
carry into the high order incrementor.

So one has a sticky generator for the right hand side augend, and an
incrementor for the left hand side augend. When doing high speed
denormals one cannot count on the left hand side of product to have HoBs
set with standard ramifications (imagine a denorm product and a denorm
augend and you want the right answer.)

Any way you cook it, you have a 4× wide intermediate (minus 2 bits
IIRC): 4×112 = 448, and 448 - 2 = 446.

There is a reason these things are not standard at this point of
technology.

Could you do it (IEEE accuracy) with less HW? Yes, but only if you
allow certain special cases to take more cycles in calculation. At a
certain point (a point made by Terje) it is easier to implement with
wide integer calculations 128+128 and/or 128*128 along with double
width shifts, inserts, and extracts.

IEEE did not make these things any easier by having a 2× std width
fraction have 2×+3 bits of length requiring 8 multiplications with
minimal HW instead of 4 multiplications. On the other hand IBM did us
no favors with Hex FP either (keeping the exponent size the same and
having 2×+8 bits of fraction.)

In this case, 384 bits would be because my "_BitInt" support code pads
things to a multiple of 128 bits (for integer types larger than 256
bits).

It isn't fast, but I am not against having Binary128 being slower, since
if one is using Binary128 ("long double" or "__float128" in this case),
it is likely the case that precision is more a priority than speed.

Though, as of yet, there is no Binary128 FMA operation (in the software
runtime). Could potentially add this in theory.

I guess, maybe also possible could be whether to add the
FADDX/FMULX/FMACX instructions in a form where they are allowed, but
will be turned into runtime traps (would likely route them through the
TLB Miss ISR, which thus far has ended up as a catch-all for this sort
of thing...).

Though, likely more efficient would still be "just use the runtime
calls".

Terje

From Terje Mathisen@21:1/5 to All on Fri May 24 09:07:35 2024

MitchAlsup1 wrote:

BGB-Alt wrote:

On 5/20/2024 7:28 AM, Terje Mathisen wrote:

Anton Ertl wrote:

Michael S <[email protected]> writes:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?

Terje

Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits

The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).

This is the part of Mitch's explanation that I have never been able to
totally grok. I do think you could get away with fewer bits, but only if
you can collapse the extra mantissa bits into sticky while aligning the
product with the addend. If that takes too long or it turns out to be
easier/faster in hardware to simply work with a much wider mantissa,
then I'll accept that.

I don't think I've ever seen Mitch make a mistake on anything like
this!

It is a mystery, though seems like maybe Binary128 FMA could be done in
software via an internal 384-bit intermediate?...

My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
228 bits. If one adds another 116 bits (for maximal FADD), this comes
to 344.

Maximal product with minimal augend::

    pppppppp-pppppppp-aaaaaaaa

Maximal augend with minimal product

    aaaaaaaa-pppppppp-pppppppp

So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::

       max                        min
    aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
             pppppppp-pppppppp

The output of the product is still in carry-save form and the augend is
in pure binary so the adder is 3-input for 2×-width. This generates a
carry into the high order incrementor.

So one has a sticky generator for the right hand side augend, and an
incrementor for the left hand side augend. When doing high speed
denormals one cannot count on the left hand side of product to have HoBs
set with standard ramifications (imagine a denorm product and a denorm
augend and you want the right answer.)

Any way you cook it, you have a 4× wide intermediate (minus 2 bits
IIRC): 4×112 = 448, and 448 - 2 = 446.

There is a reason these things are not standard at this point of
technology.

So this is basically due to the product part still being in carry-save
format, so it cannot easily be moved/aligned, instead the augend has to
be able to move to either side of it. OK, that makes sense!

Could you do it (IEEE accuracy) with less HW? Yes, but only if you
allow certain special cases to take more cycles in calculation. At a
certain point (a point made by Terje) it is easier to implement with
wide integer calculations 128+128 and/or 128*128 along with double
width shifts, inserts, and extracts.

IEEE did not make these things any easier by having a 2× std width
fraction have 2×+3 bits of length requiring 8 multiplications with
minimal HW instead of 4 multiplications. On the other hand IBM did us
no favors with Hex FP either (keeping the exponent size the same and
having 2×+8 bits of fraction.)

This is an intentional feature, not a bug!

By making sure that all ieee larger formats have a mantissa with at
least 2n+3 bits compared to the smaller format below, you avoid all
double rounding issues if you do a calculation in the larger format and
then immediately store it back to a smaller format container.
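
A small C sketch of that property for Binary32 and Binary64: Binary64 stores 52 fraction bits, which is at least 2×23+3 = 49, so adding in double and then rounding to float should agree with adding directly in float. The function names are made up for this example, and the volatile store is there only to force rounding to float on compilers that evaluate in a wider format.

```c
/* Add two floats directly in Binary32. The volatile store forces the
 * result to be rounded to float even if the compiler evaluates the
 * expression in a wider format. */
float add_in_float(float a, float b)
{
    volatile float r = a + b;
    return r;
}

/* Add in Binary64, then round a second time to Binary32. */
float add_via_double(float a, float b)
{
    double wide = (double)a + (double)b;
    return (float)wide;
}
```

A halfway case such as 16777216.0f + 1.0f is a useful check: both paths should return 16777216.0f under the default round-to-nearest-even mode.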

By also having a wider exponent you can do things like sqrt(x^2+y^2) and
completely avoid spurious overflows during the squaring ops: As long as
the final result fits in float, it will be the correct result.

We started out with 1:8:23 and 1:11:52, then we got 1:15:112 at the
higher end and 1:5:10 for fp16 and 1:3:4 for fp8.

Do note that the 8 and 16-bit variants do break the 2n+3 rule, also note
that the AI training people like truncated 32-bit, i.e. 1:8:7 which
keeps the full float range but with ~1/3 the mantissa resolution.

Anyway, doing fp128 in SW I would of course do it using u64 unsigned
integer ops: FMUL128 becomes 4 64x64->128 MUL ops plus the
adding/merging of the terms and a bunch of book keeping work on the
signs and exponents.
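
The multiply step described above can be sketched in C as follows. This covers only the 128x128->256 bit mantissa product built from four 64x64->128 multiplies; sign, exponent, normalization and rounding are left out. The function name and limb layout are assumptions for this example, and unsigned __int128 is a GCC/Clang extension rather than standard C.

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension */

/* Multiply two 128-bit values given as (hi, lo) 64-bit halves.
 * The 256-bit product is written to r[0..3], least significant limb
 * first. Uses exactly four 64x64->128 multiplies. */
void mul_128x128(uint64_t a_hi, uint64_t a_lo,
                 uint64_t b_hi, uint64_t b_lo,
                 uint64_t r[4])
{
    u128 ll = (u128)a_lo * b_lo;
    u128 lh = (u128)a_lo * b_hi;
    u128 hl = (u128)a_hi * b_lo;
    u128 hh = (u128)a_hi * b_hi;

    r[0] = (uint64_t)ll;

    /* Second limb: high half of ll plus the low halves of both cross
     * terms. Three 64-bit values cannot overflow 128 bits. */
    u128 mid = (ll >> 64) + (uint64_t)lh + (uint64_t)hl;
    r[1] = (uint64_t)mid;

    /* Upper two limbs: hh plus the high halves of the cross terms and
     * the carry out of the second limb. The true product is below
     * 2^256, so this sum fits in 128 bits. */
    u128 top = hh + (lh >> 64) + (hl >> 64) + (mid >> 64);
    r[2] = (uint64_t)top;
    r[3] = (uint64_t)(top >> 64);
}
```

For Binary128 the inputs would be the 113-bit significands (hidden bit included), so the product occupies at most 226 bits of the 256-bit result.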

With a single fully pipelined integer multiplier taking 4 cycles, this
would be 7 cycles for the MULs, with the last three cycles overlapped
with the initial ADD/ADC operations. Seems like it could be doable in
sub-20 cycles?

I'm assuming the CPU to be wide enough that the special cases can be
handled in parallel with the default/normal inputs case, also assuming
reg-reg MOVes to be zero cycles, handled in the renamer, in order to
overcome the dedicated register (RDX) issue which we have retained even
using MULX.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From EricP@21:1/5 to Chris M. Thomasson on Fri May 24 09:43:19 2024

Chris M. Thomasson wrote:

On 5/20/2024 8:44 AM, EricP wrote:

Chris M. Thomasson wrote:

On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:

On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:

On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
[...]

I remember a little test that Microsoft made wrt 50,000 concurrent
OVERLAPPED ops in IOCP vs an event driven model actually creating
a windows event per connection multiplexing WFMO in several
threads. The event model did not perform as well, but it did not
do too bad either. I wonder if I can still find that paper. Back
in 2002 or something. Hard to remember right now.

I am having trouble finding it. I do remember:

https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN

I just found an old post from me back in 2003 with a link to the paper:
___________________
You can get 50,000+ concurrent connections using IOCP, check out the
following link:

http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128

You do have to do some memory management to get there, like posting zero
byte receives to ensure that pending recvs don't lock their buffers, you
can also restrict the amount of pending sends the server has all together
[...]
___________________

The way back machine found it, I think!

https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128

Nice!

Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).

I never tried APC wrt IO and a socket server on windows before! I have created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.

APC's are I/O completion callback subroutines with 1 to 3 arguments.
I use them to build callback driven state machines for each network
I/O channel, similar to device drivers. Each network channel requires
only a small amount of user mode memory, and all server network connections
can be serviced by a single thread or a small fixed pool of comms threads.
This keeps the cost for each new connection to mostly just the kernel's
network memory.

WinNT originally only had APC's. It inherited the concept from
VMS's Asynchronous System Trap (AST), which inherited the concept
from RSX-11 on PDP-11.

The difference between Windows APC and RSX/VMS AST is how they are
delivered. Despite the name, Windows user mode APC's are NOT delivered to
the thread asynchronously as interrupts but only at specified delivery
points, which means user mode APC's are essentially a synchronous polled
delivery, whereas VMS AST's are delivered at any time using interrupt
semantics. (Windows does have real asynchronous-delivery APC's but inside
the kernel, where they are used to interrupt or wake up a thread for I/O
completion and various other things.)

User mode APC's are simpler from a user mode programming point of view
than VMS's AST's, but because APC's have a polled delivery you can't use
user mode APC's to interrupt and force a thread to do something, as AST's
can.
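
To illustrate the polled delivery described above, here is a minimal Python sketch. It is a simulation, not Windows code: `ToyThread`, `queue_apc`, `do_work` and `alertable_wait` are hypothetical names. The only point is that a queued callback does not run until the thread reaches a delivery point, which is roughly how user mode APC's behave with an alertable wait.

```python
from collections import deque


class ToyThread:
    """Simulated thread with a queue of pending user mode APC's."""

    def __init__(self):
        self.pending = deque()   # queued (callback, args) pairs
        self.log = []            # order in which things actually ran

    def queue_apc(self, callback, *args):
        # Queuing never runs the callback; it only records it.
        self.pending.append((callback, args))

    def do_work(self, label):
        # Ordinary code: queued APC's cannot interrupt it.
        self.log.append(label)

    def alertable_wait(self):
        # The delivery point: drain everything queued so far.
        delivered = 0
        while self.pending:
            callback, args = self.pending.popleft()
            callback(*args)
            delivered += 1
        return delivered


t = ToyThread()
t.queue_apc(lambda n: t.log.append(f"io_done({n} bytes)"), 512)
t.do_work("step 1")
t.do_work("step 2")
delivered = t.alertable_wait()
```

The completion callback was queued before "step 1" but only ran after "step 2", when the thread chose to wait. An AST-style delivery would instead have been free to run it between any two steps.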


From MitchAlsup1@21:1/5 to BGB-Alt on Fri May 24 23:29:13 2024

BGB-Alt wrote:

Don't you EVER snip useless crap off the top of your posts ??

In my case, BGBCC supports __int128 operations whether or not the ALUX instructions are enabled (along with _BitInt, *1).

I am guessing that your CPU does not do 128, 256, 384 bit calculations natively, but the compiler supports those by emitting sequences of instructions.

*1:
_BitInt(56) x0; //maps to 64-bit
_BitInt(64) x1; //maps to 64-bit
_BitInt(80) x2; //maps to 128-bit
_BitInt(128) x3; //maps to 128-bit
_BitInt(160) x4; //maps to 256-bit
_BitInt(256) x5; //maps to 256-bit
_BitInt(272) x6; //maps to 384-bit
...
All sizes beyond 256 bit mapping to the next integer multiple of 128
bits. The 256-bit type is special, in that it has its own dedicated
logic, but exists via the _BitInt type. For 384 and beyond, generic
logic is used that deals with any size value, but is slower.

Can note that in my implementation, BitInt does not enforce modulo
behavior in the case of overflow (it is modulo only to the size of the container; enforcing odd-bit modulo behavior would add a fair bit of
cost to using them).
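
The container mapping and the overflow behaviour described above can be sketched in a few lines of Python. This is only an illustration of the stated rules, not BGBCC code: the function names are hypothetical, only unsigned values are modelled, and the treatment of widths below 56 is an assumption, since the list above does not show them.

```python
def container_bits(n):
    """Container width for _BitInt(n), following the list above."""
    if n <= 64:
        return 64    # assumption: widths below 56 are not shown in the list
    if n <= 128:
        return 128
    if n <= 256:
        return 256
    return -(-n // 128) * 128   # round up to the next multiple of 128


def add_container_wrap(a, b, n):
    """Unsigned add that wraps only at the container size."""
    return (a + b) % (1 << container_bits(n))


def add_strict_wrap(a, b, n):
    """Unsigned add that wraps at exactly n bits (the costlier option)."""
    return (a + b) % (1 << n)


widths = [56, 64, 80, 128, 160, 256, 272]
containers = [container_bits(n) for n in widths]

max56 = (1 << 56) - 1
loose = add_container_wrap(max56, 1, 56)   # bit 56 survives in the 64-bit container
strict = add_strict_wrap(max56, 1, 56)     # wraps to zero
```

With container-sized wrapping, incrementing the largest 56-bit value leaves bit 56 set; enforcing the declared width would need an extra masking step after each operation.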

The multi-precision arithmetic in My 66000 supports rather arbitrary
width calculations, although only those 256-bits and smaller can be
considered fast and/or efficient. In addition, both signed and
unsigned multi-precision arithmetic is available.
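
As a generic illustration of multi-precision arithmetic built from 64-bit pieces, here is a small Python sketch of limb-wise addition with carry propagation. It is not a description of how My 66000 implements it; the helper names are hypothetical and only unsigned values are handled.

```python
MASK64 = (1 << 64) - 1


def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, low limb first."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Reassemble an integer from 64-bit limbs, low limb first."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def add_limbs(a, b):
    """Add two equal-length limb lists; return (sum_limbs, carry_out)."""
    carry = 0
    out = []
    for x, y in zip(a, b):
        total = x + y + carry
        out.append(total & MASK64)   # keep the low 64 bits in this limb
        carry = total >> 64          # pass the overflow to the next limb
    return out, carry


# 256-bit example: all-ones plus one ripples a carry through every limb.
limbs, carry = add_limbs(to_limbs((1 << 256) - 1, 4), to_limbs(1, 4))
```

Each extra 64 bits of width adds one more step to the carry chain, which is one reason wider calculations become progressively less efficient.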

Terje


From George Neuner@21:1/5 to [email protected] on Sun May 26 16:51:31 2024

On Fri, 24 May 2024 09:43:19 -0400, EricP
<[email protected]> wrote:

Chris M. Thomasson wrote:

On 5/20/2024 8:44 AM, EricP wrote:

Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).

I never tried APC wrt IO and a socket server on windows before! I have
created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.

APC's are I/O completion callback subroutines with 1 to 3 arguments.
I use them to build callback driven state machines for each network
I/O channel, similar to device drivers. Each network channel requires
only a small amount of user mode memory, and all server network connections
can be serviced by a single thread or a small fixed pool of comms threads.
This keeps the cost for each new connection to mostly just the kernel's
network memory.

WinNT originally only had APC's. It inherited the concept from
VMS's Asynchronous System Trap (AST), which inherited the concept
from RSX-11 on PDP-11.

I can't speak to "originally" as I never used NT3.x, but NT4.x allowed asynchronous I/O calls to signal events on completion (or failure). I
used events with WaitForMultipleObjects [*] to mix file and socket
operations in single-thread servers.

[*] like select() or poll() in Unix. For a long time the Windows
"WaitFor..." calls could NOT directly monitor sockets (sockets were
not files), but they could monitor user events, and both the
file and socket APIs supported using completion events.

APCs might have been more efficient, but I only ever used them in
conjunction with threads - I never tried to write a single-thread
server that performed operations on multiple files or sockets using
only APC.

The difference between Windows APC and RSX/VMS AST is how they are
delivered. Despite the name, Windows user mode APC's are NOT delivered to
the thread asynchronously as interrupts but only at specified delivery
points, which means user mode APC's are essentially a synchronous polled
delivery, whereas VMS AST's are delivered at any time using interrupt
semantics. (Windows does have real asynchronous-delivery APC's but inside
the kernel, where they are used to interrupt or wake up a thread for I/O
completion and various other things.)

User mode APC's are simpler from a user mode programming point of view
than VMS's AST's, but because APC's have a polled delivery you can't use
user mode APC's to interrupt and force a thread to do something, as AST's
can.


From EricP@21:1/5 to George Neuner on Mon May 27 10:46:30 2024

George Neuner wrote:

On Fri, 24 May 2024 09:43:19 -0400, EricP
<[email protected]> wrote:

Chris M. Thomasson wrote:

On 5/20/2024 8:44 AM, EricP wrote:

Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).

I never tried APC wrt IO and a socket server on windows before! I have
created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.

APC's are I/O completion callback subroutines with 1 to 3 arguments.
I use them to build callback driven state machines for each network
I/O channel, similar to device drivers. Each network channel requires
only a small amount of user mode memory, and all server network connections
can be serviced by a single thread or a small fixed pool of comms threads.
This keeps the cost for each new connection to mostly just the kernel's
network memory.

WinNT originally only had APC's. It inherited the concept from
VMS's Asynchronous System Trap (AST), which inherited the concept
from RSX-11 on PDP-11.

I can't speak to "originally" as I never used NT3.x, but NT4.x allowed asynchronous I/O calls to signal events on completion (or failure). I
used events with WaitForMultipleObjects [*] to mix file and socket
operations in single-thread servers.

Since NT 3.1 IO completion was indicated by either an event flag set or APC. Internally it uses a kernel mode APC to wake up the thread,
which then cleans up after the IO and the last thing that APC
does is either repost itself to the thread as a user mode APC
or it sets the requested event flag and deletes itself.
Later they added IO Completion Ports.

But often layered software packages didn't support this which is
why I make direct Windows OS calls whenever possible.

[*] like select() or poll() in Unix. For a long time the Windows
"WaitFor..." calls could NOT directly monitor sockets (sockets were
not files), but they could monitor user events, and both the
file and socket APIs supported using completion events.

The problem with the event approach is that WaitForXxx only allows up to
64 wait objects, and you would need 2 events per network connection,
one for send and one for receive. That WaitFor limit in turn forces the
thread-per-client model.
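
The arithmetic behind that limit can be made concrete with a short Python sketch. The 64-object limit and the two events per connection come from the paragraph above; the function names are hypothetical, and the 50,000 figure is simply the connection count mentioned earlier in the thread. Any extra event per thread (for example a shutdown signal) is ignored here and would lower the numbers further.

```python
import math

MAXIMUM_WAIT_OBJECTS = 64    # per-call limit of the WaitFor... functions
EVENTS_PER_CONNECTION = 2    # one for send, one for receive


def connections_per_wait():
    """Connections a single WaitFor... call can cover."""
    return MAXIMUM_WAIT_OBJECTS // EVENTS_PER_CONNECTION


def threads_needed(connections):
    """Waiting threads required if each thread blocks in one WaitFor... call."""
    return math.ceil(connections / connections_per_wait())


per_thread = connections_per_wait()
for_50k = threads_needed(50_000)
```

So each waiting thread covers at most 32 connections, and 50,000 connections would need well over a thousand threads that do nothing but wait.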

Whereas I want a server to wait for arbitrary numbers of clients IO's, hundreds, thousands, tens of thousands..., as many as kernel memory allows.

That's not to say that completion routines don't have their idiosyncrasies. Like everything in Windows land, you have to discover these.

I originally used named pipes between Windows machines.
But if I was using sockets I would have used direct calls to WSA
like WSARecv and WSASend which do support completion routines,
and not used the standard socket interface libraries.

https://learn.microsoft.com/en-us/windows/win32/winsock/winsock-functions
https://learn.microsoft.com/en-us/windows/win32/api/Winsock2/nf-winsock2-wsarecv

None of my Windows code will ever be ported to another platform so compatibility with *nix is irrelevant and the Linux AIO functions
are completely different anyway.

APCs might have been more efficient, but I only ever used them in
conjunction with threads - I never tried to write a single-thread
server that performed operations on multiple files or sockets using
only APC.

I don't use them for execution speed efficiency, I use them for the
internal code structure they allow and to minimize kernel resource usage.

One still needs a pool of worker threads to deal with things like
CreateFile which does not support async file open and waits the
calling thread until finished, which would be disaster for a server
thread and can hang a client GUI interface too.

Also there are functions like WSAAccept or closesocket which don't support completion routines and look like they can potentially block/hang so you
can't do everything through completion callbacks.


From EricP@21:1/5 to Chris M. Thomasson on Sun Jun 2 10:49:23 2024

Chris M. Thomasson wrote:

It's a bit difficult for me to remember right now, but did you use
Kernel APC's? Iirc, the pthread-win32 lib used kernel APC's to help
emulate breaking into a thread at any time. Think of PTHREAD_CANCEL_ASYNCHRONOUS. Here is the lib:

https://sourceware.org/pthreads-win32/

Way back, I used this lib all the time. However, I never used pthread cancellation. Just never liked it.

[...]

No, I used the normal alertable (synchronous) user mode APC's.

At some point MS added real asynchronous user mode APC's to the kernel,
but for some bizarre reason they hobbled it so only they could use them.

I just noticed now that Windows 11 has added a function to allow
general posting of asynchronous user mode APC's.

In Win11, see Special user-mode APCs in QueueUserAPC2 function
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc2

and it seems that it is used to perform pthread_cancel.

https://repnz.github.io/posts/apc/user-apc/#ntqueueapcthreadex2-some-new-friends-in-the-fast-ring

When I got the first beta about 30 years ago, I noticed this missing
functionality so I made my own using SuspendThread, GetThreadContext,
SetThreadContext, and ResumeThread, which edits the thread context
and its stack to force a subroutine call onto one of my threads' stacks.
Very dangerous because pretty much none of the Win32 code was fully reentrant.

Looking at pThreads thread_cancel you can see them doing something similar
in ptw32_RegisterCancelation if QueueUserAPCEx routine is not available.

ftp://sourceware.org/pub/pthreads-win32/sources/pthreads-w32-2-9-1-release/pthread_cancel.c


From Lawrence D'Oliveiro@21:1/5 to All on Mon Jun 10 02:46:16 2024

On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.

I think it was a particular version of the old Mac OS, from around 1990 or
so, that implemented a really amazing hack. Some 32-bit machines had
hardware floating-point, others didn’t. So developers of numerics-
intensive apps had to build two versions of their code, one with the floating-point instructions, the other with calls to Apple’s SANE library.

The hack involved running code built to use hardware floating-point instructions, on hardware that didn’t have them. The instructions were of course trapped and emulated. But more than that, the system would patch
the instruction that caused the trap, turning it into a direct call into
the emulation routine. So after the first execution, each such instruction would run much faster. Until the code got unloaded from RAM and the patch
was lost, of course.


From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Mon Jun 10 08:07:28 2024

Lawrence D'Oliveiro wrote:

On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.

I think it was a particular version of the old Mac OS, from around 1990 or
so, that implemented a really amazing hack. Some 32-bit machines had
hardware floating-point, others didn’t. So developers of
numerics-intensive apps had to build two versions of their code, one with
the floating-point instructions, the other with calls to Apple’s SANE library.

The hack involved running code built to use hardware floating-point
instructions, on hardware that didn’t have them. The instructions were of
course trapped and emulated. But more than that, the system would patch
the instruction that caused the trap, turning it into a direct call into
the emulation routine. So after the first execution, each such instruction would run much faster. Until the code got unloaded from RAM and the patch
was lost, of course.

This only works when each FP instruction is at least as long as a
function call. This particular approach was standard on PCs more or less
from the very beginning (i.e. 1981++):

You could build applications with direct 8087 instructions, with pure sw
emulation via CALL FDIV_emulation etc, or in a mode where each emitted
hw fp instruction was followed by enough NOPs to make the total length
at least 5 bytes: This way the missing HW trap handler could patch them
into CALLs (possibly followed by one or more NOPs if the HW opcode was
very long) instead.

Since all those 8087 instructions were _very_ slow (30-300 clock
cycles?), executing an extra NOP or two made no discernible difference.
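
The pad-and-patch scheme can be sketched with a toy byte-coded machine in Python. Everything here is invented for illustration: the opcode values, the 5-byte CALL layout and the `Machine` class are assumptions, not real 8086/8087 encodings or any actual emulator library. The sketch only shows that each padded FP site traps once and is a plain CALL afterwards.

```python
# Toy byte-coded machine. Opcode values and the 5-byte CALL layout are
# invented for this sketch; they are not real 8086/8087 encodings.
NOP, HALT, CALL, FPINC = 0x01, 0x02, 0x03, 0x04
CALL_LEN = 5                      # opcode byte + 4-byte handler index


def emit_fp(opcode):
    """Emit an FP opcode followed by NOPs so the site is CALL_LEN bytes."""
    return bytes([opcode]) + bytes([NOP]) * (CALL_LEN - 1)


class Machine:
    def __init__(self, code, has_fpu):
        self.code = bytearray(code)
        self.has_fpu = has_fpu
        self.acc = 0.0
        self.traps = 0
        self.handlers = [self.emulate_fpinc]   # index 0 = FPINC emulation

    def emulate_fpinc(self):
        self.acc += 1.0           # software stand-in for the FP operation

    def run(self):
        pc = 0
        while self.code[pc] != HALT:
            op = self.code[pc]
            if op == NOP:
                pc += 1
            elif op == CALL:
                index = int.from_bytes(self.code[pc + 1:pc + CALL_LEN], "little")
                self.handlers[index]()
                pc += CALL_LEN
            elif op == FPINC and self.has_fpu:
                self.acc += 1.0   # "hardware" executes it directly
                pc += 1
            elif op == FPINC:
                # Missing-hardware trap: overwrite the padded site with a
                # CALL, then loop without advancing so the CALL runs now.
                self.traps += 1
                self.code[pc:pc + CALL_LEN] = bytes([CALL]) + (0).to_bytes(4, "little")
            else:
                raise ValueError(f"unknown opcode {op:#x}")


program = emit_fp(FPINC) + emit_fp(FPINC) + bytes([HALT])

soft = Machine(program, has_fpu=False)
for _ in range(3):
    soft.run()                    # only the first pass traps

hard = Machine(program, has_fpu=True)
hard.run()                        # never traps, just skips the NOP padding
```

On the machine without the "FPU", two traps occur in total across three passes, one per FP site. On the machine with it, the code is never modified and the only cost is the padding NOPs.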

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Niklas Holsti@21:1/5 to Terje Mathisen on Mon Jun 10 11:20:14 2024

On 2024-06-10 9:07, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:

Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.

I think it was a particular version of the old Mac OS, from around
1990 or so, that implemented a really amazing hack. Some 32-bit machines
had hardware floating-point, others didn’t. So developers of
numerics-intensive apps had to build two versions of their code, one with
the floating-point instructions, the other with calls to Apple’s SANE
library.

The hack involved running code built to use hardware floating-point
instructions, on hardware that didn’t have them. The instructions were of
course trapped and emulated. But more than that, the system would patch
the instruction that caused the trap, turning it into a direct call into
the emulation routine. So after the first execution, each such
instruction would run much faster. Until the code got unloaded from RAM
and the patch was lost, of course.

This only works when each FP instruction is at least as long as a
function call. This particular approach was standard on PCs more or less
from the very beginning (i.e. 1981++):

I believe that the same approach (trap and patch) was used in the HP
2100 computers that I used in the early 1980's, which were designed in
the 1960's. I don't think any NOPs were needed to match instruction
lengths for these machines.

You can also do it the other way around: always compile a function call,
but on a machine that has an FPU use a dummy emulation library that back-patches the call to become an FPU instruction, so that each
emulation function is called at most once.
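
The reverse direction can be sketched the same way. In this Python toy, the program is a list of mutable instruction sites that all start out as calls into a floating-point library; on a machine with an FPU the library rewrites the calling site into a direct instruction. The instruction names (`CALL_FADD`, `FADD`) and the `Machine` class are hypothetical, and nothing here reflects how the HP 2100 or any real system encoded it.

```python
class Machine:
    """Toy machine whose program is a list of mutable instruction sites."""

    def __init__(self, program, has_fpu):
        self.program = [list(site) for site in program]
        self.has_fpu = has_fpu
        self.acc = 0.0
        self.library_calls = 0

    def fadd_library(self, site_index, operand):
        """Library entry point reached through a compiled call."""
        self.library_calls += 1
        if self.has_fpu:
            # Dummy library: rewrite the calling site into an FPU
            # instruction so this site never calls the library again.
            self.program[site_index] = ["FADD", operand]
        self.acc += operand       # still has to produce this call's result

    def run(self):
        for index, (op, operand) in enumerate(self.program):
            if op == "CALL_FADD":
                self.fadd_library(index, operand)
            elif op == "FADD":
                self.acc += operand   # executed directly by the "FPU"
            else:
                raise ValueError(op)


program = [("CALL_FADD", 1.5), ("CALL_FADD", 2.5)]

with_fpu = Machine(program, has_fpu=True)
without_fpu = Machine(program, has_fpu=False)
for _ in range(3):
    with_fpu.run()
    without_fpu.run()
```

Both machines compute the same result, but with an FPU the library is entered only once per call site, while without one every execution goes through the emulation call.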

To be honest, I'm not sure which way around the HP 2100 used.


From Lawrence D'Oliveiro@21:1/5 to Niklas Holsti on Tue Jun 11 05:58:21 2024

On Mon, 10 Jun 2024 11:20:14 +0300, Niklas Holsti wrote:

You can also do it the other way around: always compile a function call,
but on a machine that has an FPU use a dummy emulation library that back-patches the call to become an FPU instruction, so that each
emulation function is called at most once.

To be honest, I'm not sure which way around the HP 2100 used.

Hmm, maybe that was the way round it was done on the Mac as well, and I am misremembering.

