I've made another long-overdue change in the Concertina II
architecture on the page about 17-bit instructions.
Since I describe the individual instructions there, with their opcodes
and what they do, I've illustrated the floating-point formats of the architecture on that page.
The good people in charge of the IEEE 754 standard had seen fit to
define a standard 128-bit floating-point format which included a
hidden first bit.
This annoyed me greatly, because I was going to take the 8087's
temporary real format, and extend the mantissa for my 128-bit format.
I've decided that it's necessary to fully accept the 128-bit standard
and support it in a consistent manner.
Therefore, I have taken the following actions:
I have dropped the option of supporting 80-bit temporary reals
entirely, as they are now incompatible as an internal format.
I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.
And in addition, just as the IBM 704 used two single-precision floats
to make a double-precision float, and the IBM System/360 Model 85
started using two double-precision floats to make an extended
precision float... I've defined how the 256-bit internal format floats
can be doubled up to make a 512-bit float.
I'm not really sure such floating-point precision is useful, but I do remember some people telling me that higher float precision is indeed something to be desired. Well, the IEEE 754 standard has forced my
hand.
John Savard <[email protected]d> schrieb:
I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.
Why not the IEEE binary256 (interchange) format?
https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format
Now that means I need to make the exponent field in my internal
format larger, define a 512-bit floating point number which is in the
internal format, so that it can be unnormalized, and a 1024-bit
doubled-up float... instead of what I just did!
I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.
On Sun, 12 May 2024 13:46:28 -0000 (UTC), Thomas Koenig <[email protected]> wrote:
John Savard <[email protected]d> schrieb:
I have instead defined a 256-bit format for floats which does not have
a hidden first bit, which looks like the old temporary reals, except
that the exponent field is one bit wider.
Why not the IEEE binary256 (interchange) format?
https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format
Oh, drat. I had not realized that they had also defined this.
Now that means I need to make the exponent field in my internal
format larger, define a 512-bit floating point number which is in the internal format, so that it can be unnormalized, and a 1024-bit
doubled-up float... instead of what I just did!
The enlarged exponent field won't make the internal form of the
128-bit float go over 160 bits, so register allocation for that won't
change... but now I will have to figure out a scheme of register
allocation applicable to the 256-bit floats!
I am not amused.
John Savard
In article <[email protected]>, [email protected]d (John Savard) wrote:
I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.
I would be in favour of 128-bit being available.
I'm not sure my field
has need for 256- or 512-bit, but that doesn't mean that nobody has.
Question:: why are you all so gung-ho on having a format without a
hidden bit? It is trivially easy to reconstruct::
h = expon != 0;
This takes little time or even gates, and is something you HAVE to do anyway.
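To make the `h = expon != 0` point concrete, here is a minimal sketch that unpacks an IEEE 754 binary64 value and rebuilds the full significand with the hidden bit made explicit. The function name `decode_binary64` is only illustrative; the field widths (1 sign bit, 11 exponent bits, 52 fraction bits) are those of binary64, and the same test applies to wider formats.

```python
import struct

def decode_binary64(x: float):
    """Return (sign, biased exponent, 53-bit significand) of a binary64.

    The significand includes the hidden bit, reconstructed as
    h = (expon != 0), i.e. 1 for normal numbers, 0 for zero/subnormals.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    sign = bits >> 63
    expon = (bits >> 52) & 0x7FF          # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52 stored fraction bits
    h = int(expon != 0)                   # the hidden bit
    # Note: expon == 0x7FF (infinity/NaN) also yields h = 1 here; a real
    # unpacker would treat that encoding separately.
    significand = (h << 52) | fraction
    return sign, expon, significand

one = decode_binary64(1.0)
tiny = decode_binary64(5e-324)   # smallest positive subnormal
```

In hardware the comparison is just an OR-reduction of the exponent bits, which is why the cost is described as small.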
John Dallman <[email protected]> schrieb:
In article <[email protected]>, [email protected]d (John Savard) wrote:
I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.
I would be in favour of 128-bit being available.
Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of precision
bits, and double precision can sometimes run into trouble.
At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.
I'm not sure my field
has need for 256- or 512-bit, but that doesn't mean that nobody
has.
I've finally found the time to play around with Julia in the last
few weeks. One of the nice things it does is that you can just use
the same packages with different numerical types, for example for
ODE integration. Just set up the problem as you would normally
and supply a starting vector with a different precision.
So, for doing some experiments on numerical data types, Julia
is quite nice.
On Sun, 12 May 2024 20:55:03 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
John Dallman <[email protected]> schrieb:
In article <[email protected]>,
[email protected]d (John Savard) wrote:
I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired.
I would be in favour of 128-bit being available.
Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of precision
bits, and double precision can sometimes run into trouble.
At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.
Much slower?
I think, at least for matrix multiplication, my emulation on modern x86
was within factor of 1.5x from your measurements on POWER9.
Emulation via traps is very slow, but typical for many ISA's is to just quietly turn the soft-float operations into runtime calls.
On 5/13/2024 4:16 PM, MitchAlsup1 wrote:
BGB wrote:
Emulation via traps is very slow, but typical for many ISA's is to
just quietly turn the soft-float operations into runtime calls.
I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
On an x86 this would be at least 200 cycles just getting there and back.
I guess there are different possibilities here...
Trap cost can be reduced, say, by having banked registers.
But, not so good with explicit save/restore and a large register file.
For example, I can note that a MSP430 at 16MHz can service a 32kHz
timer... (with a budget of 488 cycles per interrupt).
But, my BJX2 core (at 50MHz) would have a harder time here, with around
a 1.5k cycle budget...
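The cycle budgets quoted above follow directly from dividing the CPU clock by the interrupt rate. A small sketch, assuming the "32kHz" timer is the usual 32768 Hz watch-crystal rate (an assumption, but it is the value that reproduces the 488-cycle figure):

```python
def cycles_per_interrupt(cpu_hz: float, interrupt_hz: float) -> float:
    """Cycles available between consecutive interrupts."""
    return cpu_hz / interrupt_hz

TIMER_HZ = 32768  # assumed: 32.768 kHz crystal rate, not stated in the post

msp430_budget = cycles_per_interrupt(16e6, TIMER_HZ)  # roughly 488 cycles
bjx2_budget = cycles_per_interrupt(50e6, TIMER_HZ)    # roughly 1.5k cycles

print(f"MSP430 @ 16 MHz: {msp430_budget:.0f} cycles per interrupt")
print(f"BJX2   @ 50 MHz: {bjx2_budget:.0f} cycles per interrupt")
```

The budget itself is larger on the faster core; the difficulty described comes from how much of it the register save/restore and cache misses consume.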
Then again, it is possible the per-interrupt overhead would go down
slightly, since most likely the ISR stack will still be in the L1 cache between interrupts (and save/restore overhead should drop to ~ 100
cycles in the absence of L1 misses).
MSP430 had a slight advantage here (besides fewer registers) in that L1 misses are not a thing (so, memory access has constant latency).
So, to revisit your statement::
Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.
Possible, but I would not expect trap overhead to be lower than runtime
call overhead...
Also (in my case):
Debugging is rather annoying in cases where bugs
appear/disappear/move around at random or with the slightest
perturbation...
But, given for the most part behavior is consistently buggy (and
manifesting in seemingly the same ways) between both the emulator and
Verilog implementation, this implies the causal factors are in software.
I guess in this case, either I figure it out, or will need to again go
back to cooperative scheduling. Seemingly, using preemptive scheduling
and virtual memory at the same time is particularly unstable (programs
tend to crash on startup or soon after).
Also I may need to rework how page-in/page-out is handled (and/or how IO
is handled in general) since if a page swap needs to happen while IO is
already in progress (such as a page-miss in the system-call process), at
present, the OS is dead in the water (one can't access the SDcard in the
middle of a different access to the SDcard).
On 5/13/2024 6:25 PM, MitchAlsup1 wrote:
Also (in my case):
Debugging is rather annoying in cases where bugs
appear/disappear/move around at random or with the slightest
perturbation...
You need better verification--Oh Wait ...
Not sure I understand what you mean by this.
Some of these bugs are behaving very similar to some bugs I was battling against a while ago (but never properly debugged, the bug just sort of seemingly disappeared).
Also I may need to rework how page-in/page-out is handled (and or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).
Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it also abstracts away the SDcard interface.
I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
On an x86 this would be at least 200 cycles just getting there and back.
BGB wrote:
Also I may need to rework how page-in/page-out is handled (and or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).
Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it also
abstracts away the SDcard interface.
With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page fault;
it is just handled and goes away.
[email protected] (MitchAlsup1) writes:
BGB wrote:
Also I may need to rework how page-in/page-out is handled (and
or how IO is handled in general) since if a page swap needs to
happen while IO is already in progress (such as a page-miss in
the system-call process), at present, the OS is dead in the
water (one can't access the SDcard in the middle of a different
access to the SDcard).
Having a HyperVisor helps a lot here, with HV taking the page
faults of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.
With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions
to-or-from the guest OS directly. With the PCIe PRI (Page Request Interface), the guest DMA target pages don't need to be pinned by the hypervisor; the I/O MMU will interrupt the hypervisor to make the
page present and pin it and the hardware will then do the DMA.
So, having a GuestOS in a position it cannot deal with another page
fault is no longer a hindrance:: GuestOS does not see that page
fault; it is just handled and goes away.
There are two levels of page faults - at the guest level, the
guest handles everything. When the hypervisor supports
multiplexing multiple guests on a core, it will only handle second
level translation table faults.
On Tue, 14 May 2024 13:49:09 GMT
[email protected] (Scott Lurndal) wrote:
[email protected] (MitchAlsup1) writes:
BGB wrote:
Also I may need to rework how page-in/page-out is handled (and
or how IO is handled in general) since if a page swap needs to
happen while IO is already in progress (such as a page-miss in
the system-call process), at present, the OS is dead in the
water (one can't access the SDcard in the middle of a different
access to the SDcard).
Having a HyperVisor helps a lot here, with HV taking the page
faults of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.
With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions
to-or-from the guest OS directly. With the PCIe PRI (Page Request
Interface), the guest DMA target pages don't need to be pinned by the
hypervisor; the I/O MMU will interrupt the hypervisor to make the
page present and pin it and the hardware will then do the DMA.
Sounds like it could be problematic from a real-time perspective.
When I design PCIe devices, I sometimes have a device-side FIFO
sufficient for 2-5 times the expected worst-case PCIe latency, i.e.
for 7-10 usec or so. In the scenario you describe, it could easily overflow
for an acquisition-type device or underflow for a player-type device.
Now, my devices are not intended to be plugged into a virtualized server,
but I'd think that I am not the only designer who chooses the size of FIFOs
by that sort of logic.
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?
R3000, and it was a hash table ~1MB in size.
[email protected] (MitchAlsup1) writes:
I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?
On an x86 this would be at least 200 cycles just getting there and back.
Which x86? 8086? 80186? 80286? These (maybe the 8088 and V20, too)
are the only implementations that deserve to be called x86. If you
mean some IA-32 or AMD64 implementations, which ones?
Anyway, let's see how this works for the U74 (a RISC-V implementation
which apparently uses trapping for unaligned loads); here we have a
10M iteration loop with a payload that performs one load per
iteration:
[fedora-starfive:~/nfstmp/gforth-riscv:104544] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye'
Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned to x x x ! x foo drop bye':
223805151 instructions:u # 0.70 insn per cycle
318131306 cycles:u
0.352533487 seconds time elapsed
0.257061000 seconds user
0.064265000 seconds sys
[fedora-starfive:~/nfstmp/gforth-riscv:104545] perf stat -e instructions -e cycles gforth-fast -e ': foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye'
Performance counter stats for 'gforth-fast -e : foo 10000000 0 do @ loop ; 0 value x here aligned 1+ to x x x ! x foo drop bye':
5329494415 instructions:u # 0.75 insn per cycle
7149481783 cycles:u
7.183239751 seconds time elapsed
7.082298000 seconds user
0.070121000 seconds sys
So the unaligned access handling results in 511 additional instructions
per load compared to an aligned access (so it obviously does the
handling using some kind of trapping). Each unaligned access results
in 683 additional cycles.
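The per-load figures come from subtracting the aligned run from the unaligned run and dividing by the 10M loop iterations. A short sketch using the `perf stat` counts quoted above (the variable names are just for illustration):

```python
ITERATIONS = 10_000_000  # loop count in the gforth one-liner

# (instructions:u, cycles:u) from the two perf stat runs above
aligned = {"instructions": 223_805_151, "cycles": 318_131_306}
unaligned = {"instructions": 5_329_494_415, "cycles": 7_149_481_783}

# Extra cost per load attributable to the unaligned-access handling
extra_insns = (unaligned["instructions"] - aligned["instructions"]) / ITERATIONS
extra_cycles = (unaligned["cycles"] - aligned["cycles"]) / ITERATIONS

print(f"extra instructions per unaligned load: {extra_insns:.1f}")
print(f"extra cycles per unaligned load:       {extra_cycles:.1f}")
```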
So better use the unspecified MIPS, right? However, if the
unspecified MIPS is an R2000, 19 cycles on a 12.5MHz R2000 cost
1.52us, whereas 683 cycles on a 1000MHz U74 cost 0.683us (and I have
heard that in the Visionfive V2 the U74 runs at 1500MHz).
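The wall-clock comparison is just cycles divided by clock frequency. A minimal sketch of that conversion, using the clock rates mentioned in the post (12.5 MHz for the R2000, 1000 MHz and the reported 1500 MHz for the U74):

```python
def cycles_to_microseconds(cycles: float, clock_mhz: float) -> float:
    """A clock of N MHz delivers N cycles per microsecond."""
    return cycles / clock_mhz

r2000_us = cycles_to_microseconds(19, 12.5)       # TLB refill on a 12.5 MHz R2000
u74_us_1000 = cycles_to_microseconds(683, 1000)   # unaligned load, U74 at 1000 MHz
u74_us_1500 = cycles_to_microseconds(683, 1500)   # same, at the reported 1500 MHz

print(f"R2000: {r2000_us:.3f} us, U74: {u74_us_1000:.3f} us "
      f"(or {u74_us_1500:.3f} us at 1500 MHz)")
```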
- anton
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
I recall that MIPS could emulate a TLB table walk in something like
19 cycles. That is:: a few cycles to get there, a hash table access,
a check, a TLB install, and a few cycles to get back.
Which MIPS? R2000? R10000? Something else? Was this an inverted page
table?
R3000, and it was a hash table ~1MB in size.
Which would have been a significant fraction (25%?) of the
total memory available on an R3k-based system in 1990.
Michael S <[email protected]> schrieb:
On Sun, 12 May 2024 20:55:03 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
John Dallman <[email protected]> schrieb:
In article <[email protected]>,
[email protected]d (John Savard) wrote:
I'm not really sure such floating-point precision is useful, but
I do remember some people telling me that higher float
precision is indeed something to be desired.
I would be in favour of 128-bit being available.
Me, too. Solving tricky linear systems, or obtaining derivatives
numerically (for example for Jacobians) eats up a _lot_ of
precision bits, and double precision can sometimes run into
trouble.
At least gcc and gfortran now support POWER's native 128-bit format
in hardware. On other systems, software emulation is used, which
is of course much slower.
Much slower?
I think, at least for matrix multiplication, my emulation on modern
x86 was within factor of 1.5x from your measurements on POWER9.
I don't remember the exact timing, and it might be interesting to
revisit that (also considering that the gfortran code for matmul is
not optimized for 128-bit float and might have blown cache sizes,
plus it would be fair to compare compiler vs. compiler and assembler
vs. assembler).
I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.
What can your code get on x86_64?
If you're building an SR-IOV device, you obviously need to build
it to support the required workloads.
On 12/05/2024 05:44, John Savard wrote:
I've made another long-overdue change in the Concertina II
architecture on the page about 17-bit instructions.
Since I describe the individual instructions there, with their
opcodes and what they do, I've illustrated the floating-point
formats of the architecture on that page.
The good people in charge of the IEEE 754 standard had seen fit to
define a standard 128-bit floating-point format which included a
hidden first bit.
This annoyed me greatly, because I was going to take the 8087's
temporary real format, and extend the mantissa for my 128-bit
format.
I've decided that it's necessary to fully accept the 128-bit
standard and support it in a consistent manner.
Therefore, I have taken the following actions:
I have dropped the option of supporting 80-bit temporary reals
entirely, as they are now incompatible as an internal format.
I have instead defined a 256-bit format for floats which does not
have a hidden first bit, which looks like the old temporary reals,
except that the exponent field is one bit wider.
And in addition, just as the IBM 704 used two single-precision
floats to make a double-precision float, and the IBM System/360
Model 85 started using two double-precision floats to make an
extended precision float... I've defined how the 256-bit internal
format floats can be doubled up to make a 512-bit float.
I'm not really sure such floating-point precision is useful, but I
do remember some people telling me that higher float precision is
indeed something to be desired. Well, the IEEE 754 standard has
forced my hand.
YES, I'd use something similar:
I never cared nor supported any odd 10 byte formats and I give a fart
to all these weird IEEE standards.
On Tue, 14 May 2024 15:57:59 GMT
[email protected] (Scott Lurndal) wrote:
If you're building an SR-IOV device, you obviously need to build
it to support the required workloads.
I am building PCIe device+driver that is unaware of SR-IOV.
I think that Mitch was operating under the same assumption in the post
to which you responded.
IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix multiplication benchmark running on a single POWER9 core.
I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3
I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.
So, the bottleneck is somewhere else. Maybe multiplication?
Michael S <[email protected]> schrieb:
IIRC, you reported something like 200 (or 300?) MFLOPS for your
matrix multiplication benchmark running on a single POWER9 core.
Just reran the tests, it gave me somewhere around 405-410 MFlops
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.
I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3
Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four. Not too bad, actually.
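The "factor of around four" can be checked by normalizing both results to MFLOPS per GHz. A rough sketch using the figures from this exchange (405 MFlops at 2.2 GHz on POWER9, about 150 MFLOPS at 3.6 GHz on EPYC3); this ignores all other differences between the two machines and benchmarks:

```python
def mflops_per_ghz(mflops: float, clock_ghz: float) -> float:
    """Throughput normalized by clock frequency."""
    return mflops / clock_ghz

power9 = mflops_per_ghz(405.0, 2.2)  # hardware binary128, gfortran matmul
epyc3 = mflops_per_ghz(150.0, 3.6)   # software __multf3/__addtf3 replacements

ratio = power9 / epyc3
print(f"POWER9: {power9:.1f} MFLOPS/GHz, EPYC3: {epyc3:.1f} MFLOPS/GHz, "
      f"ratio: {ratio:.2f}")
```

Taking the upper measurement of 410 MFlops instead moves the ratio only slightly.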
[..]
I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.
So, the bottleneck is somewhere else. Maybe multiplication?
I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle, a big, fat instruction,
obviously. On POWER10, this actually got worse, with performance
dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
apparently somebody didn't think it was all that important,
apparently :-(
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
BGB wrote:
Also I may need to rework how page-in/page-out is handled (and/or how
IO is handled in general) since if a page swap needs to happen while
IO is already in progress (such as a page-miss in the system-call
process), at present, the OS is dead in the water (one can't access
the SDcard in the middle of a different access to the SDcard).
Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it also
abstracts away the SDcard interface.
With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the
I/O device hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.
This was something I was not aware of but probably should have anticipated.
GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.
This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.
I guess I should have anticipated this. Sorry !!
MitchAlsup1 wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
BGB wrote:
Also I may need to rework how page-in/page-out is handled (and/or
how IO is handled in general) since if a page swap needs to happen
while IO is already in progress (such as a page-miss in the
system-call process), at present, the OS is dead in the water (one
can't access the SDcard in the middle of a different access to the
SDcard).
Having a HyperVisor helps a lot here, with HV taking the page faults
of the OS page fault handler.
Seems like adding another layer couldn't help with this, unless it
also abstracts away the SDcard interface.
With a HV, GuestOS does not "do" IO; it paravirtualizes it via HV.
Actually, that's not completely accurate. With PCI Express SR-IOV,
an I/O MMU and hardware I/O virtualization, the guest accesses the I/O
device
hardware directly and initiates DMA transactions to-or-from the
guest OS directly. With the PCIe PRI (Page Request Interface), the
guest DMA target pages don't need to be pinned by the hypervisor; the
I/O MMU will interrupt the hypervisor to make the page present
and pin it and the hardware will then do the DMA.
This was something I was not aware of but probably should have anticipated.
GuestOS initiates an I/O request (command) using a virtual function.
Rather than going through a bunch of activities to verify the user
owns the page and it is present, GuestOS just launches request and
then the I/O device page faults and pins the required page (if it is
not already so)--much like the page fault volcano when a new process
begins running:: page faulting in .text, the stack, and data pages
as they get touched.
This way, GuestOS simply considers all pages in its "portfolio" to be
present in memory, and HV does the heavy lifting and page virtualization.
I guess I should have anticipated this. Sorry !!
The reason OSes pin the pages before the IO starts is so there is no
latency reading in from a device, which then has to buffer the input.
An HDD seek averages about 9 ms; add 3 ms for the page fault code.
A 100 Mb/s Ethernet can receive about 10 MB/s, or 10 kB/ms, = 120 kB in 12 ms.
What would likely happen is the Ethernet card buffer would fill up
then it starts tossing packets, while it waits for HV to page fault
the receive buffer in from its page file. Later when the guest OS
buffer has faulted in and the card's buffer is emptied, the network
software will eventually NAK all the tossed packets and they get resent.
So there is a stutter every time the HV recycles that guest OS memory
that requires retransmissions to fix. And this is basically using the
sender's memory to buffer the transmission while the HV page faults.
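The arithmetic above can be sketched in a few lines of Python. This is only a back-of-the-envelope model: the 9 ms seek, 3 ms fault-handling time and 10 kB/ms receive rate are the figures from the text, while the 64 kB on-card buffer size is a made-up value chosen purely for illustration.

```python
def stall_ms(seek_ms, fault_code_ms):
    """Time the device waits while the HV pages the target buffer in."""
    return seek_ms + fault_code_ms

def backlog_kb(rate_kb_per_ms, stall):
    """Data that arrives from the wire during the stall."""
    return rate_kb_per_ms * stall

def dropped_kb(backlog, card_buffer_kb):
    """Data the card has to toss once its own buffer is full."""
    return max(0, backlog - card_buffer_kb)

stall = stall_ms(9, 3)            # figures from the text
backlog = backlog_kb(10, stall)   # 10 kB/ms, the rounded 100 Mb/s rate
# 64 kB is a hypothetical card buffer size, not a figure from the thread.
print(stall, backlog, dropped_kb(backlog, 64))
```

Anything beyond the card's own buffer has to be retransmitted by the sender, which is the stutter described above.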
Note there are devices, like A to D converters which cannot fix the
tossed data by asking for a retransmission. Or devices like tape drives
which can rewind and reread but are verrry slow about it.
I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.
EricP wrote:
I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.
So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.
Does this mean the application has lost a level of indirection in order
to have become virtualized ?????
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
Scott Lurndal wrote:
I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.
So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.
Does this mean the application has lost a level of indirection in order
to have become virtualized ?????
I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.
On 5/17/2024 12:26 PM, EricP wrote:
MitchAlsup1 wrote:
So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.
Does this mean the application has lost a level of indirection in order
to have become virtualized ?????
I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.
For some reason this made me think about getting a blue screen of death
due to too much non-paged memory being used by too many concurrent
overlapped IO's on Windows.
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
Scott Lurndal wrote:
I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.
So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.
Does this mean the application has lost a level of indirection in order
to have become virtualized ?????
I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.
Most of the user's files are on the local system and SR-IOV works fine, but
one or more of his files exist on a remote machine accessed over the internet;
and the user still uses the SR-IOV interface to access those files.
How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file?
For example, the user opens a file which is an ln -s (a block containing a
URL to where the file is remotely stored) but the user thinks the file is local.
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
Scott Lurndal wrote:
I would want an option in this SR-IOV mechanism for the guest app to
tell the guest OS to tell the HV to pin the buffer before starting IO.
So, what happens if GuestOS thinks the user file is located on a local
SATA drive, but it is really across some network ?? This works when
devices are not virtualized since the request is routed to a different
system where the file is local, accessed and data returned over the
network.
Does this mean the application has lost a level of indirection in order
to have become virtualized ?????
I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.
Most of the user's files are on the local system and SR-IOV works fine, but
one or more of his files exist on a remote machine accessed over the internet;
and the user still uses the SR-IOV interface to access those files.
If by "works fine" you mean it is slower and has more overhead than
just pinning the pages first, as DMA I/O does now.
(It's more work to initiate the I/O, fail when it attempts to DMA,
interrupt the CPU, run an ISR which queues a DPC which queues an APC back to
the thread, which pins the pages, then restarts the I/O,
than it is to just pin the pages and start an I/O.)
How does the system provide the "file is local" illusion to a user having
SR-IOV access to a non-local file.
For example, user opens a file which is an ln-s (a block containing a
URL to where the file is remotely stored) but user thinks file is local.
I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?
That link is traditionally established at file open inside the kernel file
system by cross-linking between a File Control Block (or whatever it's called)
and a Network Control Block representing that file on the network.
Each file syscall request is sent to the FCB then forwarded to the NCB
and out over a network link.
As I understand it, SR-IOV is a pseudo hardware device control *register*,
whereas a disk file is a fictional logical device created by the file
system driver. I don't think one could use SR-IOV to send commands to
local file system software (maybe it could trap into the OS).
One could have *disk controller registers* attached by SR-IOV,
but a disk controller is not a file system.
So as I understand the SR-IOV mechanism, one would not be reading
local or remote files over it under any circumstance.
But my understanding is limited to what the Microsoft driver
documentation says about it.
On 5/18/2024 6:40 AM, EricP wrote:
Chris M. Thomasson wrote:
On 5/17/2024 12:26 PM, EricP wrote:
I don't understand your question.
My comment was about the consequences of not pinning buffer pages
before starting an I/O. If those pages were for a mapped file stored
on a network device it won't be different.
For some reason this made me think about getting a blue screen of
death due to too much non-paged memory being used by too many
concurrent overlapped IO's on Windows.
That shouldn't happen as Windows tracks each process's non-paged pool
allocations and quotas and it should return an error when exceeded,
though I've never stress tested it.
I have, wrt NT 4.0 back in the day. It can get to a point where the
system is totally unresponsive. Then, sometimes, dies. A shit load of
concurrent overlapped io ops, malloc tends to return NULL, then the
non-paged memory gets really bad...
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast; I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
I.e. your actual running frequency was 3700 MHz?
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
There exists a middle ground between non-pipelined and fully pipelined
multiplier/FMA units. In fact, more than one middle ground.
Here is the mid-middle ground that I can imagine, not being a real hardware
guy:
1 - take a pair of existing VSU multipliers. By now they can do
53x53=>106bit unsigned multiplication. Enhance them to 57x57=>114bit
2 - during quad-precision FMA split 113x113 multiplication into 4
pieces and run them through the pair of multipliers, two at a time.
That would produce all parts of the 226-bit product at a rate of 1 product
per 2 clocks
3 - build adders just sufficient for the same throughput of 1 result
per 2 clocks.
Such a combined multiplier will have 2 clocks higher latency than the DP
multiplier.
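As a rough software model of steps 1 to 3, the sketch below splits two 113-bit significands into a 57-bit low limb and a 56-bit high limb, forms the four partial products, and recombines them. The function names, and the particular pairing of partial products per clock, are assumptions made for illustration; the thread does not specify which two pieces go through the multipliers together.

```python
LIMB_BITS = 57
MASK = (1 << LIMB_BITS) - 1

def split(x):
    """Split a 113-bit significand into (high 56 bits, low 57 bits)."""
    assert 0 <= x < (1 << 113)
    return x >> LIMB_BITS, x & MASK

def mul57(a, b):
    """Stand-in for one 57x57 hardware multiplier (result fits in 114 bits)."""
    assert 0 <= a <= MASK and 0 <= b <= MASK
    return a * b

def mul113(a, b):
    """113x113 multiply built from four 57x57 partial products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    # "Clock 1" (assumed pairing): the two non-overlapping pieces.
    lo_lo = mul57(a_lo, b_lo)
    hi_hi = mul57(a_hi, b_hi)
    # "Clock 2" (assumed pairing): the two cross terms.
    hi_lo = mul57(a_hi, b_lo)
    lo_hi = mul57(a_lo, b_hi)
    # The adders of step 3: align the pieces and sum them.
    return (hi_hi << (2 * LIMB_BITS)) + ((hi_lo + lo_hi) << LIMB_BITS) + lo_lo

a = (1 << 113) - 1
b = (1 << 112) + 12345
print(mul113(a, b) == a * b)
```

With two multipliers handling two pieces per clock, the four partial products take two clocks, which matches the 1 product per 2 clocks rate in the list above.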
After that we'll need matching alignment and addition/subtraction
blocks, but by doing them half-pipelined we can utilize majority of
existing dual-DP hardware and would need very little else, except of
control signals and probably of new feedback data path on the upper
side of the adder. All that could cost us another clock of latency over
DP FMA, but not necessarily so.
Bottom line: QP FMA with throughput of 1 result per 2 clocks and
latency of 8 or 9 clocks.
For POWER8, which has a less distributed VSU, such a modification would be
somewhat easier than for POWER9.
That's what I call a mid-middle ground.
Low-middle ground would be leaving the 53x53=>106bit multipliers
unmodified. The 113x113 multiplication is split into 9 pieces and
a product is delivered every 5 clocks.
High-middle ground is enhancing both VSU pipes and using them to
process two QP FMAs simultaneously for combined throughput equivalent
to fully pipelined.
Another possible high-middle ground is, again, enhancing both VSU pipes
and using them together on a single QP FMA. That would be potentially
best for latency, but does not fit well into philosophy of POWER9
design that tries to minimize high-speed interaction between various
pipes.
EricP wrote:
I think I see what you are getting at - how does this mechanism
transparently redirect the SR-IOV device request into a network request?
The other interpretation is that the unprivileged user is never allowed
direct access to an SR-IOV device--those are restricted to GuestOS (or
more privileged hypervisor threads).
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits with two MS
bits != '00'. I don't see how we would ever need more than 229 bits fed
into the accumulation phase and into the following normalizer.
Of course, all bits that are lower than the LS bit have to be collapsed
(by OR) into the LS bit. Maybe even fewer than 229 bits will do; by now
I am not sure.
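The "collapse by OR" step can be illustrated with a small Python sketch. The helper name collapse_sticky and its interface are hypothetical; it only models the idea that every discarded low-order bit is ORed into the least significant bit that is kept, so that later rounding can still tell an exact value from an inexact one.

```python
def collapse_sticky(value, drop_bits):
    """Drop the low `drop_bits` bits of a non-negative integer,
    ORing all of them into the new least significant bit."""
    assert value >= 0 and drop_bits >= 0
    kept = value >> drop_bits
    lost = value & ((1 << drop_bits) - 1)
    return kept | (1 if lost else 0)

# Nothing nonzero is discarded: the kept bits are unchanged.
print(bin(collapse_sticky(0b101000, 2)))
# A nonzero bit is discarded: the LS bit of the result is forced to 1.
print(bin(collapse_sticky(0b101001, 2)))
```

The same trick bounds the datapath width: bits far below the kept range never need their own wires, only their OR.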
On 5/19/2024 11:37 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm guessing
that this could at least be close to needing an extra cycle on its own
and/or heroic hardware?
This sort of thing is part of what makes proper FMA hopelessly
expensive.
Granted, full FMA also allows faking higher precision using
SIMD vector operations, with math that does not work with double-rounded
FMA instructions.
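To make the point about double rounding concrete, here is a small Python sketch of the classic "two-product" trick, which recovers the rounding error of a multiplication with one FMA. Since not every Python version ships a hardware FMA, the single-rounded FMA is emulated here with exact rationals from the standard fractions module; that emulation, and all function names, are assumptions for illustration and not how any of the hardware discussed here works.

```python
from fractions import Fraction

def fma_single_rounding(a, b, c):
    """Emulated fused multiply-add: exact a*b + c, rounded once."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

def fma_double_rounding(a, b, c):
    """Unfused version: the product is rounded, then the sum is rounded."""
    return a * b + c

def two_product(a, b, fma):
    """Return (p, e) with p = round(a*b) and e = fma(a, b, -p)."""
    p = a * b
    e = fma(a, b, -p)
    return p, e

a = b = 1.0 + 2.0 ** -30
p, e = two_product(a, b, fma_single_rounding)
print(e != 0.0)   # the fused version sees the bits below the ULP
p2, e2 = two_product(a, b, fma_double_rounding)
print(e2 == 0.0)  # the double-rounded version has already lost them
```

With a single rounding, p + e equals the exact product, which is what lets a pair of doubles stand in for a higher-precision value; with the double-rounded version the error term is always zero and the trick collapses.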
Well, and also an issue if one can "just barely" afford to have a single
double-precision unit.
Though, the trick of possibly having four 27-bit multiplies which
combine into a virtual 54 bit multiplier seems like an interesting
possibility, though not great as DSP's don't natively handle this size
(and would be too expensive to stretch it out with LUTs). Likely, one
would need to build it from 34*34->68 bit multipliers (each costing 4
DSPs).
In terms of DSP cost, it would be higher than the current solution:
16 vs 6+4 (10).
But, possibly lower LUT cost (in both the Binary32 and Binary64
multipliers, the shortfall is made up using smaller LUT-based
multipliers).
Though, with the combiner option, one could make a case for, say, a:
S.E15.F66.Z46 format (Z=zeroed/ignored).
Well, and/or accept the wonk of a Binary128 which produces 112 bits of
mantissa, but only uses the high 66 bits or so, but generally this was
worse for some things in some tests than one which simply zeroes the
low-order bits.
But, OTOH, 66*66->112 would allow for possible trickery to fake a full
Binary128 FMUL in software as a multi-part process (when combined with a
Binary128 FADD).
....
Terje
On 5/19/2024 4:16 PM, MitchAlsup1 wrote:
BGB wrote:
On 5/19/2024 11:37 AM, Terje Mathisen wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
This sort of thing is part of what makes proper FMA hopelessly
expensive.
Getting the LoB correctly rounded showed up the generation prior to
FMAC showing up.
Well, in this case, I have neither in a proper sense.
FMAC operators were sorta faked, but mostly exist because they were
needed for RV64G, but double-rounded (and not able to expose anything
that exists below the ULP, unlike proper FMA).
Granted, full FMA also allows faking higher precision using
SIMD vector operations, with math that does not work with
double-rounded FMA instructions.
It also enabled error free floating point calculations, but no existing
FP implementation allows exact FP calculations that do not ALSO SET the
inexact flag !?!? {Whereas My 66000 gets this right}
Dunno.
It seems like the existence of anything below the ULP justifies setting
the inexact flag...
Well, and also an issue if one can "just barely" afford to have a
single double-precision unit.
This is NOT an architectural issue, but an implementation choice issue.
Absent things like microcode or traps, architectural and implementation
choices are closely tied together. Can't have instructions for things
which one can't afford the hardware cost to implement.
Well, and the usefulness of an FPU is dependent on performance.
Inaccurate FPU can still be useful, but slow FPU is not.
Though, the trick of possibly having four 27-bit multiplies which
combine into a virtual 54 bit multiplier seems like an interesting
possibility, though not great as DSP's don't natively handle this size
(and would be too expensive to stretch it out with LUTs). Likely, one
would need to build it from 34*34->68 bit multipliers (each costing 4
DSPs).
This is your implementation choice coloring what you take as
architectural decisions.
In terms of DSP cost, it would be higher than the current solution:
16 vs 6+4 (10).
But, possibly lower LUT cost (in both the Binary32 and Binary64
multipliers, the shortfall is made up using smaller LUT-based
multipliers).
We can now fit (5nm) hundreds of GBOoO cores on a single die. The
difference between a 53×53 tree and a 64×64 tree (makes all problems
vanish) is not visible at this level (100+ cores on a die).
This is your implementation choice coloring your thoughts.
I can afford FPGAs...
I can't afford to get an ASIC made.
So, implementation choices here are:
FPGA;
Nothing.
What kind of car do you drive ??
I don't drive a car...
I tend to fairly rapidly get tired out if trying to drive.
Michael S wrote:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra
cycle on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.
Terje
Michael S <[email protected]> writes:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).
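The 226-bit figure above follows from the general rule that two n-bit significands multiply to at most 2n bits (binary128 has a 113-bit significand once the hidden bit is counted). A minimal sketch that checks the rule for small widths, assuming 64-bit integers are wide enough; the function names are illustrative only and do not come from the thread:

```c
#include <stdint.h>

/* Number of significant bits in v (0 for v == 0). */
int bit_length(uint64_t v)
{
    int n = 0;
    while (v) {
        n++;
        v >>= 1;
    }
    return n;
}

/* Width in bits of the largest product of two n-bit significands.
   Valid for 1 <= n <= 32 so the product still fits in 64 bits. */
int max_product_bits(int n)
{
    uint64_t max_sig = (UINT64_C(1) << n) - 1; /* all n bits set */
    return bit_length(max_sig * max_sig);
}
```

For a binary32-sized significand, max_product_bits(24) gives 48; the same doubling applied to 113 bits gives the 226 bits quoted in the post.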
Michael S wrote:
On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would
be an entirely different beast; I would expect a throughput of
1 per cycle and a latency of (maybe) one cycle more than 64-bit
FMA.
The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but
I'm guessing that this could at least be close to needing an
extra cycle on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.
Terje
1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user
code, rather than in test vectors, is lower than DP by another order
of magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it
can be achieved by introduction of one more pipeline stage
everywhere. Since we are discussing high-latency design akin to
POWER9, the relative cost of another stage would be lower. BTW,
according to POWER9 manual, even for SP/DP FMA the latency is not
constant. It varies from 5 to 7.
So, IMHO, what you do to handle sub-normal inputs should depend on
what ends up smaller or faster, not on some abstract principles.
For less important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.
Besides, pre-normalization vs wider post-normalization are not the
only available choices. When multiplier is naturally segmented into
57-bit section, there exists, for example, an option of
pre-normalization by full section. It looks very simple on the
front and saves quite a lot of shifter's width on the back.
But the best option is probably described in above post by Mitch.
If I understood his post correctly, he suggests to have two
alignment stages: one after multiplication and another one after
add/sub. The shift count for a first stage is calculated from
inputs in parallel with multiplication. The first alignment stage
does not try to achieve a perfect normalization, but it does
enough for cutting the width of following adder from 3N to 2N+eps.
I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block.
In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.
Terje
On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast; I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra
cycle on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.
Terje
1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user code,
rather than in test vectors, is lower than DP by another order of
magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it can
be achieved by introduction of one more pipeline stage everywhere.
Since we are discussing high-latency design akin to POWER9, the
relative cost of another stage would be lower. BTW, according to POWER9
manual, even for SP/DP FMA the latency is not constant. It varies from
5 to 7.
So, IMHO, what you do to handle sub-normal inputs should depend on what
ends up smaller or faster, not on some abstract principles. For less
important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.
Besides, pre-normalization vs wider post-normalization are not the only
available choices. When multiplier is naturally segmented into 57-bit
sections, there exists, for example, an option of pre-normalization by
full section. It looks very simple on the front and saves quite a lot
of shifter's width on the back.
But the best option is probably described in above post by Mitch. If I
understood his post correctly, he suggests to have two alignment stages:
one after multiplication and another one after add/sub. The shift count
for a first stage is calculated from inputs in parallel with
multiplication. The first alignment stage does not try to achieve a
perfect normalization, but it does enough for cutting the width of
following adder from 3N to 2N+eps.
On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:
On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:
On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
[...]
I remember a little test that Microsoft made wrt 50,000 concurrent
OVERLAPPED ops in IOCP vs an event driven model actually creating a
windows event per connection multiplexing WFMO in several threads.
The event model did not perform as well, but it did not do too bad
either. I wonder if I can still find that paper. Back in 2002 or
something. Hard to remember right now.
I am having trouble finding it. I do remember:
https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN
I just found an old post from me back in 2003 with a link to the paper:
___________________
You can get 50,000+ concurrent connections using IOCP, check out the
following link:
http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128
You do have to do some memory management to get there, like posting zero
byte receives to ensure that pending recvs don't lock their buffers,
you can also restrict the amount of pending sends the server has all
together
[...]
___________________
The way back machine found it, I think!
https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128
Nice!
On 5/19/2024 7:10 PM, MitchAlsup1 wrote:
Kahan has several lectures about this....
There have been apparently more things killed off by slow performance
than by lack of FPU accuracy.
Say, at the time, performance apparently killed off:
Amiga (killed off by its slow graphics)
Bit planar graphics rather sucking if one wants fast screen
redraws;
M68K, killed off for being too slow vs x86;
Cyrix, because its Pentium equivalent was slow at running Quake;
...
On Mon, 20 May 2024 14:22:00 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <[email protected]> wrote:
Michael S wrote:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would
be an entirely different beast; I would expect a throughput of
1 per cycle and a latency of (maybe) one cycle more than 64-bit
FMA.
The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but
I'm guessing that this could at least be close to needing an
extra cycle on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
They are not, this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.
Terje
1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where probability of encountering sub-normal inputs in real user
code, rather than in test vectors, is lower than DP by another order
of magnitude.
3. Even if, for reason unclear to me, it is considered the goal, it
can be achieved by introduction of one more pipeline stage
everywhere. Since we are discussing high-latency design akin to
POWER9, the relative cost of another stage would be lower. BTW,
according to POWER9 manual, even for SP/DP FMA the latency is not
constant. It varies from 5 to 7.
So, IMHO, what you do to handle sub-normal inputs should depend on
what ends up smaller or faster, not on some abstract principles.
For less important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
with not doing pre-normalization, but the reason for it would be
different from 'same speed'.
Besides, pre-normalization vs wider post-normalization are not the
only available choices. When multiplier is naturally segmented into
57-bit section, there exists, for example, an option of
pre-normalization by full section. It looks very simple on the
front and saves quite a lot of shifter's width on the back.
But the best option is probably described in above post by Mitch.
If I understood his post correctly, he suggests to have two
alignment stages: one after multiplication and another one after
add/sub. The shift count for a first stage is calculated from
inputs in parallel with multiplication. The first alignment stage
does not try to achieve a perfect normalization, but it does
enough for cutting the width of following adder from 3N to 2N+eps.
I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block.
In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.
Terje
For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.
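To illustrate the point for plain multiplication (not the FMA case), here is a minimal sketch in binary64, assuming the default round-to-nearest mode and no flush-to-zero setting. Both inputs are below 2^-1022, so the exact product is below 2^-2044, far under half of the smallest subnormal (2^-1075), and only its zeroness and sign survive. The helper name mul_runtime is made up for this example:

```c
#include <float.h>

/* Multiply through volatile temporaries so the compiler cannot fold
   the product at compile time; the store also forces rounding to
   double on targets that compute in wider registers. */
double mul_runtime(double x, double y)
{
    volatile double a = x;
    volatile double b = y;
    volatile double p = a * b;
    return p;
}
```

Calling mul_runtime(DBL_MIN / 2, DBL_MIN / 2) multiplies two subnormals and yields zero under the stated assumptions; in a directed rounding mode the result could instead be the smallest subnormal, which is why the sign and non-zeroness still matter there.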
On 5/20/2024 7:36 AM, Michael S wrote:
For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.
For most non-tiny formats, the seeming advantage of subnormal numbers
seems small, in any case.
But, yeah, in any case I would almost prefer if there could be a
separate/cheaper standard, probably mostly aimed at
embedded/microcontroller style use-cases (rather than "general
purpose"), and would likely relax the requirements a fair bit.
Say, likely target might be, say:
FADD/FSUB/FMUL;
Binary16 and Binary32 as high-priority formats;
Binary64 as optional (but nice to have);
Probably DAZ/FTZ;
Potentially allow for truncate-only rounding.
Assumption being that larger or higher precision cases would fall back
to software emulation.
Could optionally have some 8-bit FP formats, but 8-bit FP is a little
bit too limited for general-purpose use.
Likely main candidates being:
S.E4.F3 (Bias=7)
S.E3.F4 (Bias=7|8, ~ Unit Range)
More or less A-Law without the XOR.
Though, A-Law can also be interpreted as a ~ 12 bit integer value.
Annoyingly, exact bias depends on context for this one
(eg: 8/7/3/0)...
I had also used:
E4.F4
E4.F3.S
But, this is wonky (and the possible merit of E4.F3.S is defeated once
one also needs S.E4.F3 or S.E3.F4, as these are the "actually used in
the wild" formats, so may have been a mistake).
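As a rough illustration of the S.E4.F3 (Bias=7) layout listed above, the following sketch decodes one byte into a double. The post does not say how zero, subnormals, Inf or NaN are encoded, so this sketch assumes (purely for illustration) that the all-zero magnitude means zero, that every other code is a normal number with a hidden leading 1, and that there are no Inf/NaN codes. The function name is hypothetical:

```c
/* Decode an 8-bit S.E4.F3 value with bias 7.
   Assumptions (not from the post): magnitude bits all zero => 0.0,
   no subnormals, no Inf/NaN encodings. */
double decode_s_e4_f3(unsigned char bits)
{
    int sign = (bits >> 7) & 1;   /* 1 sign bit      */
    int exp  = (bits >> 3) & 0xF; /* 4 exponent bits */
    int frac = bits & 0x7;        /* 3 fraction bits */

    if (exp == 0 && frac == 0)
        return sign ? -0.0 : 0.0;

    double value = 1.0 + frac / 8.0; /* hidden 1 plus fraction */
    int e = exp - 7;                 /* remove the bias        */
    while (e > 0) { value *= 2.0; e--; }
    while (e < 0) { value *= 0.5; e++; }
    return sign ? -value : value;
}
```

Under these assumptions 0x38 decodes to 1.0, 0x3C to 1.5 and 0xC0 to -2.0.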
BGB wrote:
On 5/20/2024 7:36 AM, Michael S wrote:
For subnormal x subnormal you don't need result of multiplication at
all. All you need to know is if it's zero or not and what sign.
Even that is needed only in non-default rounding modes and for inexact
flag in default mode.
For most non-tiny formats, the seeming advantage of subnormal numbers
seems small, in any case.
There is, it is called Posit (or UNUM depending).
No subnormals, wider range than IEEE, more precision than IEEE
(most of the time). Whether it is better overall is still a
matter of debate. It is harder to implement than IEEE but
just barely.
Granted, they are not necessarily the option one would go if they wanted "cheapest possible FPU that is still good enough to be usable".
Though, the point at which an FPU manages to suck badly enough that one
needs to resort to software emulation to make software work, is probably
a lower limit.
Luckily, "uses 754 formats, but with aggressive cost cutting" can be
"good enough", and so long as they more-or-less deliver a full width mantissa, and can exactly compute exact-value calculations, most
software is generally going to work.
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so there
is a lower limit here.
As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.
Terje
BGB <[email protected]> schrieb:
Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".
Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is probably a lower limit.
Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.
This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.
An example of a more interesting question is
if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}
On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
BGB <[email protected]> schrieb:
Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".
Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is
probably a lower limit.
Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.
This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.
An example of a more interesting question is
if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}
If I am not mistaken, that should hold on VAX, which has floating-point
very close to BGB ideal. It looks like it would hold even on less
robust formats, like IBM's hex float. I wonder where it is not true?
The biggest difference between IEEE and VAX is that on IEEE when (a > b)
then (a - b > 0) while on VAX (a - b >= 0).
Of course, IEEE has non-intuitive cases as well.
if (!(a < 0)) {
    if (!(b < 0)) {
        if (!(a + b >= a)) {
            printf("It's IEEE 754, baby!\n");
        }
    }
}
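The IEEE-versus-VAX difference mentioned above (a > b implying a - b > 0) comes from gradual underflow: the difference of two nearby tiny numbers lands on a subnormal instead of being flushed to zero. A minimal sketch for binary64, assuming subnormals are enabled (no FTZ/DAZ mode); the function name is invented for the example:

```c
#include <float.h>

/* Returns 1 if x - y, computed at run time in double, is > 0. */
int difference_is_positive(double x, double y)
{
    volatile double a = x;
    volatile double b = y;
    volatile double d = a - b; /* rounded to double here */
    return d > 0.0;
}
```

With a = 1.5 * DBL_MIN and b = DBL_MIN the exact difference is DBL_MIN / 2, a subnormal, so the function returns 1 on an IEEE system with subnormals; a format without subnormals would have to return zero for that difference.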
Michael S wrote:
On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
BGB <[email protected]> schrieb:
Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".
Though, the point at which an FPU manages to suck badly enough
that one needs to resort to software emulation to make software
work, is probably a lower limit.
Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.
This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.
An example of a more interesting question is
if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}
If I am not mistaken, that should hold on VAX, which has
floating-point very close to BGB ideal. It looks like it would hold
even on less robust formats, like IBM's hex float. I wonder where
it is not true?
The biggest difference between IEEE and VAX is that on IEEE when (a >
b) then (a - b > 0) while on VAX (a - b >= 0).
Of course, IEEE has non-intuitive cases as well.
if (!(a < 0)) {
    if (!(b < 0)) {
        if (!(a + b >= a)) {
            printf("It's IEEE 754, baby!\n");
        }
    }
}
What happens when a and/or b is a NaN?
Comparisons with NaN should return false, so !(NaN < 0) will be true
(and the same for b), while !(NaN+b >= NaN) will also return true.
Is that what you were thinking of?
Terje
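The NaN reading described above can be checked with a small sketch that wraps the three negated tests in a function and reports whether the innermost branch is reached. It assumes an IEEE 754 conforming C implementation compiled without value-unsafe options such as -ffast-math; the function name is illustrative:

```c
#include <math.h> /* NAN macro, used when calling the function */

/* Returns 1 when all three negated comparisons pass, i.e. when the
   printf in the quoted snippet would be reached. */
int reaches_printf(double a, double b)
{
    if (!(a < 0)) {
        if (!(b < 0)) {
            if (!(a + b >= a)) {
                return 1;
            }
        }
    }
    return 0;
}
```

reaches_printf(NAN, 1.0) and reaches_printf(1.0, NAN) both return 1, because every ordered comparison involving a NaN is false, while ordinary non-negative inputs such as (1.0, 2.0) return 0.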
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so there
is a lower limit here.
On Mon, 20 May 2024 21:17:15 +0200
Terje Mathisen <[email protected]> wrote:
As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.
Terje
I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding modes except default (RNE).
Errm, I was promoting the idea of cost-cut floating point, not blatantly
broken floating point...
On Tue, 21 May 2024 05:49:55 -0000 (UTC)
Thomas Koenig <[email protected]> wrote:
BGB <[email protected]> schrieb:
Granted, they are not necessarily the option one would go if they
wanted "cheapest possible FPU that is still good enough to be
usable".
Though, the point at which an FPU manages to suck badly enough that
one needs to resort to software emulation to make software work, is
probably a lower limit.
Luckily, "uses 754 formats, but with aggressive cost cutting" can
be "good enough", and so long as they more-or-less deliver a full
width mantissa, and can exactly compute exact-value calculations,
most software is generally going to work.
This will require extensive testing and possibly modification for
a lot of software ported to such a system. This will drive up
the total cost, presumably far more than any hardware savings.
But OTOH, if 1.0+2.0 gives 2.999999, that is, not good enough, so
there is a lower limit here.
An example of a more interesting question is
if (a >= 0.) {
    if (b >= 0) {
        if (a + b < a) {
            printf("We should never get here!\n");
            abort();
        }
    }
}
If I am not mistaken, that should hold on VAX, which has floating-point
very close to BGB ideal. It looks like it would hold even on less
robust formats, like IBM's hex float. I wonder where it is not true?
I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).
Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.
Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.
But that ship sailed 50 years ago.
Wouldn't that just kick the problem down the street?
For example, what should `x < y` return when both `x` and `y` are "infinity+epsilon"?
Stefan
Currently, FCMPEQ will be false on NaN, whereas FCMPGT ignores NaN's.
I guess possible could be to add an FCMPGE instruction, and then make
both FCMPGT and FCMPGE be false on NaN (where the LT and LE cases can be
handled by flipping the arguments, so would still be false on NaN).
So, as-is:
if(!(a==a))
{
//NaN
}
But, as-is:
if(a>0)
{
//May still potentially get here with NaN
}
BGB wrote:
Errm, I was promoting the idea of cost-cut floating point, not blatantly
broken floating point...
Would you promote the idea where the customer could specify whether his
car had air bags and crash safety cell or not ??
Same point here.
Michael S wrote:
On Mon, 20 May 2024 21:17:15 +0200
Terje Mathisen <[email protected]> wrote:
As you note, it is only when using RoundToPlus (or Minus) Infinity
that an arbitrary small product can still produce a non-zero result.
Terje
I think, we were discussing multiplication stage of FMA rather than
multiplication proper.
In case of FMA, zeroness (zeroity ?) and sign of tiny product matter in
all standard IEEE rounding mode except default (RNE).
Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.
Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.
But that ship sailed 50 years ago.
MitchAlsup1 wrote:
Imagine, instead, if IEEE 754 had defined positive underflow with the
result of positive tiny, negative underflow with negative tiny,
positive overflow with positive infinity-epsilon and negative
overflow with negative infinity+epsilon.
Here, the fact overflow or underflow happened is recorded in the
result, and these results remain identifiable from real infinities
or real zeros.
But that ship sailed 50 years ago.
If an overflow occurs and that exception is masked, x86 returns a
value of either +-largest finite number (LFN) or +-infinity (INF),
depending on the rounding mode.
(vol-1 Arch manual, section 4.9.1.4, Numeric Overflow Exception,
Table 4-10. Masked Responses to Numeric Overflow)
Rounding mode   Sign of result   Result
-------------   --------------   -------------------------------
To nearest      +                +∞
                –                –∞
Toward –∞       +                Largest finite positive number
                –                –∞
Toward +∞       +                +∞
                –                Largest finite negative number
Toward zero     +                Largest finite positive number
                –                Largest finite negative number
The difference seems to be that INF is a sticky overflow, LFN is not.
Would this not satisfy everyone?
The problem is that it requires diddling the control register to change
the round mode, as opposed to round mode on each float instruction.
Or maybe even the LFN/INF overflow choice should be a separate option independent of round control bits.
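The masked-overflow table quoted above can be restated as a small pure function. This is only a sketch of the table's logic for binary64, not how any hardware or library exposes it; the enum and function names are invented for the example:

```c
#include <float.h> /* DBL_MAX  */
#include <math.h>  /* INFINITY */

enum rounding { TO_NEAREST, TOWARD_NEG_INF, TOWARD_POS_INF, TOWARD_ZERO };

/* Result delivered on a masked overflow, per the quoted table:
   infinity when the rounding direction points away from zero for
   this sign, otherwise the largest finite number (LFN). */
double masked_overflow_result(enum rounding mode, int negative)
{
    int to_infinity;
    switch (mode) {
    case TO_NEAREST:     to_infinity = 1;         break;
    case TOWARD_NEG_INF: to_infinity = negative;  break;
    case TOWARD_POS_INF: to_infinity = !negative; break;
    default:             to_infinity = 0;         break; /* TOWARD_ZERO */
    }
    double magnitude = to_infinity ? INFINITY : DBL_MAX;
    return negative ? -magnitude : magnitude;
}
```

For example, a positive overflow under "toward minus infinity" gives DBL_MAX, while a negative overflow in the same mode gives negative infinity.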
On 5/20/2024 7:28 AM, Terje Mathisen wrote:
Anton Ertl wrote:
Michael S <[email protected]> writes:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).
This is the part of Mitch's explanation that I have never been able to
totally grok, I do think you could get away with less bits, but only if
you can collapse the extra mantissa bits into sticky while aligning the
product with the addend. If that takes too long or it turns out to be
easier/faster in hardware to simply work with a much wider mantissa,
then I'll accept that.
I don't think I've ever seen Mitch make a mistake on anything like
this!
It is a mystery, though seems like maybe Binary128 FMA could be done in
software via an internal 384-bit intermediate?...
My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
228 bits. If one adds another 116 bits (for maximal FADD), this comes
to 344.
In this case, 384 bits would be because my "_BitInt" support code pads
things to a multiple of 128 bits (for integer types larger than 256
bits).
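The width estimate above is simple arithmetic and can be written out as a sketch. The 2-bit padding, the 116-bit addend allowance and the 128-bit rounding unit are the figures from the post; the function names are made up here:

```c
/* Round bits up to the next multiple of unit (unit > 0). */
unsigned round_up(unsigned bits, unsigned unit)
{
    return (bits + unit - 1) / unit * unit;
}

/* Intermediate width for a software FMA as estimated in the post:
   two padded significands multiplied, plus room for the addend. */
unsigned fma_intermediate_bits(unsigned sig_bits, unsigned pad_bits,
                               unsigned addend_bits)
{
    unsigned operand = sig_bits + pad_bits; /* 112 + 2 = 114   */
    return 2 * operand + addend_bits;       /* 228 + 116 = 344 */
}
```

fma_intermediate_bits(112, 2, 116) gives 344, and round_up(344, 128) gives the 384 bits mentioned for the "_BitInt" support code.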
It isn't fast, but I am not against having Binary128 being slower, since
if one is using Binary128 ("long double" or "__float128" in this case),
it is likely the case that precision is more a priority than speed.
Though, as of yet, there is no Binary128 FMA operation (in the software
runtime). Could potentially add this in theory.
I guess, maybe also possible could be whether to add the
FADDX/FMULX/FMACX instructions in a form where they are allowed, but
will be turned into runtime traps (would likely route them through the
TLB Miss ISR, which thus far has ended up as a catch-all for this sort
of thing...).
Though, likely more efficient would still be "just use the runtime
calls".
Terje
BGB-Alt wrote:
On 5/20/2024 7:28 AM, Terje Mathisen wrote:
Anton Ertl wrote:
Michael S <[email protected]> writes:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <[email protected]> wrote:
The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of multiplication is 226 bits
The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign as the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).
This is the part of Mitch's explanation that I have never been able
to totally grok, I do think you could get away with less bits, but
only if you can collapse the extra mantissa bits into sticky while
aligning the product with the addend. If that takes too long or it
turns out to be easier/faster in hardware to simply work with a much
wider mantissa, then I'll accept that.
I don't think I've ever seen Mitch make a mistake on anything like
this!
It is a mystery, though seems like maybe Binary128 FMA could be done in
software via an internal 384-bit intermediate?...
My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
228 bits. If one adds another 116 bits (for maximal FADD), this comes
to 344.
Maximal product with minimal augend::
pppppppp-pppppppp-aaaaaaaa
Maximal augend with minimal product
aaaaaaaa-pppppppp-pppppppp
So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::
   max                         min
   aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
            pppppppp-pppppppp
The output of the product is still in carry-save form and the augend is
in pure binary so the adder is 3-input for 2×-width. This generates a
carry into the high order incrementor.
So one has a sticky generator for the right hand side augend, and an
incrementor for the left hand side augend. When doing high speed
denormals one cannot count on the left hand side of product to have HoBs
set with standard ramifications (imagine a denorm product and a denorm
augend and you want the right answer.)
Any way you cook it, you have a 4× wide intermediate (minus 2-bits
IIRC).
4×112 = 448 -2 = 446.
There is a reason these things are not standard at this point of
technology.
Could you do it (IEEE accuracy) with less HW--yes, but only if you allow
certain special cases to take more cycles in calculation. At a certain
point (a point made by Terje) it is easier to implement with wide integer
calculations 128+128 and/or 128*128 along with double width shifts,
inserts, and extracts.
IEEE did not make these things any easier by having a 2× std width
fraction have 2×+3 bits of length requiring 8 multiplications with
minimal HW instead of 4 multiplications. On the other hand IBM did us
no favors with Hex FP either (keeping the exponent size the same and
having 2×+8 bits of fraction.)
On 5/20/2024 8:44 AM, EricP wrote:
Chris M. Thomasson wrote:
On 5/19/2024 3:08 PM, Chris M. Thomasson wrote:
On 5/19/2024 3:04 PM, Chris M. Thomasson wrote:
On 5/19/2024 2:55 PM, Chris M. Thomasson wrote:
[...]
I remember a little test that Microsoft made wrt 50,000 concurrent
OVERLAPPED ops in IOCP vs an event driven model actually creating
a windows event per connection multiplexing WFMO in several
threads. The event model did not perform as well, but it did not
do too bad either. I wonder if I can still find that paper. Back
in 2002 or something. Hard to remember right now.
I am having trouble finding it. I do remember:
https://learn.microsoft.com/en-us/previous-versions/windows/it-pro/windows-2000-server/cc959494(v=technet.10)?redirectedfrom=MSDN
I just found an old post from me back in 2003 with a link to the paper:
___________________
You can get 50,000+ concurrent connections using IOCP, check out the
following link:
http://www.microsoft.com/mspress/books/sampchap/5726a.asp?#128
You do have to do some memory management to get there, like posting zero
byte receives to ensure that pending recvs don't lock their buffers, you can
also restrict the amount of pending sends the server has all together
[...]
___________________
The way back machine found it, I think!
https://web.archive.org/web/20030216222720/https://www.microsoft.com/mspress/books/sampchap/5726a.asp#128
Nice!
Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).
I never tried APC wrt IO and a socket server on windows before! I have created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.
In my case, BGBCC supports __int128 operations whether or not the ALUX instructions are enabled (along with _BitInt, *1).
*1:
_BitInt(56) x0; //maps to 64-bit
_BitInt(64) x1; //maps to 64-bit
_BitInt(80) x2; //maps to 128-bit
_BitInt(128) x3; //maps to 128-bit
_BitInt(160) x4; //maps to 256-bit
_BitInt(256) x5; //maps to 256-bit
_BitInt(272) x6; //maps to 384-bit
...
All sizes beyond 256 bits map to the next integer multiple of 128
bits. The 256-bit type is special, in that it has its own dedicated
logic, but exists via the _BitInt type. For 384 and beyond, generic
logic is used that deals with any size value, but is slower.
Can note that in my implementation, _BitInt does not enforce modulo
behavior in the case of overflow (it is modulo only to the size of the
container; enforcing odd-bit modulo behavior would add a fair bit of
cost to using them).
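The mapping listed above can be modelled in a few lines. This is a sketch of the rule as described in the post, not BGBCC's code: the function names are invented, values are treated as unsigned for simplicity, and widths below 56 bits are not covered by the post, so mapping everything up to 64 bits onto a 64-bit container is an assumption.

```python
def container_bits(declared_bits: int) -> int:
    """Container width for a _BitInt(declared_bits), per the table in the post."""
    if declared_bits <= 0:
        raise ValueError("width must be positive")
    if declared_bits <= 64:      # assumption: small widths all share 64 bits
        return 64
    if declared_bits <= 128:
        return 128
    if declared_bits <= 256:
        return 256
    # Beyond 256 bits: round up to the next multiple of 128.
    return -(-declared_bits // 128) * 128


def wrap_to_container(value: int, declared_bits: int) -> int:
    """Overflow behaviour described in the post: modulo the container size."""
    return value & ((1 << container_bits(declared_bits)) - 1)


def wrap_to_declared(value: int, declared_bits: int) -> int:
    """Strict odd-bit modulo, which the post says is not enforced."""
    return value & ((1 << declared_bits) - 1)
```

For example, 2**80 stored in a `_BitInt(80)` survives under container wrapping (it still fits in 128 bits) but would become 0 under strict 80-bit modulo, which is the extra masking cost being avoided.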
Terje
Chris M. Thomasson wrote:
On 5/20/2024 8:44 AM, EricP wrote:
Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).
I never tried APC wrt IO and a socket server on windows before! I have
created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.
APC's are I/O completion callback subroutines with 1 to 3 arguments.
I use them to build callback driven state machines for each network
I/O channel, similar to device drivers. Each network channel requires
only a small amount of user mode memory, and all server network connections
can be serviced by a single thread or a small fixed pool of comms threads.
This keeps the cost for each new connection to mostly just the kernel's
network memory.
WinNT originally only had APC's. It inherited the concept from
VMS's Asynchronous System Trap (AST), which inherited the concept
from RSX-11 on PDP-11.
The difference between Windows APC and RSX/VMS AST is how they are delivered.
Despite the name, Windows user mode APC's are NOT delivered to the thread
asynchronously as interrupts but only at specified delivery points,
which means user mode APC's are essentially a synchronous polled delivery,
whereas VMS AST's are delivered at any time using interrupt semantics.
(Windows does have real asynchronous-delivery APC's but inside the kernel
where they are used to interrupt or wake up a thread for I/O completion
and various other things.)
User mode APC's are simpler from a user mode programming point of view than
VMS's AST's but because APC's have a polled delivery you can't use user mode
APC's to interrupt and force a thread to do something, as AST's can.
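The polled delivery described above can be shown with a toy model. This is not the Windows API: `ToyThread`, `queue_apc` and `alertable_wait` are invented names, loosely standing in for queuing a user mode APC and then entering an alertable wait. The point is only that a queued callback does nothing until the thread itself reaches a delivery point.

```python
from collections import deque


class ToyThread:
    """Toy model of a thread with a user mode APC queue (polled delivery)."""

    def __init__(self):
        self._pending = deque()

    def queue_apc(self, callback, *args):
        # Queuing never interrupts the thread; it only records the callback.
        self._pending.append((callback, args))

    def alertable_wait(self):
        # Delivery point: run everything queued so far, in FIFO order.
        delivered = 0
        while self._pending:
            callback, args = self._pending.popleft()
            callback(*args)
            delivered += 1
        return delivered


completions = []
thread = ToyThread()
thread.queue_apc(completions.append, "recv done")
before_wait = list(completions)        # still empty: nothing delivered yet
delivered = thread.alertable_wait()    # the callback runs here, and only here
```

An AST-style delivery would instead run the callback at the moment it is queued, whatever the thread happened to be doing, which is why it can force a thread to act and the polled form cannot.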
On Fri, 24 May 2024 09:43:19 -0400, EricP
<[email protected]> wrote:
Chris M. Thomasson wrote:
On 5/20/2024 8:44 AM, EricP wrote:
Thanks. I have never used the thread-per-client model in my servers.
I have been using async I/O and Asynchronous Procedure Calls (APC)
for I/O completion on Windows for 30+ years. IO Completion Ports (IOCP),
which were added to Windows later, have similar functionality
(perhaps IOCP might have slightly better scaling with many cores).
I never tried APC wrt IO and a socket server on windows before! I have
created several types wrt the book I finally found on the wayback
machine. IOCP was the thing to use.
APC's are I/O completion callback subroutines with 1 to 3 arguments.
I use them to build callback driven state machines for each network
I/O channel, similar to device drivers. Each network channel requires
only a small amount of user mode memory, and all server network connections
can be serviced by a single thread or a small fixed pool of comms threads.
This keeps the cost for each new connection to mostly just the kernel's
network memory.
WinNT originally only had APC's. It inherited the concept from
VMS's Asynchronous System Trap (AST), which inherited the concept
from RSX-11 on PDP-11.
I can't speak to "originally" as I never used NT3.x, but NT4.x allowed asynchronous I/O calls to signal events on completion (or failure). I
used events with WaitForMultipleObjects [*] to mix file and socket
operations in single-thread servers.
[*] like select() or poll() in Unix. For a long time the Windows
"WaitFor..." calls could NOT directly monitor sockets (sockets were
not files), but they could monitor user events, and both the
file and socket APIs supported using completion events.
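For readers who know the Unix side better, here is a rough sketch of the select()/poll() style of waiting that the footnote compares WaitForMultipleObjects to. It uses Python's standard `selectors` and `socket` modules; the helper name `wait_for_any` and the labels are invented, and it only waits on sockets, so it does not show the file/socket mixing that completion events allowed on Windows.

```python
import selectors
import socket


def wait_for_any(sources, timeout=1.0):
    """Block until at least one labelled socket is readable; return the ready labels."""
    sel = selectors.DefaultSelector()
    try:
        for label, sock in sources.items():
            sel.register(sock, selectors.EVENT_READ, data=label)
        return sorted(key.data for key, _events in sel.select(timeout))
    finally:
        sel.close()


# Two local socket pairs stand in for two client connections.
conn_a, peer_a = socket.socketpair()
conn_b, peer_b = socket.socketpair()
peer_a.send(b"x")                      # only connection "a" has data waiting
ready = wait_for_any({"a": conn_a, "b": conn_b})
for s in (conn_a, peer_a, conn_b, peer_b):
    s.close()
```

A single thread looping on a call like this, then dispatching on whichever source is ready, is the single-thread server shape described in the post.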
APCs might have been more efficient, but I only ever used them in
conjunction with threads - I never tried to write a single-thread
server that performed operations on multiple files or sockets using
only APC.
It's a bit difficult for me to remember right now, but did you use
Kernel APC's? Iirc, the pthread-win32 lib used kernel APC's to help
emulate breaking into a thread at any time. Think of PTHREAD_CANCEL_ASYNCHRONOUS. Here is the lib:
https://sourceware.org/pthreads-win32/
Way back, I used this lib all the time. However, I never used pthread cancellation. Just never liked it.
[...]
Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.
On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:
Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.
I think it was a particular version of the old Mac OS, from around 1990 or
so, that implemented a really amazing hack. Some 32-bit machines had
hardware floating-point, others didn’t. So developers of numerics-intensive
apps had to build two versions of their code, one with the floating-point
instructions, the other with calls to Apple’s SANE library.
The hack involved running code built to use hardware floating-point
instructions, on hardware that didn’t have them. The instructions were of
course trapped and emulated. But more than that, the system would patch
the instruction that caused the trap, turning it into a direct call into
the emulation routine. So after the first execution, each such instruction
would run much faster. Until the code got unloaded from RAM and the patch
was lost, of course.
Lawrence D'Oliveiro wrote:
On Mon, 13 May 2024 21:16:48 +0000, MitchAlsup1 wrote:
Emulation is slow when trap overhead is large and not-slow when trap
overhead is small.
I think it was a particular version of the old Mac OS, from around 1990 or
so, that implemented a really amazing hack. Some 32-bit machines had
hardware floating-point, others didn’t. So developers of numerics-intensive
apps had to build two versions of their code, one with the floating-point
instructions, the other with calls to Apple’s SANE library.
The hack involved running code built to use hardware floating-point
instructions, on hardware that didn’t have them. The instructions were of
course trapped and emulated. But more than that, the system would patch
the instruction that caused the trap, turning it into a direct call into
the emulation routine. So after the first execution, each such instruction
would run much faster. Until the code got unloaded from RAM and the patch
was lost, of course.
This only works when each FP instruction is at least as long as a
function call. This particular approach was standard on PCs more or less
from the very beginning (i.e. 1981++):
You can also do it the other way around: always compile a function call,
but on a machine that has an FPU use a dummy emulation library that back-patches the call to become an FPU instruction, so that each
emulation function is called at most once.
To be honest, I'm not sure which way around the HP 2100 used.
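The trap-then-patch trick can be illustrated with a toy interpreter. Everything here is hypothetical (`ToyMachine`, the `FADD` and `CALL_FADD` opcodes, the tuple encoding); it ignores instruction lengths entirely, so it does not model the constraint that the FP instruction must be at least as long as the call that replaces it.

```python
class ToyMachine:
    """Toy model: FP instructions trap on a machine without an FPU, then get patched."""

    def __init__(self, program, has_fpu=False):
        self.program = list(program)   # list of (opcode, a, b) tuples
        self.has_fpu = has_fpu
        self.traps = 0                 # how many times the slow trap path ran

    @staticmethod
    def _emulate_fadd(a, b):
        # Stand-in for the software floating-point routine.
        return a + b

    def run(self):
        results = []
        for pc, (op, a, b) in enumerate(self.program):
            if op == "FADD" and self.has_fpu:
                results.append(a + b)                      # native execution
            elif op == "FADD":
                self.traps += 1                            # trap: slow path
                results.append(self._emulate_fadd(a, b))
                self.program[pc] = ("CALL_FADD", a, b)     # patch in place
            elif op == "CALL_FADD":
                results.append(self._emulate_fadd(a, b))   # direct call, no trap
            else:
                raise ValueError(f"unknown opcode {op!r}")
        return results


machine = ToyMachine([("FADD", 1.5, 2.0), ("FADD", 0.25, 0.25)])
first_pass = machine.run()     # each FADD traps once and is patched
traps_after_first = machine.traps
second_pass = machine.run()    # same results, no further traps
```

The reverse scheme mentioned above would start with `CALL_FADD` everywhere and have the library stub rewrite each slot to `FADD` when an FPU is present, so each stub is entered at most once.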