Forum: >>> Magnum BBS <<<

Whither the Mill?

From Stephen Fuld@21:1/5 to All on Wed Dec 13 08:25:39 2023

When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

Inquiring minds want to know.

--
- Stephen Fuld
(e-mail address disguised to prevent spam)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Stephen Fuld on Wed Dec 13 17:32:54 2023

Stephen Fuld <[email protected]d> writes:

When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

There might even be some way of renting time on a real
emulator from Cadence (Palladium) or Synopsys (ZeBu).

Although in my experience those who have them use them
24x7.


From George Neuner@21:1/5 to [email protected] on Fri Dec 15 12:48:00 2023

On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
<[email protected]d> wrote:

When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

Inquiring minds want to know.

There was a post, ostensibly from Ivan, in their web forum just a few
days ago. No news though - just an acknowledgement of another user's
post.

Last I heard, the next (current?) round of financing was - at least in
part - to be used for FPGA "proof of concept" implementations.

Problem is the Mill really is a SoC, and (to me at least) the design
appears to be so complex that it would require a large, top-of-line
(read "expensive") FPGA to fit all the functionality.

Then there is their idea that everything - from VHDL to software build
toolchain to system software - be automatically generated from a
simple functional specification. Getting THAT right is likely proving
far more difficult than simply implementing a fixed design in an FPGA.

YMMV,
George


From EricP@21:1/5 to BGB on Fri Dec 15 16:17:41 2023

BGB wrote:

On 12/15/2023 11:48 AM, George Neuner wrote:

On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
<[email protected]d> wrote:

When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

Inquiring minds want to know.

There was a post, ostensibly from Ivan, in their web forum just a few
days ago. No news though - just an acknowledgement of another user's
post.

Last I heard, the next (current?) round of financing was - at least in
part - to be used for FPGA "proof of concept" implementations.

Problem is the Mill really is a SoC, and (to me at least) the design
appears to be so complex that it would require a large, top-of-line
(read "expensive") FPGA to fit all the functionality.

Yeah, the lower end isn't cheap, the upper end is absurd...

For FPGA's over $1k, almost makes more sense to ignore that they exist
(also this appears to be around the cutoff point for the free version of
Vivado as well; but one would have thought Xilinx would have already
gotten their money by someone having bought the FPGA?...).

Found a recent article that says Xilinx prices run from $8 to $100,
low-end Intel FPGAs start at $3, but the high-end Stratix models
go from $10,000 to $100,000.


From Niklas Holsti@21:1/5 to Scott Lurndal on Sat Dec 16 09:22:32 2023

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.


From Anton Ertl@21:1/5 to Niklas Holsti on Sat Dec 16 12:14:22 2023

Niklas Holsti <[email protected]d> writes:

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.

Especially given that 10Klines/s is probably around 500KB/s which has
to be read from disk and probably a similar amount that has to be
written to disk. What were the I/O throughputs available at the time?

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>


From Thomas Koenig@21:1/5 to Anton Ertl on Sat Dec 16 12:30:47 2023

Anton Ertl <[email protected]> schrieb:

Niklas Holsti <[email protected]d> writes:

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.

Especially given that 10Klines/s is probably around 500KB/s which has
to be read from disk and probably a similar amount that has to be
written to disk. What were the I/O throughputs available at the time?

It depends a bit how the Fortran and Cobol statements were stored.
If they were stored in punched card format, 80 characters per line,
then it would be 800000 characters per second read. Object code,
probably much less, but the total could still come to around
1 MB/s.

The IBM 3350 (introduced in 1975) is probably fairly representative
of the high end of that era; it had a data transfer rate of 1198
kB/second and a seek time of 25 milliseconds.

So, 10000 lines/s would almost definitely have been I/O bound at the
time.
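
The arithmetic above can be checked with a few lines of Python. This is
only a back-of-the-envelope sketch: the 80-byte card-image line and the
1198 kB/s transfer rate are the figures quoted above, 1 kB is taken as
1000 bytes, and the object-code output rate is a made-up placeholder
(all that is claimed above is "probably much less" than the source rate).

```python
# Back-of-the-envelope check: could a 1975-era disk feed 10,000 lines/s?
LINES_PER_SECOND = 10_000           # claimed compile rate
BYTES_PER_LINE = 80                 # card-image source format
DISK_BYTES_PER_SECOND = 1_198_000   # IBM 3350 rate, assuming 1 kB = 1000 B
OBJECT_BYTES_PER_SECOND = 200_000   # hypothetical object-code output rate

source_rate = LINES_PER_SECOND * BYTES_PER_LINE
total_rate = source_rate + OBJECT_BYTES_PER_SECOND
utilization = total_rate / DISK_BYTES_PER_SECOND

print(f"source read rate: {source_rate} B/s")
print(f"total I/O rate:   {total_rate} B/s")
print(f"share of disk bandwidth: {utilization:.0%} (seeks not counted)")
```

With these assumptions the transfer alone takes most of the drive's
bandwidth before a single 25 ms seek is counted.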


From Scott Lurndal@21:1/5 to Niklas Holsti on Sat Dec 16 15:11:03 2023

Niklas Holsti <[email protected]d> writes:

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.

Yes, lines per minute is the proper metric. Note that for many
years, the compilation rate was bounded by the speed of the card
reader (300 to 600 cards per minute).


From moi@21:1/5 to Niklas Holsti on Sat Dec 16 18:04:48 2023

On 16/12/2023 07:22, Niklas Holsti wrote:

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less
decimal and string) and did a pretty good job of spitting out high
performance code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.

Almost certainly per minute.
I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
It achieved 20K cards per minute and was considered to be very fast.
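
For scale, a small sketch of how that figure compares with the
per-second claim, assuming one card corresponds to one source line:

```python
# Convert the 1975 ICL 1900 figure to lines per second and compare it
# with the claimed 10,000 lines/s (assumes one card == one source line).
CARDS_PER_MINUTE = 20_000
CLAIMED_LINES_PER_SECOND = 10_000

measured_lines_per_second = CARDS_PER_MINUTE / 60
speedup_needed = CLAIMED_LINES_PER_SECOND / measured_lines_per_second

print(f"{measured_lines_per_second:.0f} lines/s measured")
print(f"the per-second claim would be about {speedup_needed:.0f}x faster")
```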

--
Bill F.


From EricP@21:1/5 to MitchAlsup on Sat Dec 16 14:25:19 2023

MitchAlsup wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

BGB wrote:

For FPGA's over $1k, almost makes more sense to ignore that they
exist (also this appears to be around the cutoff point for the
free version of Vivado as well; but one would have thought Xilinx
would have already gotten their money by someone having bought the
FPGA?...).

For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
cost is in the noise.

For a hobby? Well...

If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that compiled
at 10,000 lines of code per second for an IBM-like minicomputer
(less decimal and string) and did a pretty good job of spitting out
high performance code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code than languages like Modula, Pascal, C, et alia.

Though, within moderate limits, 1M lines would basically be enough
to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).

If there were an efficient way to run the device driver stack in
user-mode without privilege and only the MMI/O pages this driver can
touch mapped into his VAS. Poof none of the driver stack is in the
kernel. --IF--

That's actually quite common and one of the raisons d'etre of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user mode virtual address space) the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.

An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.

Why should device be able to access user VaS outside of the buffer
the user provided, OH so long ago ??

Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.

OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

The way you write you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!

You _could_ give device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write small area inside his VaS.

OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.

The OS can't remove the page RW access for a user mode page while an
IO device is DMA writing the page, if that's what you meant,
as the DMA-in may be writing to a smaller buffer within a larger page.
It is perfectly normal for a thread to continue to work in buffer
bytes adjacent to the one currently involved in an async IO.


From Thomas Koenig@21:1/5 to BGB on Sat Dec 16 22:56:10 2023

BGB <[email protected]> schrieb:

On 12/16/2023 12:04 PM, moi wrote:

On 16/12/2023 07:22, Niklas Holsti wrote:

On 2023-12-16 0:39, Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

[snip]

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less
decimal and string) and did a pretty good job of spitting out high
performance code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB).

Are you both sure that those numbers are really lines per *second*?
They seem improbably high, and compilation speeds in those years used
to be stated in lines per *minute*.

Almost certainly per minute.
I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
It achieved 20K cards per minute and was considered to be very fast.

Lines per minute seems to make sense.

Modern PC's are orders of magnitude faster, but still don't have
"instant" compile times by any means.

Could be faster though, but would likely need languages other than C or
(especially) C++.

I assume you never worked with Turbo Pascal.

That was amazing. It compiled code so fast that it was never a
bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
The first version I ever used, 3.0 (?) compiled from memory to
memory, so even slow I/O (to floppy disc, at the time) was not
an issue.

This was made possible by using a streamlined one-pass compiler. It
didn't do much optimization, but when the alternative was BASIC, the
generated code was still extremely fast by comparison.

There were a few drawbacks. The biggest one was that programming errors
tended to freeze the machine. Another (not so important) was that,
if you were one of the lucky people to have an 80x87 coprocessor, the
generated code did not check for overflow of the coprocessor stack.


From Quadibloc@21:1/5 to Stephen Fuld on Sat Dec 16 23:40:54 2023

On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld wrote:

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

I've checked the Mill web site, and Ivan Godard last posted to the forums
there just five days ago. So I can only assume that all is well, but
perhaps he has entered a phase of work on the Mill that is keeping him
busy. Which would seem to be good news.

John Savard


From Scott Lurndal@21:1/5 to BGB-Alt on Sun Dec 17 00:24:58 2023

BGB-Alt <[email protected]> writes:

On 12/16/2023 1:25 PM, EricP wrote:

MitchAlsup wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

BGB wrote:

For FPGA's over $1k, almost makes more sense to ignore that they
exist (also this appears to be around the cutoff point for the
free version of Vivado as well; but one would have thought Xilinx
would have already gotten their money by someone having bought
the FPGA?...).

For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
cost is in the noise.

For a hobby? Well...

If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that compiled
at 10,000 lines of code per second for an IBM-like minicomputer
(less decimal and string) and did a pretty good job of spitting
out high performance code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code than languages like Modula, Pascal, C, et alia.

Though, within moderate limits, 1M lines would basically be
enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).

If there were an efficient way to run the device driver stack in
user-mode without privilege and only the MMI/O pages this driver can
touch mapped into his VAS. Poof none of the driver stack is in the
kernel. --IF--

That's actually quite common and one of the raisons d'etre of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user mode virtual address space) the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.

An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.

Why should device be able to access user VaS outside of the buffer
the user provided, OH so long ago ??

Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.

OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

The way you write you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!

You _could_ give device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write small area inside his VaS.

OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.

The OS can't remove the page RW access for a user mode page while an
IO device is DMA writing the page, if that's what you meant,
as the DMA-in may be writing to a smaller buffer within a larger page.
It is perfectly normal for a thread to continue to work in buffer
bytes adjacent to the one currently involved in an async IO.

One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

https://www.dpdk.org/
https://opendataplane.org/

Are two very common use cases for usermode drivers.

Like, say, for a filesystem, it is presumably:
read syscall from user to OS;
route this to the corresponding VFS driver;
Requests spanning multiple blocks being broken up into parts;
VFS driver checks the block-cache / buffer-cache;
If found, copy from cache into user-space;
If not found, send request to the underlying block device;
Wait for response (and/or reschedule task for later);
Copy result back into userland.

No, it would be for the user mode application to access
disk/ssd/nvme blocks directly and impose whatever structure on those
blocks that it wishes. No OS intervention at all, DMA directly
into userspace instead of bouncing through kernel.

The NVMe controllers use a command ring, and when virtualized,
each VF provides a command ring directly to the user mode
application - the application can insert commands (read, write,
erase, etc) into the ring, write to the doorbell register
and wait for completion by polling or waiting for a virtio
interrupt.

Again the application is just reading blocks and interpreting
them any way it wishes (e.g. for a database application
which doesn't need a filesystem).
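
The ring-plus-doorbell flow described above can be modelled in a few
lines. This is a toy simulation only, not the NVMe queue format or any
real driver API: the class and field names (ToyQueuePair, Command,
ring_doorbell) are hypothetical, the "device" runs inline when the
doorbell is rung, and the "disk" is just a Python list of blocks.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    opcode: str        # "read" or "write"
    lba: int           # first logical block address
    blocks: list       # payload for writes; unused for reads
    count: int = 1     # number of blocks to read


class ToyQueuePair:
    """Toy submission/completion queue pair (hypothetical, not real NVMe)."""

    def __init__(self, depth, disk):
        self.depth = depth
        self.ring = [None] * depth   # submission ring shared with the device
        self.tail = 0                # next free slot, owned by the application
        self.head = 0                # next slot to consume, owned by the device
        self.completions = deque()   # completion queue, polled by the app
        self.disk = disk             # list of blocks standing in for the media

    def submit(self, cmd):
        """Application side: place a command in the ring (no device access)."""
        nxt = (self.tail + 1) % self.depth
        if nxt == self.head:
            raise BufferError("submission ring full")
        self.ring[self.tail] = cmd
        self.tail = nxt

    def ring_doorbell(self):
        """Stand-in for the MMIO doorbell write; the device runs inline here."""
        doorbell = self.tail
        while self.head != doorbell:
            cmd = self.ring[self.head]
            if cmd.opcode == "read":
                result = self.disk[cmd.lba:cmd.lba + cmd.count]
            elif cmd.opcode == "write":
                self.disk[cmd.lba:cmd.lba + len(cmd.blocks)] = cmd.blocks
                result = None
            else:
                raise ValueError(f"unknown opcode {cmd.opcode!r}")
            self.completions.append((cmd.opcode, cmd.lba, result))
            self.head = (self.head + 1) % self.depth

    def poll(self):
        """Application side: return one completion, or None if none is ready."""
        return self.completions.popleft() if self.completions else None


disk = [b"\0" * 4 for _ in range(8)]
qp = ToyQueuePair(depth=4, disk=disk)
qp.submit(Command("write", lba=2, blocks=[b"abcd", b"efgh"]))
qp.submit(Command("read", lba=2, blocks=[], count=2))
qp.ring_doorbell()
first = qp.poll()
second = qp.poll()
print(first, second)
```

The point of the sketch is that nothing between submit() and poll()
involves a system call; on real hardware the kernel's only role is the
initial mapping and (optionally) interrupt delivery.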


From Andreas Eder@21:1/5 to BGB on Sun Dec 17 12:23:52 2023

On Fr 15 Dez 2023 at 13:05, BGB <[email protected]> wrote:

Also, it would be nice to have a basically usable OS and core software
stack in under 1M lines.

Say, by not trying to be everything to everyone, and limiting how much
is allowed in the core OS (or is allowed within the build process for
the core OS).

Though, within moderate limits, 1M lines would basically be enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).
A (moderate sized) C compiler;
(but not GCC, which is also well over this size limit).
A shell+utils comparable to BusyBox;
Various core OS libraries and similar, etc.

For this, will assume an at least nominally POSIX like environment.

Programs that run on the OS would not be counted in the line-count budget.

Have you had a look at plan9 yet?

'Andreas


From EricP@21:1/5 to BGB-Alt on Sun Dec 17 13:12:20 2023

BGB-Alt wrote:

On 12/16/2023 1:25 PM, EricP wrote:

MitchAlsup wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

BGB wrote:

For FPGA's over $1k, almost makes more sense to ignore that they
exist (also this appears to be around the cutoff point for the
free version of Vivado as well; but one would have thought
Xilinx would have already gotten their money by someone having
bought the FPGA?...).

For anyone serious, a verif engineer can cost $500-1000/day.
The FPGA cost is in the noise.

For a hobby? Well...

If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that
compiled at 10,000 lines of code per second for an IBM-like
minicomputer (less decimal and string) and did a pretty good job
of spitting out high performance code; on a machine with a 150ns
cycle time.

As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code than languages like Modula, Pascal, C, et alia.

Though, within moderate limits, 1M lines would basically be
enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the
size limit).

If there were an efficient way to run the device driver stack in
user-mode without privilege and only the MMI/O pages this driver can
touch mapped into his VAS. Poof none of the driver stack is in the
kernel. --IF--

That's actually quite common and one of the raisons d'etre of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user mode virtual address space) the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.

An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.

Why should device be able to access user VaS outside of the buffer
the user provided, OH so long ago ??

Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.

OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

The way you write you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!

You _could_ give device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write small area inside his VaS.

OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.

The OS can't remove the page RW access for a user mode page while an
IO device is DMA writing the page, if that's what you meant,
as the DMA-in may be writing to a smaller buffer within a larger page.
It is perfectly normal for a thread to continue to work in buffer
bytes adjacent to the one currently involved in an async IO.

One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions, buffers must be file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).

There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO. A single virtual buffer becomes a list of
physical fragments, so a scatter-gather list becomes a list of lists
of physical byte buffer fragments, called a Memory Descriptor List (MDL)
in Windows.
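
As a rough illustration of "one virtual buffer becomes a list of
physical fragments", here is a minimal sketch. It is not the Windows
MDL layout: the 4 KiB page size, the dict used as a page table, and the
function name build_fragment_list are all assumptions made for the
example, and pinning is simply taken for granted.

```python
PAGE_SIZE = 4096  # assumed page size


def build_fragment_list(vaddr, length, page_table):
    """Return [(physical_address, byte_count), ...] covering the buffer.

    page_table maps virtual page number -> physical frame number and is
    assumed to describe pages that are already pinned.
    """
    fragments = []
    while length > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        chunk = min(length, PAGE_SIZE - offset)   # stay within this page
        paddr = page_table[vpn] * PAGE_SIZE + offset
        if fragments and fragments[-1][0] + fragments[-1][1] == paddr:
            # Physically contiguous with the previous fragment: extend it.
            start, size = fragments[-1]
            fragments[-1] = (start, size + chunk)
        else:
            fragments.append((paddr, chunk))
        vaddr += chunk
        length -= chunk
    return fragments


# Hypothetical mapping: virtual pages 5, 6, 7 -> physical frames 40, 17, 18.
page_table = {5: 40, 6: 17, 7: 18}
frags = build_fragment_list(5 * PAGE_SIZE + 100, 10_000, page_table)
print(frags)

# A scatter-gather request is then a list of such lists, one per buffer.
sg_request = [build_fragment_list(v, n, page_table)
              for v, n in [(5 * PAGE_SIZE, 512), (7 * PAGE_SIZE + 8, 64)]]
```

In this example a 10,000-byte buffer that is contiguous in virtual
space ends up as two physical fragments, because only frames 17 and 18
happen to be adjacent.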

And then SR-IOV adds virtual machines to the mix, where a guest OS
physical address becomes a hypervisor guest virtual address,
and not only are guest buffers in guest user space, but the guest OS
MDL's are themselves in hypervisor virtual space and require their own
hypervisor MDL's (lists of lists of lists of fragments).

Like, say, for a filesystem, it is presumably:
read syscall from user to OS;
route this to the corresponding VFS driver;
Requests spanning multiple blocks being broken up into parts;
VFS driver checks the block-cache / buffer-cache;
If found, copy from cache into user-space;
If not found, send request to the underlying block device;
Wait for response (and/or reschedule task for later);
Copy result back into userland.

Yes, pretty much (there is page management, quota management).
Except if I request a direct IO it DMA's direct to/from the user buffer,
if hardware supports that.

Though, it may make sense that if a request isn't available immediately,
and there is some sort of DMA mechanism, the OS could block the task and
then resume it once the data becomes available. For polling IO, doesn't
likely make much difference as the CPU is basically stuck in a busy loop
either way until the IO finishes.

Yes, that's DMA resource management. Basically each system has a certain
number of scatter-gather IO mappers, now implemented by the IOMMU page
table. Each IO queues a request for its mappers, and the DMA resource
manager doles out a set of IO mapping registers, which may be less than
you requested, in which case you break up your IO into multiple requests.
Then you program the scatter-gather map using info from the IO's MDL,
pass the mapped IO space addresses to the device, and Bob's your uncle.
When the IO completes, your driver tears down its IO map and releases
the mapping registers to the next waiting IO.
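
A minimal sketch of that allocate / map / release cycle, to show how an
IO gets split when fewer mapping registers are granted than requested.
Everything here is hypothetical (MapRegisterPool, run_io, one register
per page), and a real driver would queue and wait rather than raise
when the pool is empty.

```python
class MapRegisterPool:
    """Toy pool of IO mapping registers (one register maps one page)."""

    def __init__(self, total):
        self.free = total

    def allocate(self, wanted):
        """Grant up to `wanted` registers; the grant may be smaller."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def run_io(pool, pages_needed):
    """Carry out one IO, splitting it to fit each grant.

    Returns the size in pages of each partial transfer.
    """
    chunks = []
    remaining = pages_needed
    while remaining > 0:
        granted = pool.allocate(remaining)
        if granted == 0:
            # A real driver would wait in the pool's queue here.
            raise RuntimeError("no mapping registers free")
        # Here the driver would program the map from the fragment list,
        # hand the mapped addresses to the device, and wait for completion.
        chunks.append(granted)
        remaining -= granted
        pool.release(granted)   # tear down the map for the next waiter
    return chunks


pool = MapRegisterPool(total=16)
held_by_other_io = pool.allocate(10)   # another IO already holds 10
chunks = run_io(pool, 20)
print(chunks)
```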

Though, could make sense for hardware accelerating pixel-copying
operations for a GUI.

On Windows the GUI is managed completely differently.
I'm not familiar enough with the details to comment other than to say
it is executed as privileged subroutines by the calling thread but in
super mode, which allows it direct access to the calling virtual space.


From Scott Lurndal@21:1/5 to EricP on Sun Dec 17 19:24:09 2023

EricP <[email protected]> writes:

BGB-Alt wrote:

On 12/16/2023 1:25 PM, EricP wrote:

MitchAlsup wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

Scott Lurndal wrote:

One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions, buffers must be file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).

There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.

A single virtual buffer becomes a list of
physical fragments, so a scatter-gather list becomes a list of lists
of physical byte buffer fragments, called a Memory Descriptor List (MDL)
in Windows.

And then SR-IOV adds virtual machines to the mix,

Not necessarily just virtual machines - it's also used
to expose the virtual function to user mode code in
a bare metal (or virtualized) operating system.


From Terje Mathisen@21:1/5 to Thomas Koenig on Mon Dec 18 12:11:27 2023

Thomas Koenig wrote:

BGB <[email protected]> schrieb:

Modern PC's are orders of magnitude faster, but still don't have
"instant" compile times by any means.

Could be faster though, but would likely need languages other than C or
(especially) C++.

I assume you never worked with Turbo Pascal.

I was going to bring up TP but you beat me to it. :-)

That was amazing. It compiled code so fast that it was never a
bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
The first version I ever used, 3.0 (?) compiled from memory to
memory, so even slow I/O (to floppy disc, at the time) was not
an issue.

TP1.0 was an executable which in ~37KB managed to fit an IDE,
compiler/linker/loader/debugger and RTL, and if you abstained from
getting human readable error messages you could save about 1.5KB.

This was made possible by using a streamlined one-pass compiler. It
didn't do much optimization, but when the alternative was BASIC, the
generated code was still extremely fast by comparison.

That compiler had zero optimization, it was a pure pattern match->emit
code engine that would reload the same variable from RAM on every
statement, but as you said, still far faster than the alternatives.
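
To make the "pattern match->emit" idea concrete, here is a toy sketch of a one-pass code generator that handles each statement in isolation and so reloads every variable from memory each time. The statement grammar, the function names, and the 8086-style mnemonics are illustrative assumptions only. This is not Turbo Pascal's actual code generator or output.

```python
import re

# One pattern covers "dst := src" and "dst := lhs op rhs" (op is + or -).
ASSIGN = re.compile(r"^\s*(\w+)\s*:=\s*(\w+)\s*(?:([+-])\s*(\w+))?\s*;?\s*$")
OPCODES = {"+": "ADD", "-": "SUB"}


def operand(name):
    """Literals are emitted as immediates, variables as memory references."""
    return name if name.isdigit() else f"[{name}]"


def compile_statement(line):
    """Match one statement and emit its code with no knowledge of neighbours."""
    m = ASSIGN.match(line)
    if m is None:
        raise SyntaxError(f"unrecognised statement: {line!r}")
    dst, lhs, op, rhs = m.groups()
    code = [f"MOV AX, {operand(lhs)}"]      # always load from memory
    if op:
        code.append(f"{OPCODES[op]} AX, {operand(rhs)}")
    code.append(f"MOV [{dst}], AX")         # always store the result back
    return code


def compile_program(source):
    """Single pass over the source, one statement per line."""
    out = []
    for line in source.splitlines():
        if line.strip():
            out.extend(compile_statement(line))
    return out
```

Feeding it `a := b + c;` followed by `d := a + 1;` stores `a` and then immediately loads it again. That is the redundancy described above, and the price of never looking past the current statement.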

When speed was an actual issue I would switch to (inline) assembler,
even though that was initially just a way to embed machine code directly
so I had to assemble it in DEBUG.

There were a few drawbacks. The biggest one was that programming errors
tended to freeze the machine. Another (not so important) was that,
if you were one of the lucky people to have an 80x87 coprocessor, the
generated code did not check for overflow of the coprocessor stack.

The fp code generated by TP would never overflow the 87 stack afair,
since it would do single operations and pop the results at once?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"


From Scott Lurndal@21:1/5 to Paul A. Clayton on Mon Dec 18 15:43:52 2023

"Paul A. Clayton" <[email protected]> writes:

On 12/17/23 2:24 PM, Scott Lurndal wrote:

EricP <[email protected]> writes:

[snip zero-copy and scatter-gather I/O]

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.

Interesting. I had proposed some years ago that rather than
pinning a physical page for I/O a page be provided when needed
from a free list (including that the data could be cached/buffered
with a virtual address tag).

In most usage cases, the page being DMA'd from/to has other
unrelated data in it, rather than being fully dedicated to
a single buffer or set of buffers.

The PRI is more about making sure the OS makes the page present
before the DMA operation begins and ensuring that it won't go
away before the DMA operation ends.


From EricP@21:1/5 to Scott Lurndal on Mon Dec 18 13:59:34 2023

Scott Lurndal wrote:

EricP <[email protected]> writes:

BGB-Alt wrote:

On 12/16/2023 1:25 PM, EricP wrote:

MitchAlsup wrote:

Scott Lurndal wrote:

[email protected] (MitchAlsup) writes:

Scott Lurndal wrote:

One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions, buffers must be file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).

There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.

I don't know how one would make use of that on Windows as it completely
separates the IO off so that the OS can switch to a different process
address space while the DMA takes place. The data structures to support
paging might not be easily accessible which would introduce long latency
in the middle of a DMA - which is exactly why it doesn't do this.
(I don't think Linux allows paging inside the OS or drivers either.)

On Windows you can have paging while managing a device if you put
the driver code in either a privileged user or super mode thread,
and then you deal with any timing issues.
The old floppy driver worked this way - as an OS thread.
But that was a very slow device and used programmed IO not DMA.


From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Dec 19 14:28:10 2023

"Paul A. Clayton" <[email protected]> writes:

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned.

The I/O device simply pushes data to the physical address. It's
the responsibility of the operating software to ensure the
physical address given to the device (either via ATS where the
device hosts the "tlb" or via the IOMMU) is correct and legal.

If the IOMMU translation tables mark the page as absent, an error response
will be returned to the device. If ATS was used, and the
host didn't invalidate the translation at the host, the
device will DMA to the specified physical address regardless
of whether it is the correct page.
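
The two cases can be modelled with a toy simulation. It is only a sketch: the `Iommu` and `Device` classes, their method names, and the dict standing in for physical memory are all hypothetical and do not correspond to any real driver or PCIe API.

```python
class Iommu:
    """Host-side translation table: I/O virtual page -> physical frame."""

    def __init__(self):
        self.table = {}

    def map(self, page, frame):
        self.table[page] = frame

    def unmap(self, page):
        self.table.pop(page, None)

    def translate(self, page):
        return self.table.get(page)  # None models "page absent"


class Device:
    """Endpoint that can DMA via the IOMMU or via its own cached translations."""

    def __init__(self, iommu, memory):
        self.iommu = iommu
        self.memory = memory   # dict: physical frame -> data
        self.atc = {}          # device-side translation cache (the remote "TLB")

    def ats_request(self, page):
        """Ask the host for a translation and cache it in the device."""
        frame = self.iommu.translate(page)
        if frame is not None:
            self.atc[page] = frame
        return frame

    def ats_invalidate(self, page):
        """What the host must trigger before reusing the frame."""
        self.atc.pop(page, None)

    def dma_write_untranslated(self, page, data):
        """IOMMU path: an absent page gives an error response."""
        frame = self.iommu.translate(page)
        if frame is None:
            return "error"
        self.memory[frame] = data
        return "ok"

    def dma_write_translated(self, page, data):
        """ATS path: the device trusts its cache, stale or not."""
        self.memory[self.atc[page]] = data
        return "ok"
```

If the host unmaps a page without also invalidating the device's cached entry, the untranslated write fails cleanly while the translated write still lands in the old frame. That is why the invalidation step is the operating software's responsibility.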


From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Jan 2 23:39:36 2024

"Paul A. Clayton" <[email protected]> writes:

On 12/18/23 10:40 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned. I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error.

If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception, if it happens in the level 2 table
HyperVisor gets a device page fault exception.

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the I/O
device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)
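
A minimal sketch of that fault routing, assuming the "level 1" table is the one the guest OS manages and the "level 2" table is the hypervisor's. The function names, the dict-based tables, and the handler callbacks are hypothetical. Real nested translation also walks the level 1 tables through level 2, which is ignored here.

```python
def translate_device_access(page, level1, level2):
    """Two-level lookup for a device access.

    Returns ("ok", host_frame) or ("fault", owner), where owner names the
    supervisor that must service the fault.
    """
    guest_page = level1.get(page)
    if guest_page is None:
        return ("fault", "guest_os")      # miss in the guest-managed table
    host_frame = level2.get(guest_page)
    if host_frame is None:
        return ("fault", "hypervisor")    # miss in the hypervisor's table
    return ("ok", host_frame)


def access_with_recovery(page, level1, level2, handlers, max_attempts=4):
    """Model a device that can recover from page faults.

    Each fault is handed to the proper supervisor's handler ("does OS
    stuff"), after which the still-pending request is retried.
    """
    for _ in range(max_attempts):
        status, info = translate_device_access(page, level1, level2)
        if status == "ok":
            return info
        handlers[info](page, level1, level2)
    raise RuntimeError("device request could not be satisfied")
```

A device that cannot recover from faults corresponds to calling only `translate_device_access` and treating any `"fault"` result as an error.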

If the HV encounters a device that cannot handle a page fault for
a page that it decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).

If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former
handles the translations and the latter requests that a page
be "pinned" for a subsequent DMA operation.

The HV controls the IOMMU which provides both the ATS and PRI interfaces
to the device. So the HV can invalidate a translation held in the
device (for ATS) or refuse to pin a page (or unpin a page).

