Forum: >>> Magnum BBS <<<

Re: Whither the Mill?

From MitchAlsup@21:1/5 to BGB on Fri Dec 15 20:59:15 2023

BGB wrote:

On 12/15/2023 11:48 AM, George Neuner wrote:

On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
<[email protected]d> wrote:

When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.

But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.

Anyway, has all development stopped? Or is their "sweat equity" model
still going on?

Inquiring minds want to know.

There was a post, ostensibly from Ivan, in their web forum just a few
days ago. No news though - just an acknowledgement of another user's
post.

Last I heard, the next (current?) round of financing was - at least in
part - to be used for FPGA "proof of concept" implementations.

Problem is the Mill really is a SoC, and (to me at least) the design
appears to be so complex that it would require a large, top-of-line
(read "expensive") FPGA to fit all the functionality.

Yeah. the lower end isn't cheap, the upper end is absurd...

Look into the cost of making a mask-set at 7nm or at 3nm. Then we can
have a discussion on how high the number has to be to rate absurd.

For FPGA's over $1k, almost makes more sense to ignore that they exist
(also this appears to be around the cutoff point for the free version of Vivado as well; but one would have thought Xilinx would have already
gotten their money by someone having bought the FPGA?...).

Then there is their idea that everything - from VHDL to software build
toolchain to system software - be automatically generated from a
simple functional specification. Getting THAT right is likely proving
far more difficult than simply implementing a fixed design in an FPGA.

Yeah.

Long ago, I watched another project (FoNC, led by Alan Kay) that was
also trying to go this route. I think the idea was that they wanted to
try to find a way to describe the entire software stack (from OS to applications) in under 20k lines.

Was the language of choice APL-like ??

Practically, it seemed to mostly end up going nowhere best I can tell, a
lot of "design", nothing that someone could actually use.

Though, if one sets the limits a little higher, there is a lot one can do: One can at least, surely, make a usable compiler tool chain in under 1 million lines of code (at present, BGBCC weighs in at around 250 kLOC,
could be smaller; but, fitting a "basically functional" C compiler into
30k lines, or around the size of the Doom engine, seems a little harder).

Though, an intermediate option, would be trying to pull off a "semi
decent" compiler in under 100K lines.

If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

We now have compilers struggling to achieve 10,000 lines per second per CPU
with machines of 0.2ns cycle time -- 750× faster {times the number of CPUs
thrown at the problem.}
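
As a rough check on the figures quoted here, the ratio of the two cycle times can be worked out directly. This is only a back-of-the-envelope sketch: it compares clock periods and ignores IPC, memory latency, and language complexity. The function name is made up for illustration.

```python
def cycle_time_speedup(old_cycle_ns, new_cycle_ns):
    """Return how many times shorter the new clock period is than the old one."""
    return old_cycle_ns / new_cycle_ns


# Figures quoted in the post: 150 ns (1979 minicomputer) vs 0.2 ns (modern core).
ratio = cycle_time_speedup(150.0, 0.2)
print(f"clock period ratio: about {ratio:.0f}x")
```

With the quoted figures the clock period ratio comes out at roughly 750, before multiplying by the number of CPUs.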

Also, it would be nice to have a basically usable OS and core software
stack in under 1M lines.

There is no salable market for an OS that sheds features for compactness.

Say, by not trying to be everything to everyone, and limiting how much
is allowed in the core OS (or is allowed within the build process for
the core OS).

Though, within moderate limits, 1M lines would basically be enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).

If there were an efficient way to run the device driver stack in user-mode
without privilege and only the MMI/O pages this driver can touch mapped
into his VAS. Poof, none of the driver stack is in the kernel. --IF--

A (moderate sized) C compiler;
(but not GCC, which is also well over this size limit).

In 1990 C was a small language; in 2023 that statement is no longer true.
In 1990 the C compiler had 2 or 3 passes; in 2023 the LLVM compiler has
<what> 35 passes (some of them duplicates, as one pass converts into something
a future pass will convert into something some other pass can
optimize).
In 1990 your C compiler ran natively on your machine.
In 2023 your LLVM compiler compiles 6+ front end languages and compiles
to 20+ target ISAs and has to produce good code on all of them.

A shell+utils comparable to BusyBox;

Until someone prevents someone else from writing new shells, filters,
and utilities, there is no way to moderate the growth in Shell+utils.

Various core OS libraries and similar, etc.

For this, will assume an at least nominally POSIX like environment.

Programs that run on the OS would not be counted in the line-count budget.

How to deal with multi-platform portability would be more of an open question, as this sort of thing tends to be a big source of code
expansion (or, for an OS kernel, the matter of hardware drivers, ...).

But, as can be noted, pretty much any project that gains mainstream popularity seems to spiral out of control regarding code-size.

With 20TB disk drives, 32 GB main memory sizes, Fiber internet;
what is the reason for worrying about something you can do almost
nothing about?

YMMV,

Indeed.

George

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 15 22:39:37 2023

[email protected] (MitchAlsup) writes:

BGB wrote:

For FPGA's over $1k, almost makes more sense to ignore that they exist
(also this appears to be around the cutoff point for the free version of
Vivado as well; but one would have thought Xilinx would have already
gotten their money by someone having bought the FPGA?...).

For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
cost is in the noise.

For a hobby? Well...

If the compiler is kept smaller, it is faster to recompile from source.

In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.

As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code than languages like Modula, Pascal, C, et alia.

Though, within moderate limits, 1M lines would basically be enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).

If there were an efficient way to run the device driver stack in user-mode
without privilege and only the MMI/O pages this driver can touch mapped
into his VAS. Poof, none of the driver stack is in the kernel. --IF--

That's actually quite common and one of the raisons d'être of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user mode virtual address space) the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.

An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.

From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 15 23:02:13 2023

Scott Lurndal wrote:

An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.

Why should device be able to access user VaS outside of the buffer the
user provided, OH so long ago ??

From Scott Lurndal@21:1/5 to MitchAlsup on Sat Dec 16 00:04:09 2023

[email protected] (MitchAlsup) writes:

Why should device be able to access user VaS outside of the buffer the
user provided, OH so long ago ??

Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.

Think network controller fetching packets from userspace.

From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 16 18:57:36 2023

Scott Lurndal wrote:

Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.

OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

The way you write, you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!

You _could_ give device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write small area inside his VaS.

OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.

From Scott Lurndal@21:1/5 to MitchAlsup on Sat Dec 16 21:42:59 2023

[email protected] (MitchAlsup) writes:

OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?

It doesn't, necessarily. The IOMMU translation table is a
proper subset of the user's virtual address space. The
application tells the kernel which portions of the address
space are valid DMA regions for the device to access.
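
The idea that the device sees only the regions the application registered can be sketched with a toy model. This is not how any real IOMMU or kernel API is programmed; the class name, method names, the device label "nic0", and the 4 KiB page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size; real IOMMUs support several


class ToyIommu:
    """Toy model: each device may DMA only to pages registered for it."""

    def __init__(self):
        self._allowed = {}  # device name -> set of permitted virtual page numbers

    @staticmethod
    def _pages(vaddr, length):
        """Virtual page numbers touched by [vaddr, vaddr + length)."""
        if length <= 0:
            raise ValueError("length must be positive")
        return range(vaddr // PAGE_SIZE, (vaddr + length - 1) // PAGE_SIZE + 1)

    def map_region(self, device, vaddr, length):
        """Register a region of the application's address space for DMA."""
        self._allowed.setdefault(device, set()).update(self._pages(vaddr, length))

    def unmap_region(self, device, vaddr, length):
        """Withdraw a previously registered region."""
        self._allowed.get(device, set()).difference_update(self._pages(vaddr, length))

    def dma_allowed(self, device, vaddr, length):
        """True only if every page the access touches has been registered."""
        pages = self._allowed.get(device, set())
        return all(p in pages for p in self._pages(vaddr, length))


iommu = ToyIommu()
iommu.map_region("nic0", 0x10000, 8192)          # two pages of packet buffers
print(iommu.dma_allowed("nic0", 0x10000, 1500))  # inside the registered region
print(iommu.dma_allowed("nic0", 0x40000, 64))    # never registered
```

Anything outside the registered pages, including an access that merely straddles the end of the region, is refused in this model.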

From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 16 22:58:54 2023

Scott Lurndal wrote:

It doesn't, necessarily. The IOMMU translation table is a
proper subset of the user's virtual address space. The
application tells the kernel which portions of the address
space are valid DMA regions for the device to access.

Which is my point !! you only want the device to see that <small> subset
of the requesting application--not the whole address space. Done right
the device can still use the application virtual address, but the device
is not allowed to access stuff not associated with the request at hand
right now.

For example, you are a large entity and Chinese disk drives are way
less expensive than non-Chinese; so you buy some. Would you let those
disk drives access anything in some requestor's address space--no, you
would only allow that device to access the user supplied buffer and
whatever page rounding up that transpires.

Principle of least Privilege works in the I/O space too.
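
The "page rounding up" mentioned above can be made concrete. The sketch below computes the page-aligned window that has to be exposed to cover a user buffer, and how many bytes beyond the buffer come along with it. The function names and the 4 KiB page size are assumptions for illustration only.

```python
PAGE_SIZE = 4096  # assumed page size


def page_window(buf_addr, buf_len, page_size=PAGE_SIZE):
    """Return (start, end) of the page-aligned window covering the buffer.

    `end` is exclusive. The window is the smallest page-granular range
    that contains [buf_addr, buf_addr + buf_len).
    """
    if buf_len <= 0:
        raise ValueError("buffer length must be positive")
    start = buf_addr - (buf_addr % page_size)                # round down
    end = -(-(buf_addr + buf_len) // page_size) * page_size  # round up
    return start, end


def extra_exposure(buf_addr, buf_len, page_size=PAGE_SIZE):
    """Bytes the device can reach beyond the buffer because of page rounding."""
    start, end = page_window(buf_addr, buf_len, page_size)
    return (end - start) - buf_len


start, end = page_window(0x1F00, 512)
print(hex(start), hex(end), extra_exposure(0x1F00, 512))
```

A 512-byte buffer that straddles a page boundary exposes two whole pages in this model, which is the residual leak that least privilege at page granularity cannot remove.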

From MitchAlsup@21:1/5 to BGB-Alt on Sat Dec 16 23:06:21 2023

BGB-Alt wrote:

On 12/16/2023 1:25 PM, EricP wrote:

MitchAlsup wrote:

One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).

Like, say, for a filesystem, it is presumably:
read syscall from user to OS;
route this to the corresponding VFS driver;
Requests spanning multiple blocks being broken up into parts;
VFS driver checks the block-cache / buffer-cache;
If found, copy from cache into user-space;
If not found, send request to the underlying block device;
Wait for response (and/or reschedule task for later);
Copy result back into userland.

This is correct enough for a file system buffered by a disk cache.

Are ALL file systems buffered in a disk cache ??
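
For reference, the buffered read path listed above can be sketched as follows. This is a simplified model under stated assumptions, not any real VFS: the classes `BlockDevice` and `CachedFile` are hypothetical, the "device" is a dict of blocks, and the wait/reschedule step is omitted because the toy device answers immediately.

```python
class BlockDevice:
    """Stand-in for a real disk: blocks held in a dict, reads are counted."""

    def __init__(self, blocks, block_size=512):
        self.blocks = blocks
        self.block_size = block_size
        self.reads = 0

    def read_block(self, block_no):
        self.reads += 1
        # Blocks that were never written read back as zeros.
        return self.blocks.get(block_no, bytes(self.block_size))


class CachedFile:
    """Read path: split the request per block, check the cache, fall back to the device."""

    def __init__(self, device):
        self.device = device
        self.cache = {}  # block number -> block contents

    def read(self, offset, length):
        """Return `length` bytes starting at `offset`, as a copy for the caller."""
        bs = self.device.block_size
        out = bytearray()
        pos, end = offset, offset + length
        while pos < end:
            block_no, in_block = divmod(pos, bs)
            data = self.cache.get(block_no)
            if data is None:                       # cache miss: go to the device
                data = self.device.read_block(block_no)
                self.cache[block_no] = data
            take = min(bs - in_block, end - pos)
            out += data[in_block:in_block + take]  # copy into the user buffer
            pos += take
        return bytes(out)


disk = BlockDevice({0: b"A" * 512, 1: b"B" * 512})
f = CachedFile(disk)
print(f.read(510, 4), disk.reads)
```

A second read of the same range is served from the cache, so the device read count does not change. A file system that bypasses the cache would skip the lookup and copy straight from the device transfer.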

Though, it may make sense that if a request isn't available immediately,
and there is some sort of DMA mechanism, the OS could block the task and
then resume it once the data becomes available. For polling IO, doesn't likely make much difference as the CPU is basically stuck in a busy loop either way until the IO finishes.

Though, could make sense for hardware accelerating pixel-copying
operations for a GUI.

For GUI, there would be multiple stages of copying, say:
Copying from user buffer to window buffer;
Copying from window buffer to screen buffer;
Copying from screen buffer to VRAM.

For video playback or GL, there may be an additional stage of copying
from GL's buffer to a user's buffer, then from the user's buffer to the window buffer. Though, considering possibly adding a shortcut path where
GL and video codecs copy more directly into the window buffer (bypassing needing to pass the frame data through the userland program).

Could be also possible maybe to have GL render directly into the window buffer, which could be possible if they have the same format/resolution,
and the window buffer is physically mapped (say, for my current hardware rasterizer module).

If running a program full-screen, it is possible to copy more directly
from the user buffer into VRAM, saving some time here.

Some time could be saved here if one had hardware support for these
sorts of "copy pixel buffers around and convert between formats" tasks,
but to be useful, this would need to be able to work with virtual
memory, which adds some complexity (would either need to be CPU-like
and/or have a page-walker; neither is particularly cheap).

I have MM (memory to memory move:: memmove() if you will) that transmits
up to 1 page of data as if atomically (single "bus" transaction.)

Could maybe offload the task to the rasterizer module, but would need to
add a page-walker to the rasterizer... Though, trying to deal with some scenarios (such as the final conversion/copy to VRAM) would add a lot of extra complexity. For now, its framebuffer/zbuffer/textures need to be
in physically-mapped addresses (also with a 128-bit buffer alignment).

Though, cheaper could be to make use of the second CPU core, but then schedule things like pixel copy operations to it (maybe also things like vertex transform and similar for OpenGL). Currently, if enabled, the
second core hasn't seen a lot of use thus far in my case.

....

From MitchAlsup@21:1/5 to All on Sat Dec 16 23:01:44 2023

BGB-Alt wrote:

Why did you acquire an alt ?? Ego perhaps ??

From Scott Lurndal@21:1/5 to MitchAlsup on Sun Dec 17 00:17:36 2023

[email protected] (MitchAlsup) writes:

Which is my point !! you only want the device to see that <small> subset
of the requesting application--not the whole address space. Done right
the device can still use the application virtual address, but the device
is not allowed to access stuff not associated with the request at hand
right now.

I thought I made that clear from the start.

For example, you are a large entity and Chinese disk drives are way
less expensive than non-Chinese; so you buy some. Would you let those
disk drives access anything in some requestor's address space--no, you
would only allow that device to access the user supplied buffer and
whatever page rounding up that transpires.

So far as I know there are no Chinese disk drives that support
SR-IOV.

From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 18 17:39:01 2023

Paul A. Clayton wrote:

On 12/17/23 2:24 PM, Scott Lurndal wrote:

EricP <[email protected]> writes:

[snip zero-copy and scatter-gather I/O]

This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.

PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.
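
The "remote TLB" arrangement described above can be sketched as a small model: the endpoint keeps a translation cache, asks the host on a miss, and drops entries when the host invalidates them. The class and method names are hypothetical and page numbers stand in for addresses; the real ATS message formats are not modelled.

```python
class HostIommu:
    """Host side: owns the authoritative translation table."""

    def __init__(self, table):
        self.table = dict(table)  # virtual page -> physical page
        self.requests = 0         # translation requests received from endpoints

    def translate(self, vpage):
        self.requests += 1
        return self.table[vpage]  # KeyError models an untranslatable address


class Endpoint:
    """Device side: caches translations so later DMAs can use physical addresses."""

    def __init__(self, host):
        self.host = host
        self.atc = {}  # device-side translation cache

    def dma_address(self, vpage):
        if vpage not in self.atc:          # miss: ask the host once
            self.atc[vpage] = self.host.translate(vpage)
        return self.atc[vpage]

    def invalidate(self, vpage):
        """Called when the host changes or removes a mapping."""
        self.atc.pop(vpage, None)


host = HostIommu({7: 1007})
nic = Endpoint(host)
print(nic.dma_address(7), nic.dma_address(7), host.requests)
```

The second lookup is answered from the endpoint's cache, which is why the host has to send an invalidation before it can safely remap the page.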

Interesting. I had proposed some years ago that rather than
pinning a physical page for I/O a page be provided when needed
from a free list (including that the data could be cached/buffered
with a virtual address tag).

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

The Mill's backless memory is similar, deferring physical memory
allocation until cache eviction using a free list (that is
refilled by a thread that is activated at low water mark).
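
A loose illustration of the scheme as described in this post only: a page gets a physical frame when it is evicted from the cache, and the free list is topped up once it falls to a low-water mark. This does not claim to reflect the actual Mill design. All names are hypothetical, and the refill runs inline here where the post describes a separate thread.

```python
from collections import deque


class FrameFreeList:
    """Free list of physical frames, topped up when it reaches a low-water mark."""

    def __init__(self, frames, low_water, refill):
        self.free = deque(frames)
        self.low_water = low_water
        self.refill = refill  # hypothetical callback returning fresh frame numbers
        self.refills = 0

    def allocate(self):
        if not self.free:
            raise MemoryError("free list exhausted")
        frame = self.free.popleft()
        if len(self.free) <= self.low_water:
            # A real system would wake a refill thread; the sketch refills inline.
            self.free.extend(self.refill())
            self.refills += 1
        return frame


class BacklessCache:
    """Pages live in the cache with no physical frame until they are evicted."""

    def __init__(self, capacity, free_list):
        self.capacity = capacity
        self.free_list = free_list
        self.lines = {}    # virtual page -> data, oldest entry first
        self.backing = {}  # virtual page -> frame, assigned only at eviction
        self.memory = {}   # frame -> data written back

    def write(self, vpage, data):
        self.lines.pop(vpage, None)  # re-insert so the page becomes the newest
        self.lines[vpage] = data
        if len(self.lines) > self.capacity:
            victim = next(iter(self.lines))
            victim_data = self.lines.pop(victim)
            if victim not in self.backing:  # first eviction: allocate a frame now
                self.backing[victim] = self.free_list.allocate()
            self.memory[self.backing[victim]] = victim_data


frames = FrameFreeList([0, 1, 2], low_water=1, refill=lambda: [10, 11])
cache = BacklessCache(capacity=2, free_list=frames)
for page in ("a", "b", "c", "d"):
    cache.write(page, page.encode())
print(cache.backing, frames.refills)
```

Pages that are written and discarded before eviction never consume a frame in this model, which is the point of deferring allocation.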

From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 19 03:40:07 2023

Paul A. Clayton wrote:

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned. I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error.

If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception, if it happens in the level 2 table
HyperVisor gets a device page fault exception.

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the
I/O device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)
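
The fault routing described above can be sketched as a two-level table walk. This is a minimal model under stated assumptions: the tables are plain dicts keyed by page number, and the function name and return convention are made up for illustration.

```python
def translate_for_device(guest_vpage, level1, level2):
    """Walk two translation levels for a device access.

    level1 maps guest-virtual pages to guest-physical pages (Guest OS owned).
    level2 maps guest-physical pages to host-physical pages (HyperVisor owned).
    Returns ("ok", host_page) or ("fault", who_handles_it).
    """
    guest_ppage = level1.get(guest_vpage)
    if guest_ppage is None:
        return ("fault", "guest OS")    # miss in the level 1 table
    host_ppage = level2.get(guest_ppage)
    if host_ppage is None:
        return ("fault", "hypervisor")  # miss in the level 2 table
    return ("ok", host_ppage)


level1 = {0x10: 0x200, 0x11: 0x201}  # guest virtual -> guest physical
level2 = {0x200: 0x9000}             # guest physical -> host physical

print(translate_for_device(0x10, level1, level2))
print(translate_for_device(0x11, level1, level2))
print(translate_for_device(0x12, level1, level2))
```

The second case is the one raised in the question: the guest believes the page is pinned, but the level 2 entry is absent, so the fault goes to the hypervisor and the guest need not see it.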

I would also guess that some I/O operations
could be merely retried, but some might just be lost. For a
virtualized I/O device, it would seem that the OS would be
confused if a (virtual) physical page was reported as having an
access error but perhaps there would be some generic transaction
failed indicator with information about retrying.

(Even with a pool of free pages and significant virtually tagged
caching, a page freeing thread could be "outrun" by I/O requesting
new pages.

Let the "proper supervisor" sort it out. Keep HW out of the game.

This presents denial of service attack potential as
well as ordinary danger of resource starvation. [For short DMAs,
caching-only might be practical with a main memory page never
being allocated. This would require unpinning/binding the page
after the data was copied; the copy could be "free" since the data
would be transferred to a processor cache anyway.])

Managing/avoiding oversubscription of resources is probably a week
or more of an OS design course. I sometimes wish I could spend a
few hundred years in a time bubble studying some of these things.


From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 18:00:54 2024

Scott Lurndal wrote:

"Paul A. Clayton" writes:

On 12/18/23 10:40 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned? I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error.

If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception; if it happens in the level 2 table,
HyperVisor gets a device page fault exception.

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the I/O
device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)

If the HV encounters a device that cannot handle a page fault for
a page that it decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).

If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former

Or have a HostBridge that provides translation services to
virtualized devices....

handles the translations and the latter requests that a page
be "pinned" for a subsequent DMA operation.

The HV controls the IOMMU which provides both the ATS and PRI interfaces
to the device. So the HV can invalidate a translation held in the
device (for ATS) or refuse to pin a page (or unpin a page).
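
A minimal sketch of the control the hypervisor keeps in this arrangement: it may grant or refuse a device's page request, and it may unpin a page after first invalidating the translation the device has cached. `Hypervisor`, `Device`, the `max_pinned` limit, and the frame numbering are hypothetical names and policies chosen for illustration; real IOMMU interfaces differ.

```python
class Device:
    """Endpoint holding cached translations handed out by the host."""

    def __init__(self):
        self.cached = {}  # page -> frame

    def invalidate(self, page):
        self.cached.pop(page, None)


class Hypervisor:
    """Owns the IOMMU state; decides which pages are pinned for DMA."""

    def __init__(self, max_pinned):
        self.max_pinned = max_pinned  # assumed policy: a simple cap on pinned pages
        self.pinned = {}              # page -> frame
        self.next_frame = 0x1000

    def page_request(self, device, page):
        """Handle a page request; return the frame, or None if refused."""
        if page not in self.pinned:
            if len(self.pinned) >= self.max_pinned:
                return None           # hypervisor refuses to pin another page
            self.pinned[page] = self.next_frame
            self.next_frame += 1
        device.cached[page] = self.pinned[page]
        return self.pinned[page]

    def unpin(self, device, page):
        device.invalidate(page)       # drop the device's copy before releasing the page
        self.pinned.pop(page, None)


hv = Hypervisor(max_pinned=1)
dev = Device()
granted = hv.page_request(dev, 5)
refused = hv.page_request(dev, 6)  # over the cap, so the request is refused
hv.unpin(dev, 5)
granted_later = hv.page_request(dev, 6)
```

Invalidating before unpinning is the ordering that keeps the device from using a stale translation.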


From Scott Lurndal@21:1/5 to MitchAlsup on Wed Jan 3 18:12:13 2024

MitchAlsup writes:

Scott Lurndal wrote:

"Paul A. Clayton" writes:

On 12/18/23 10:40 PM, MitchAlsup wrote:

Paul A. Clayton wrote:

On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]

Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.

Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned? I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error.

If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception; if it happens in the level 2 table,
HyperVisor gets a device page fault exception.

If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the I/O
device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)

If the HV encounters a device that cannot handle a page fault for
a page that it decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).

If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former

Or have a HostBridge that provides translation services to
virtualized devices....

All of the major operating systems fully support PCIe ATS and PRI
standards.

Leveraging that makes your processor viable; using a custom host
bridge doesn't.


From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 21:07:50 2024

Scott Lurndal wrote:

MitchAlsup writes:

Scott Lurndal wrote:

If the HV is allowing direct access to the device, and allowing
the device to use physical addresses via cached translations,
then the device must support both PCIe ATS and PRI. The former

Or have a HostBridge that provides translation services to
virtualized devices....

All of the major operating systems fully support PCIe ATS and PRI
standards.

But there are existing devices which do not.

Leveraging that makes your processor viable; using a custom host
bridge doesn't.

For devices that do not support them, the MY 66000 I/O MMU makes up
the difference.

