• Re: Whither the Mill?

    From MitchAlsup@21:1/5 to BGB on Fri Dec 15 20:59:15 2023
    BGB wrote:

    On 12/15/2023 11:48 AM, George Neuner wrote:
    On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
    <[email protected]d> wrote:

    When we last heard from the merry band of Millers, they were looking for
    substantial funding from a VC or similar. I suppose that if they had
    gotten it, we would have heard, so I guess they haven't.

    But I think there are things they could do to move forward even without
    a large investment. For example, they could develop an FPGA based
    system, even if it required multiple FPGAs on a custom circuit board for
    not huge amounts of money. Whether this is worthwhile, I cannot say.

    Anyway, has all development stopped? Or is their "sweat equity" model
    still going on?

    Inquiring minds want to know.

    There was a post, ostensibly from Ivan, in their web forum just a few
    days ago. No news though - just an acknowledgement of another user's
    post.


    Last I heard, the next (current?) round of financing was - at least in
    part - to be used for FPGA "proof of concept" implementations.

    Problem is the Mill really is a SoC, and (to me at least) the design
    appears to be so complex that it would require a large, top-of-line
    (read "expensive") FPGA to fit all the functionality.


    Yeah. The lower end isn't cheap, the upper end is absurd...

    Look into the cost of making a mask-set at 7nm or at 3nm. Then we can
    have a discussion on how high the number has to be to rate absurd.

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).


    Then there is their idea that everything - from VHDL to software build
    toolchain to system software - be automatically generated from a
    simple functional specification. Getting THAT right is likely proving
    far more difficult than simply implementing a fixed design in an FPGA.


    Yeah.

    Long ago, I watched another project (FoNC, led by Alan Kay) that was
    also trying to go this route. I think the idea was that they wanted to
    try to find a way to describe the entire software stack (from OS to
    applications) in under 20k lines.

    Was the language of choice APL-like ??

    Practically, it seemed to mostly end up going nowhere best I can tell, a
    lot of "design", nothing that someone could actually use.



    Though, if one sets the limits a little higher, there is a lot one can do:
    One can at least, surely, make a usable compiler tool chain in under 1
    million lines of code (at present, BGBCC weighs in at around 250 kLOC,
    could be smaller; but, fitting a "basically functional" C compiler into
    30k lines, or around the size of the Doom engine, seems a little harder).

    Though, an intermediate option, would be trying to pull off a "semi
    decent" compiler in under 100K lines.



    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    We now have compilers struggling to achieve 10,000 lines per second per
    CPU with machines of 0.2ns cycle time -- 750× faster {times the number
    of CPUs thrown at the problem.}
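
A back-of-the-envelope check of the figures above, as a small Python sketch. The only inputs are the numbers quoted in the post (150 ns and 0.2 ns cycle times, 10,000 lines per second); the function name is made up for illustration, and the sketch assumes a single CPU and ignores changes in instructions per cycle.

```python
def cycles_per_line(cycle_time_ns: float, lines_per_second: float) -> float:
    """Clock cycles available per compiled source line on one CPU."""
    cycles_per_second = 1e9 / cycle_time_ns
    return cycles_per_second / lines_per_second


old = cycles_per_line(150.0, 10_000)  # 1979 minicomputer, 150 ns cycle
new = cycles_per_line(0.2, 10_000)    # 2023 core, 0.2 ns cycle
clock_ratio = 150.0 / 0.2             # how much faster the clock is

print(f"1979: about {old:.0f} cycles per line")
print(f"2023: about {new:.0f} cycles per line")
print(f"clock ratio: about {clock_ratio:.0f}x")
```

At the same 10,000 lines per second, the older compiler had roughly 667 cycles to spend on each line, while a modern one gets roughly 500,000; the clock ratio of 150 ns to 0.2 ns works out to 750.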

    Also, it would be nice to have a basically usable OS and core software
    stack in under 1M lines.

    There is no salable market for an OS that sheds features for compactness.

    Say, by not trying to be everything to everyone, and limiting how much
    is allowed in the core OS (or is allowed within the build process for
    the core OS).

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    A (moderate sized) C compiler;
    (but not GCC, which is also well over this size limit).

    In 1990 C was a small language. In 2023 that statement is no longer true.
    In 1990 the C compiler had 2 or 3 passes; in 2023 the LLVM compiler has
    <what> 35 passes (some of them duplicates, as one pass converts into
    something a future pass will convert into something some other pass can
    optimize.)
    In 1990 your C compiler ran natively on your machine.
    In 2023 your LLVM compiler compiles 6+ front end languages and compiles
    to 20+ target ISAs and has to produce good code on all of them.

    A shell+utils comparable to BusyBox;

    Until someone prevents someone else from writing new shells, filters,
    and utilities, there is no way to moderate the growth in Shell+utils.

    Various core OS libraries and similar, etc.

    For this, will assume an at least nominally POSIX like environment.


    Programs that run on the OS would not be counted in the line-count budget.

    How to deal with multi-platform portability would be more of an open
    question, as this sort of thing tends to be a big source of code
    expansion (or, for an OS kernel, the matter of hardware drivers, ...).

    But, as can be noted, pretty much any project that gains mainstream
    popularity seems to spiral out of control regarding code-size.

    With 20TB disk drives, 32 GB main memory sizes, Fiber internet;
    what is the reason for worrying about something you can do almost
    nothing about?


    YMMV,

    Indeed.

    George

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to MitchAlsup on Fri Dec 15 22:39:37 2023
    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

  • From MitchAlsup@21:1/5 to Scott Lurndal on Fri Dec 15 23:02:13 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??

  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Dec 16 00:04:09 2023
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    Think network controller fetching packets from userspace.

  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 16 18:57:36 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user asks the device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    The way you write you are assuming the device can write into the
    user's code space when he asks for a read from one of his buffers !?!

    You _could_ give device translations to anything and everything
    in user space, but this seems excessive when the user only wants
    the device to read/write small area inside his VaS.

    OS code already has to manipulate PTE entries or MMU tables so
    the device can write read-only and execute-only pages along with
    removing write-permission on a page with data inbound from a device.

    Think network controller fetching packets from userspace.

  • From Scott Lurndal@21:1/5 to MitchAlsup on Sat Dec 16 21:42:59 2023
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user asks the device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    It doesn't, necessarily. The IOMMU translation table is a
    proper subset of the user's virtual address space. The
    application tells the kernel which portions of the address
    space are valid DMA regions for the device to access.
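
A toy Python model of the idea described above: the device's translation table holds only the regions the application registered, so any DMA outside them faults. This is a sketch under stated assumptions, not any real IOMMU or kernel API. The class `ToyIommuDomain`, its methods, the 4 KiB page size, and the contiguous physical frames are all hypothetical.

```python
PAGE = 4096  # assumed page size


class ToyIommuDomain:
    """Per-device translation table: only explicitly mapped pages are visible."""

    def __init__(self):
        # device-visible virtual page number -> (physical frame, writable)
        self.pages = {}

    def map_region(self, vaddr, length, first_frame, writable):
        """Kernel-side step when the application registers a DMA region."""
        if vaddr % PAGE or length <= 0:
            raise ValueError("region must be page aligned and non-empty")
        npages = (length + PAGE - 1) // PAGE
        for i in range(npages):
            # assumption: frames are contiguous, purely to keep the toy short
            self.pages[vaddr // PAGE + i] = (first_frame + i, writable)

    def translate(self, vaddr, write):
        """Check performed on every DMA access issued by the device."""
        entry = self.pages.get(vaddr // PAGE)
        if entry is None:
            raise PermissionError(f"DMA fault: {vaddr:#x} not mapped for this device")
        frame, writable = entry
        if write and not writable:
            raise PermissionError(f"DMA fault: {vaddr:#x} is read-only for this device")
        return frame * PAGE + vaddr % PAGE


domain = ToyIommuDomain()
domain.map_region(0x10000, 8192, first_frame=0x200, writable=True)
inside = domain.translate(0x10010, write=True)   # within the registered buffer
try:
    domain.translate(0x12000, write=False)       # one page past the buffer
    outside_faulted = False
except PermissionError:
    outside_faulted = True
```

The device keeps using application virtual addresses, but the rest of the address space simply has no entry in its table.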

  • From MitchAlsup@21:1/5 to Scott Lurndal on Sat Dec 16 22:58:54 2023
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user asks the device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    It doesn't, necessarily. The IOMMU translation table is a
    proper subset of the user's virtual address space. The
    application tells the kernel which portions of the address
    space are valid DMA regions for the device to access.


    Which is my point !! you only want the device to see that <small> subset
    of the requesting application--not the whole address space. Done right
    the device can still use the application virtual address, but the device
    is not allowed to access stuff not associated with the request at hand
    right now.

    For example, you are a large entity and Chinese disk drives are way
    less expensive than non-Chinese; so you buy some. Would you let those
    disk drives access anything in some requestor's address space--no, you
    would only allow that device to access the user supplied buffer and
    whatever page rounding up that transpires.

    Principle of least Privilege works in the I/O space too.
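
To put a number on "whatever page rounding up that transpires", here is a short Python sketch that computes the page-aligned span a device has to be granted in order to reach a given buffer. The helper name `exposed_span`, the 4 KiB page size, and the example address are assumptions for illustration only.

```python
PAGE = 4096  # assumed page size


def exposed_span(buf_addr: int, buf_len: int, page: int = PAGE) -> tuple[int, int]:
    """Return (start, end) of the page-aligned range that covers the buffer."""
    if buf_len <= 0:
        raise ValueError("buffer must be non-empty")
    start = buf_addr - buf_addr % page                      # round down
    end = ((buf_addr + buf_len + page - 1) // page) * page  # round up
    return start, end


# A 1500-byte buffer that starts part-way into a page.
start, end = exposed_span(0x7F00_1234, 1500)
extra = (end - start) - 1500  # bytes visible to the device beyond the buffer

print(f"device sees {start:#x}..{end:#x}, {extra} bytes more than requested")
```

Even with a per-request mapping, translation works at page granularity, so the device can reach the rest of the page (or pages) that the buffer touches.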

  • From MitchAlsup@21:1/5 to BGB-Alt on Sat Dec 16 23:06:21 2023
    BGB-Alt wrote:

    On 12/16/2023 1:25 PM, EricP wrote:
    MitchAlsup wrote:

    One thing I don't get here is why there would be direct DMA between
    userland and the device (at least for filesystem and similar).

    Like, say, for a filesystem, it is presumably:
    read syscall from user to OS;
    route this to the corresponding VFS driver;
    Requests spanning multiple blocks being broken up into parts;
    VFS driver checks the block-cache / buffer-cache;
    If found, copy from cache into user-space;
    If not found, send request to the underlying block device;
    Wait for response (and/or reschedule task for later);
    Copy result back into userland.
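
The steps listed above can be sketched as a toy Python model. Everything here is hypothetical (`ToyBlockDevice`, `ToyVfs`, the 512-byte block size); it shows only the split-by-block, check-cache, fall-back-to-device flow and leaves out scheduling, blocking, and real syscalls.

```python
BLOCK = 512  # assumed block size


class ToyBlockDevice:
    def __init__(self, data: bytes):
        self.data = data
        self.reads = 0  # counts requests that actually reach the device

    def read_block(self, n: int) -> bytes:
        self.reads += 1
        return self.data[n * BLOCK:(n + 1) * BLOCK]


class ToyVfs:
    def __init__(self, dev: ToyBlockDevice):
        self.dev = dev
        self.cache = {}  # block number -> bytes (the block/buffer cache)

    def read(self, offset: int, length: int) -> bytes:
        """Split the request by block, check the cache, fall back to the device."""
        out = bytearray()
        end = offset + length
        while offset < end:
            n, skip = divmod(offset, BLOCK)
            block = self.cache.get(n)
            if block is None:                # cache miss: ask the block device
                block = self.dev.read_block(n)
                self.cache[n] = block
            take = min(BLOCK - skip, end - offset)
            chunk = block[skip:skip + take]  # the "copy into user-space" step
            if not chunk:                    # read past the end of the device
                break
            out += chunk
            offset += len(chunk)
        return bytes(out)


data = bytes(range(256)) * 8                 # 2048 bytes of sample content
dev = ToyBlockDevice(data)
vfs = ToyVfs(dev)
first = vfs.read(500, 100)                   # spans blocks 0 and 1
reads_after_first = dev.reads
second = vfs.read(500, 100)                  # served entirely from the cache
reads_after_second = dev.reads
```

The second call never reaches the device, which is the case where there is nothing for a device DMA into userland to do.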

    This is correct enough for a file system buffered by a disk cache.

    Are ALL file systems buffered in a disk cache ??

    Though, it may make sense that if a request isn't available immediately,
    and there is some sort of DMA mechanism, the OS could block the task and
    then resume it once the data becomes available. For polling IO, doesn't
    likely make much difference as the CPU is basically stuck in a busy loop
    either way until the IO finishes.


    Though, could make sense for hardware accelerating pixel-copying
    operations for a GUI.

    For GUI, there would be multiple stages of copying, say:
    Copying from user buffer to window buffer;
    Copying from window buffer to screen buffer;
    Copying from screen buffer to VRAM.

    For video playback or GL, there may be an additional stage of copying
    from GL's buffer to a user's buffer, then from the user's buffer to the
    window buffer. Though, considering possibly adding a shortcut path where
    GL and video codecs copy more directly into the window buffer (bypassing
    needing to pass the frame data through the userland program).

    Could be also possible maybe to have GL render directly into the window
    buffer, which could be possible if they have the same format/resolution,
    and the window buffer is physically mapped (say, for my current hardware
    rasterizer module).

    If running a program full-screen, it is possible to copy more directly
    from the user buffer into VRAM, saving some time here.

    Some time could be saved here if one had hardware support for these
    sorts of "copy pixel buffers around and convert between formats" tasks,
    but to be useful, this would need to be able to work with virtual
    memory, which adds some complexity (would either need to be CPU-like
    and/or have a page-walker; neither is particularly cheap).

    I have MM (memory to memory move:: memmove() if you will) that transmits
    up to 1 page of data as if atomically (single "bus" transaction.)

    Could maybe offload the task to the rasterizer module, but would need to
    add a page-walker to the rasterizer... Though, trying to deal with some
    scenarios (such as the final conversion/copy to VRAM) would add a lot of
    extra complexity. For now, its framebuffer/zbuffer/textures need to be
    in physically-mapped addresses (also with a 128-bit buffer alignment).


    Though, cheaper could be to make use of the second CPU core, but then
    schedule things like pixel copy operations to it (maybe also things like
    vertex transform and similar for OpenGL). Currently, if enabled, the
    second core hasn't seen a lot of use thus far in my case.

    ....

  • From MitchAlsup@21:1/5 to All on Sat Dec 16 23:01:44 2023
    BGB-Alt wrote:

    Why did you acquire an alt ?? Ego perhaps ??

  • From Scott Lurndal@21:1/5 to MitchAlsup on Sun Dec 17 00:17:36 2023
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).

    If there were an efficient way to run the device driver stack in user-mode
    without privilege and only the MMI/O pages this driver can touch mapped
    into his VAS. Poof none of the driver stack is in the kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer the
    user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user asks the device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    It doesn't, necessarily. The IOMMU translation table is a
    proper subset of the user's virtual address space. The
    application tells the kernel which portions of the address
    space are valid DMA regions for the device to access.


    Which is my point !! you only want the device to see that <small> subset
    of the requesting application--not the whole address space. Done right
    the device can still use the application virtual address, but the device
    is not allowed to access stuff not associated with the request at hand
    right now.

    I thought I made that clear from the start.


    For example, you are a large entity and Chinese disk drives are way
    less expensive than non-Chinese; so you buy some. Would you let those
    disk drives access anything in some requestor's address space--no, you
    would only allow that device to access the user supplied buffer and
    whatever page rounding up that transpires.

    So far as I know there are no Chinese disk drives that support
    SR-IOV.

  • From MitchAlsup@21:1/5 to Paul A. Clayton on Mon Dec 18 17:39:01 2023
    Paul A. Clayton wrote:

    On 12/17/23 2:24 PM, Scott Lurndal wrote:
    EricP <[email protected]> writes:
    [snip zero-copy and scatter-gather I/O]
    This all interacts with DMA and page management because the physical
    page frames that contain the bytes must be pinned in memory for the
    duration of the DMA IO.

    PCI express has an optional feature, PRI (Page Request Interface)
    that allows the hardware to request that a page be 'pinned' just
    for the duration of a DMA operation. The ARM64 server base system
    architecture document requires that the host support PRI. This
    works in conjunction with PCIe ATS (Address Translation Services)
    which allows the endpoint device to ask the host for translations
    and cache them in the endpoint so the endpoint can use physical
    addresses directly. This is usually implemented by the IOMMU
    on the host treating the endpoint as if it had a remote TLB cache.
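
    The "remote TLB" arrangement can be sketched as a toy model: the
    endpoint caches translations it obtained from the host, and the host
    can invalidate them. The class and method names are hypothetical, this
    is not the PCIe wire protocol, and PRI page requests are not modelled.

```python
class HostIOMMU:
    """Toy host-side translation table that tracks attached endpoints."""

    def __init__(self):
        self.table = {}      # virtual page number -> physical page number
        self.endpoints = []  # endpoints holding cached translations

    def map(self, vpage, ppage):
        self.table[vpage] = ppage

    def translate(self, vpage):
        """Answer a translation request; None means no valid mapping."""
        return self.table.get(vpage)

    def unmap(self, vpage):
        """Drop a mapping and invalidate every endpoint's cached copy."""
        self.table.pop(vpage, None)
        for ep in self.endpoints:
            ep.invalidate(vpage)


class Endpoint:
    """Toy device with a translation cache that acts as a remote TLB."""

    def __init__(self, iommu):
        self.iommu = iommu
        self.cache = {}
        self.requests = 0  # translation requests sent to the host
        iommu.endpoints.append(self)

    def lookup(self, vpage):
        if vpage in self.cache:
            return self.cache[vpage]  # hit: no round trip to the host
        self.requests += 1
        ppage = self.iommu.translate(vpage)
        if ppage is not None:
            self.cache[vpage] = ppage
        return ppage

    def invalidate(self, vpage):
        self.cache.pop(vpage, None)
```

    A second lookup of the same page is served from the endpoint's cache;
    after the host unmaps the page, the endpoint has to ask again and gets
    no translation.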

    Interesting. I had proposed some years ago that rather than
    pinning a physical page for I/O a page be provided when needed
    from a free list (including that the data could be cached/buffered
    with a virtual address tag).

    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    The Mill's backless memory is similar, deferring physical memory
    allocation until cache eviction using a free list (that is
    refilled by a thread that is activated at low water mark)
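
    A rough sketch of the deferred-allocation idea: no physical frame is
    assigned until data is evicted from cache, and a refill is requested
    once the free list drops to a low-water mark. Everything here (names,
    page granularity, a flag standing in for the refill thread) is an
    assumption for illustration, not a description of the actual Mill
    design.

```python
from collections import deque


class FramePool:
    """Toy free list of physical frames with a low-water mark."""

    def __init__(self, initial_frames, low_water):
        self.free = deque(initial_frames)
        self.low_water = low_water
        self.refill_requested = False  # stands in for waking a refill thread
        self.backing = {}              # virtual page -> frame, set at eviction

    def evict(self, vpage):
        """Data for vpage leaves the cache: allocate a frame only now."""
        if vpage not in self.backing:
            if not self.free:
                raise MemoryError("free list exhausted")
            self.backing[vpage] = self.free.popleft()
            if len(self.free) <= self.low_water:
                self.refill_requested = True
        return self.backing[vpage]

    def refill(self, frames):
        """What the refill thread would do once activated."""
        self.free.extend(frames)
        self.refill_requested = len(self.free) <= self.low_water
```

    Evicting the same page twice reuses its frame; only the first
    eviction consumes an entry from the free list.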



  • From MitchAlsup@21:1/5 to Paul A. Clayton on Tue Dec 19 03:40:07 2023
    Paul A. Clayton wrote:

    On 12/18/23 12:39 PM, MitchAlsup wrote:
    [snip page pinning for DMA]
    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    Out of curiosity, what happens when an I/O device tries to DMA to
    a page which the OS thinks is pinned. I would *guess* that a DMA
    operation that fails for an unvirtualized I/O device merely
    presents an error.

    If the page fault occurs in the level 1 table, Guest OS gets a
    device page fault exception, if it happens in the level 2 table
    HyperVisor gets a device page fault exception.

    If the device can recover from page faults, the proper supervisor
    "does OS stuff" and then signals the device to proceed with the
    still pending device request. The "does OS stuff" does for the
    I/O device pretty much what the proper supervisor does with a
    CPU page fault--with all the nuances and idiosyncrasies (or more.)
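
    The fault routing described above can be sketched with two
    dictionaries standing in for the level 1 (guest) and level 2
    (hypervisor) tables. The function name and return convention are
    assumptions for illustration only.

```python
def translate_device_access(level1, level2, guest_vpage):
    """Walk both translation levels for one device access.

    level1: guest virtual page  -> guest physical page (owned by Guest OS)
    level2: guest physical page -> host physical page  (owned by HyperVisor)

    Returns ("ok", host_page) or ("fault", who_handles_it).
    """
    guest_ppage = level1.get(guest_vpage)
    if guest_ppage is None:
        return ("fault", "guest_os")    # miss in the level 1 table
    host_page = level2.get(guest_ppage)
    if host_page is None:
        return ("fault", "hypervisor")  # miss in the level 2 table
    return ("ok", host_page)


level1 = {0x10: 0x200, 0x11: 0x201}
level2 = {0x200: 0x9000}  # guest page 0x201 is absent at the host level
```

    Here page 0x11 is mapped by the guest but absent at the host level, so
    the fault would go to the hypervisor, while an access to a page the
    guest never mapped goes to the guest OS.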

    I would also guess that some I/O operations
    could be merely retried, but some might just be lost. For a
    virtualized I/O device, it would seem that the OS would be
    confused if a (virtual) physical page was reported as having an
    access error but perhaps there would be some generic transaction
    failed indicator with information about retrying.

    (Even with a pool of free pages and significant virtually tagged
    caching, a page freeing thread could be "outrun" by I/O requesting
    new pages.

    Let the "proper supervisor" sort it out. Keep HW out of the game.

    This presents denial of service attack potential as
    well as ordinary danger of resource starvation. [For short DMAs,
    caching-only might be practical with a main memory page never
    being allocated. This would require unpinning/binding the page
    after the data was copied; the copy could be "free" since the data
    would be transferred to a processor cache anyway.])

    Managing/avoiding oversubscription of resources is probably a week
    or more of an OS design course. I sometimes wish I could spend a
    few hundred years in a time bubble studying some of these things.

  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 18:00:54 2024
    Scott Lurndal wrote:

    "Paul A. Clayton" <[email protected]> writes:
    On 12/18/23 10:40 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 12/18/23 12:39 PM, MitchAlsup wrote:
    [snip page pinning for DMA]
    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    Out of curiosity, what happens when an I/O device tries to DMA to
    a page which the OS thinks is pinned. I would *guess* that a DMA
    operation that fails for an unvirtualized I/O device merely
    presents an error.

    If the page fault occurs in the level 1 table, Guest OS gets a
    device page fault exception, if it happens in the level 2 table
    HyperVisor gets a device page fault exception.

    If the device can recover from page faults, the proper supervisor
    "does OS stuff" and then signals the device to proceed with the
    still pending device request. The "does OS stuff" does for the I/O
    device pretty much what the proper supervisor does with a
    CPU page fault--with all the nuances and idiosyncrasies (or more.)

    If the HV encounters a device that cannot handle a page fault for
    a page that it decided not to allocate but the OS did (knowing
    that that specific device could not handle page faults), what
    error status is sent to the OS? The HV cannot simply pass along a
    "page fault" error because the OS _knows_ that the page was
    allocated; that would break pure virtualization and potentially
    seriously confuse the OS if virtualization was not considered as a
    possibility (e.g., the OS might assume the device had either a
    transient or persistent error that caused the wrong error type to
    be returned, confirm it as persistent after the second encounter,
    and mark the device as broken).

    If the HV is allowing direct access to the device, and allowing
    the device to use physical addresses via cached translations,
    then the device must support both PCIe ATS and PRI. The former

    Or have a HostBridge that provides translation services to
    virtualized devices....

    handles the translations and the latter requests that a page
    be "pinned" for a subsequent DMA operation.

    The HV controls the IOMMU which provides both the ATS and PRI interfaces
    to the device. So the HV can invalidate a translation held in the
    device (for ATS) or refuse to pin a page (or unpin a page).

  • From Scott Lurndal@21:1/5 to MitchAlsup on Wed Jan 3 18:12:13 2024
    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    "Paul A. Clayton" <[email protected]> writes:
    On 12/18/23 10:40 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 12/18/23 12:39 PM, MitchAlsup wrote:
    [snip page pinning for DMA]
    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    Out of curiosity, what happens when an I/O device tries to DMA to
    a page which the OS thinks is pinned. I would *guess* that a DMA
    operation that fails for an unvirtualized I/O device merely
    presents an error.

    If the page fault occurs in the level 1 table, Guest OS gets a
    device page fault exception, if it happens in the level 2 table
    HyperVisor gets a device page fault exception.

    If the device can recover from page faults, the proper supervisor
    "does OS stuff" and then signals the device to proceed with the
    still pending device request. The "does OS stuff" does for the I/O
    device pretty much what the proper supervisor does with a
    CPU page fault--with all the nuances and idiosyncrasies (or more.)

    If the HV encounters a device that cannot handle a page fault for
    a page that it decided not to allocate but the OS did (knowing
    that that specific device could not handle page faults), what
    error status is sent to the OS? The HV cannot simply pass along a
    "page fault" error because the OS _knows_ that the page was
    allocated; that would break pure virtualization and potentially
    seriously confuse the OS if virtualization was not considered as a
    possibility (e.g., the OS might assume the device had either a
    transient or persistent error that caused the wrong error type to
    be returned, confirm it as persistent after the second encounter,
    and mark the device as broken).

    If the HV is allowing direct access to the device, and allowing
    the device to use physical addresses via cached translations,
    then the device must support both PCIe ATS and PRI. The former

    Or have a HostBridge that provides translation services to
    virtualized devices....

    All of the major operating systems fully support PCIe ATS and PRI
    standards.

    Leveraging that makes your processor viable; using a custom host
    bridge doesn't.

  • From MitchAlsup@21:1/5 to Scott Lurndal on Wed Jan 3 21:07:50 2024
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    If the HV is allowing direct access to the device, and allowing
    the device to use physical addresses via cached translations,
    then the device must support both PCIe ATS and PRI. The former

    Or have a HostBridge that provides translation services to
    virtualized devices....

    All of the major operating systems fully support PCIe ATS and PRI
    standards.

    But there are existing devices which do not.

    Leveraging that makes your processor viable; using a custom host
    bridge doesn't.

    For devices that do not support them, the MY 66000 I/O MMU makes up
    the difference.
