• Whither the Mill?

    From Stephen Fuld@21:1/5 to All on Wed Dec 13 08:25:39 2023
    When we last heard from the merry band of Millers, they were looking for
    substantial funding from a VC or similar. I suppose that if they had
    gotten it, we would have heard, so I guess they haven't.

    But I think there are things they could do to move forward even without
    a large investment. For example, they could develop an FPGA based
    system, even if it required multiple FPGAs on a custom circuit board for
    not huge amounts of money. Whether this is worthwhile, I cannot say.

    Anyway, has all development stopped? Or is their "sweat equity" model
    still going on?

    Inquiring minds want to know.


    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Stephen Fuld on Wed Dec 13 17:32:54 2023
    Stephen Fuld <[email protected]d> writes:
    When we last heard from the merry band of Millers, they were looking for
    substantial funding from a VC or similar. I suppose that if they had
    gotten it, we would have heard, so I guess they haven't.

    But I think there are things they could do to move forward even without
    a large investment. For example, they could develop an FPGA based
    system, even if it required multiple FPGAs on a custom circuit board for
    not huge amounts of money. Whether this is worthwhile, I cannot say.


    There might even be some way of renting time on a real
    emulator from cadence (Palladium) or synopsys (Zebu).

    Although in my experience those who have them use them
    24x7.

  • From George Neuner@21:1/5 to [email protected] on Fri Dec 15 12:48:00 2023
    On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
    <[email protected]d> wrote:

    When we last heard from the merry band of Millers, they were looking for
    substantial funding from a VC or similar. I suppose that if they had
    gotten it, we would have heard, so I guess they haven't.

    But I think there are things they could do to move forward even without
    a large investment. For example, they could develop an FPGA based
    system, even if it required multiple FPGAs on a custom circuit board for
    not huge amounts of money. Whether this is worthwhile, I cannot say.

    Anyway, has all development stopped? Or is their "sweat equity" model
    still going on?

    Inquiring minds want to know.

    There was a post, ostensibly from Ivan, in their web forum just a few
    days ago. No news though - just an acknowledgement of another user's
    post.


    Last I heard, the next (current?) round of financing was - at least in
    part - to be used for FPGA "proof of concept" implementations.

    Problem is the Mill really is a SoC, and (to me at least) the design
    appears to be so complex that it would require a large, top-of-line
    (read "expensive") FPGA to fit all the functionality.

    Then there is their idea that everything - from VHDL to software build
    toolchain to system software - be automatically generated from a
    simple functional specification. Getting THAT right is likely proving
    far more difficult than simply implementing a fixed design in an FPGA.

    YMMV,
    George

  • From EricP@21:1/5 to BGB on Fri Dec 15 16:17:41 2023
    BGB wrote:
    On 12/15/2023 11:48 AM, George Neuner wrote:
    On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
    <[email protected]d> wrote:

    When we last heard from the merry band of Millers, they were looking for
    substantial funding from a VC or similar. I suppose that if they had
    gotten it, we would have heard, so I guess they haven't.

    But I think there are things they could do to move forward even without
    a large investment. For example, they could develop an FPGA based
    system, even if it required multiple FPGAs on a custom circuit board for
    not huge amounts of money. Whether this is worthwhile, I cannot say.

    Anyway, has all development stopped? Or is their "sweat equity" model
    still going on?

    Inquiring minds want to know.

    There was a post, ostensibly from Ivan, in their web forum just a few
    days ago. No news though - just an acknowledgement of another user's
    post.


    Last I heard, the next (current?) round of financing was - at least in
    part - to be used for FPGA "proof of concept" implementations.

    Problem is the Mill really is a SoC, and (to me at least) the design
    appears to be so complex that it would require a large, top-of-line
    (read "expensive") FPGA to fit all the functionality.


    Yeah. the lower end isn't cheap, the upper end is absurd...

    For FPGA's over $1k, almost makes more sense to ignore that they exist
    (also this appears to be around the cutoff point for the free version of
    Vivado as well; but one would have thought Xilinx would have already
    gotten their money by someone having bought the FPGA?...).

    Found a recent article that says Xilinx prices run from $8 to $100,
    low-end Intel FPGAs start at $3, but the high-end Stratix models
    go from $10,000 to $100,000.

  • From Niklas Holsti@21:1/5 to Scott Lurndal on Sat Dec 16 09:22:32 2023
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

    [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*? They
    seem improbably high, and compilation speeds in those years used to be
    stated in lines per *minute*.

  • From Anton Ertl@21:1/5 to Niklas Holsti on Sat Dec 16 12:14:22 2023
    Niklas Holsti <[email protected]d> writes:
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

    [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*? They
    seem improbably high, and compilation speeds in those years used to be
    stated in lines per *minute*.

    Especially given that 10Klines/s is probably around 500KB/s which has
    to be read from disk and probably a similar amount that has to be
    written to disk. What were the I/O throughputs available at the time?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Thomas Koenig@21:1/5 to Anton Ertl on Sat Dec 16 12:30:47 2023
    Anton Ertl <[email protected]> schrieb:
    Niklas Holsti <[email protected]d> writes:
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

    [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*? They
    seem improbably high, and compilation speeds in those years used to be
    stated in lines per *minute*.

    Especially given that 10Klines/s is probably around 500KB/s which has
    to be read from disk and probably a similar amount that has to be
    written to disk. What were the I/O throughputs available at the time?

    It depends a bit how the Fortran and Cobol statements were stored.
    If they were stored in punched card format, 80 characters per line,
    then it would be 800000 characters per second read. Object code,
    probably much less, but the total could still come to around
    1 MB/s.

    The IBM 3350 (introduced in 1975) is probably fairly representative
    of the high end of that era, it had a data transfer speed of 1198
    kB/second, and a seek time of 25 milliseconds.

    So, 10000 lines/s would almost definitely have been I/O bound at the
    time.
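
A rough sketch of that estimate in Python. The 80-byte card-image line and the 1198 kB/s transfer rate of the 3350 are the figures given above; the 200 kB/s object-code rate is purely an assumption, picked to land near the "around 1 MB/s" total, and seek time is ignored entirely.

```python
# Back-of-the-envelope I/O budget for a compiler running at 10,000 lines/s.
LINES_PER_SECOND = 10_000
BYTES_PER_LINE = 80                 # card-image source records, as in the post
DISK_RATE_BYTES = 1_198_000         # IBM 3350 transfer rate quoted above
OBJECT_BYTES_PER_SECOND = 200_000   # assumption: illustrative object-code output


def io_budget(lines_per_second, bytes_per_line, object_rate, disk_rate):
    """Return (source bytes/s, total bytes/s, fraction of disk bandwidth used)."""
    source_rate = lines_per_second * bytes_per_line
    total_rate = source_rate + object_rate
    return source_rate, total_rate, total_rate / disk_rate


source_rate, total_rate, utilisation = io_budget(
    LINES_PER_SECOND, BYTES_PER_LINE, OBJECT_BYTES_PER_SECOND, DISK_RATE_BYTES
)
print(f"source read rate : {source_rate} bytes/s")
print(f"total I/O rate   : {total_rate} bytes/s")
print(f"disk utilisation : {utilisation:.0%} (before any seeks)")
```

Under these assumptions the transfer alone would use roughly 83% of a single drive's bandwidth, so a few 25 ms seeks per second would be enough to make the disk the bottleneck.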

  • From Scott Lurndal@21:1/5 to Niklas Holsti on Sat Dec 16 15:11:03 2023
    Niklas Holsti <[email protected]d> writes:
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

    [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less decimal
    and string) and did a pretty good job of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*? They
    seem improbably high, and compilation speeds in those years used to be
    stated in lines per *minute*.

    Yes, lines per minute is the proper metric. Note that for many
    years, the compilation rate was bounded by the speed of the card
    reader (300 to 600 cards per minute).

  • From moi@21:1/5 to Niklas Holsti on Sat Dec 16 18:04:48 2023
    On 16/12/2023 07:22, Niklas Holsti wrote:
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

       [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less
    decimal and string) and did a pretty good job of spitting out high
    performance code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*? They
    seem improbably high, and compilation speeds in those years used to be
    stated in lines per *minute*.


    Almost certainly per minute.
    I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
    It achieved 20K cards per minute and was considered to be very fast.

    --
    Bill F.

  • From EricP@21:1/5 to MitchAlsup on Sat Dec 16 14:25:19 2023
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they
    exist (also this appears to be around the cutoff point for the
    free version of Vivado as well; but one would have thought Xilinx
    would have already gotten their money by someone having bought the
    FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from
    source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled
    at 10,000 lines of code per second for an IBM-like minicomputer
    (less decimal and string) and did a pretty good job of spitting out
    high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be enough
    to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size
    limit).

    If there were an efficient way to run the device driver stack in
    user-mode without privilege and only the MMI/O pages this driver can
    touch mapped into his VAS. Poof none of the driver stack is in the
    kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer
    the user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user ask device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    The way you write you are assuming the device can write into the
    user's code space when he ask for a read from one of his buffers !?!

    You _could_ give device translations to anything and everything
    in user space, but this seems excessive when the user only wants
    the device to read/write small area inside his VaS.

    OS code already has to manipulate PTE entries or MMU tables so
    the device can write read-only and execute-only pages along with
    removing write-permission on a page with data inbound from a device.

    The OS can't remove the page RW access for a user mode page while an
    IO device is DMA writing the page, if that's what you meant,
    as the DMA-in may be writing to a smaller buffer within a larger page.
    It is perfectly normal for a thread to continue to work in buffer
    bytes adjacent to the one currently involved in an async IO.

  • From Thomas Koenig@21:1/5 to BGB on Sat Dec 16 22:56:10 2023
    BGB <[email protected]> schrieb:
    On 12/16/2023 12:04 PM, moi wrote:
    On 16/12/2023 07:22, Niklas Holsti wrote:
    On 2023-12-16 0:39, Scott Lurndal wrote:
    [email protected] (MitchAlsup) writes:

        [snip]

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
    10,000 lines of code per second for an IBM-like minicomputer (less
    decimal and string) and did a pretty good job of spitting out high
    performance code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB).


    Are you both sure that those numbers are really lines per *second*?
    They seem improbably high, and compilation speeds in those years used
    to be stated in lines per *minute*.


    Almost certainly per minute.
    I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
    It achieved 20K cards per minute and was considered to be very fast.


    Lines per minute seems to make sense.


    Modern PC's are orders of magnitude faster, but still don't have
    "instant" compile times by any means.

    Could be faster though, but would likely need languages other than C or
    (especially) C++.

    I assume you never worked with Turbo Pascal.

    That was amazing. It compiled code so fast that it was never a
    bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
    The first version I ever used, 3.0 (?) compiled from memory to
    memory, so even slow I/O (to floppy disc, at the time) was not
    an issue.

    This was made possible by using a streamlined one-pass compiler. It
    didn't do much optimization, but when the alternative was BASIC, the
    generated code was still extremely fast by comparison.

    There were a few drawbacks. The biggest one was that programming errors
    tended to freeze the machine. Another (not so important) was that,
    if you were one of the lucky people to have an 80x87 coprocessor, the
    generated code did not check for overflow of the coprocessor stack.

  • From Quadibloc@21:1/5 to Stephen Fuld on Sat Dec 16 23:40:54 2023
    On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld wrote:

    Anyway, has all development stopped? Or is their "sweat equity" model
    still going on?

    I've checked the Mill web site, and Ivan Godard last posted to the forums
    there just five days ago. So I can only assume that all is well, but
    perhaps he has entered a phase of work on the Mill that is keeping him
    busy. Which would seem to be good news.

    John Savard

  • From Scott Lurndal@21:1/5 to BGB-Alt on Sun Dec 17 00:24:58 2023
    BGB-Alt <[email protected]> writes:
    On 12/16/2023 1:25 PM, EricP wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they
    exist (also this appears to be around the cutoff point for the
    free version of Vivado as well; but one would have thought Xilinx
    would have already gotten their money by someone having bought
    the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from
    source.

    In 1979 I joined a company with a FORTRAN mostly-77- that compiled
    at 10,000 lines of code per second for an IBM-like minicomputer
    (less decimal and string) and did a pretty good job of spitting
    out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be
    enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size
    limit).

    If there were an efficient way to run the device driver stack in
    user-mode without privilege and only the MMI/O pages this driver can
    touch mapped into his VAS. Poof none of the driver stack is in the
    kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature.    When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware.    Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer
    the user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space.   Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user ask device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    The way you write you are assuming the device can write into the
    user's code space when he ask for a read from one of his buffers !?!

    You _could_ give device translations to anything and everything
    in user space, but this seems excessive when the user only wants
    the device to read/write small area inside his VaS.

    OS code already has to manipulate PTE entries or MMU tables so
    the device can write read-only and execute-only pages along with
    removing write-permission on a page with data inbound from a device.

    The OS can't remove the page RW access for a user mode page while an
    IO device is DMA writing the page, if that's what you meant,
    as the DMA-in may be writing to a smaller buffer within a larger page.
    It is perfectly normal for a thread to continue to work in buffer
    bytes adjacent to the one currently involved in an async IO.


    One thing I don't get here is why there would be direct DMA between
    userland and the device (at least for filesystem and similar).

    https://www.dpdk.org/
    https://opendataplane.org/

    Are two very common use cases for usermode drivers.


    Like, say, for a filesystem, it is presumably:
    read syscall from user to OS;
    route this to the corresponding VFS driver;
    Requests spanning multiple blocks being broken up into parts;
    VFS driver checks the block-cache / buffer-cache;
    If found, copy from cache into user-space;
    If not found, send request to the underlying block device;
    Wait for response (and/or reschedule task for later);
    Copy result back into userland.

    No, it would be for the user mode application to access
    disk/ssd/nvme blocks directly and impose whatever structure on those
    blocks that it wishes. No OS intervention at all, DMA directly
    into userspace instead of bouncing through kernel.

    The NVMe controllers use a command ring, and when virtualized,
    each VF provides a command ring directly to the user mode
    application - the application can insert commands (read, write,
    erase, etc) into the ring, write to the doorbell register
    and wait for completion by polling or waiting for a virtio
    interrupt.

    Again the application is just reading blocks and interpreting
    them any way it wishes (e.g. for a database application
    which doesn't need a filesystem).
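
A toy Python model of the ring-and-doorbell pattern described above, as a sketch only. Nothing here reflects real NVMe data structures: `Command`, `ToyQueuePair`, the string opcodes and the synchronous "device" are all hypothetical names invented for illustration, and a real doorbell is an MMIO register write rather than a method call.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Command:
    opcode: str          # "read" or "write" (made-up names, not NVMe opcodes)
    lba: int             # block number the command applies to
    data: bytes = b""    # payload for writes


class ToyQueuePair:
    """Toy submission ring, doorbell and completion queue over a list of blocks."""

    def __init__(self, depth, num_blocks, block_size=512):
        self.depth = depth
        self.ring = [None] * depth
        self.tail = 0    # next slot the application fills
        self.head = 0    # next slot the simulated device consumes
        self.block_size = block_size
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        self.completions = deque()

    def submit(self, cmd):
        """Application side: place a command in the ring without notifying the device."""
        next_tail = (self.tail + 1) % self.depth
        if next_tail == self.head:
            raise BufferError("submission ring full")
        self.ring[self.tail] = cmd
        self.tail = next_tail

    def ring_doorbell(self):
        """Stand-in for the doorbell write: the simulated device drains the ring."""
        while self.head != self.tail:
            cmd = self.ring[self.head]
            self.head = (self.head + 1) % self.depth
            if cmd.opcode == "write":
                padded = cmd.data.ljust(self.block_size, b"\0")[: self.block_size]
                self.blocks[cmd.lba] = padded
                self.completions.append((cmd, b""))
            elif cmd.opcode == "read":
                self.completions.append((cmd, self.blocks[cmd.lba]))
            else:
                raise ValueError(f"unknown opcode {cmd.opcode!r}")

    def poll(self):
        """Application side: return one completion, or None if nothing has finished."""
        return self.completions.popleft() if self.completions else None


qp = ToyQueuePair(depth=4, num_blocks=8)
qp.submit(Command("write", lba=3, data=b"db page"))
qp.submit(Command("read", lba=3))
qp.ring_doorbell()
qp.poll()                       # completion for the write
_, payload = qp.poll()          # completion for the read
print(payload[:7])
```

The application decides what the blocks mean; the model has no notion of files or directories, which is the point being made for the database case.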

  • From Andreas Eder@21:1/5 to BGB on Sun Dec 17 12:23:52 2023
    On Fr 15 Dez 2023 at 13:05, BGB <[email protected]> wrote:

    Also, it would be nice to have a basically usable OS and core software
    stack in under 1M lines.

    Say, by not trying to be everything to everyone, and limiting how much
    is allowed in the core OS (or is allowed within the build process for
    the core OS).

    Though, within moderate limits, 1M lines would basically be enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the size limit).
    A (moderate sized) C compiler;
    (but not GCC, which is also well over this size limit).
    A shell+utils comparable to BusyBox;
    Various core OS libraries and similar, etc.

    For this, will assume an at least nominally POSIX like environment.

    Programs that run on the OS would not be counted in the line-count budget.

    Have you had a look at plan9 yet?

    'Andreas

  • From EricP@21:1/5 to BGB-Alt on Sun Dec 17 13:12:20 2023
    BGB-Alt wrote:
    On 12/16/2023 1:25 PM, EricP wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    BGB wrote:

    For FPGA's over $1k, almost makes more sense to ignore that they
    exist (also this appears to be around the cutoff point for the
    free version of Vivado as well; but one would have thought
    Xilinx would have already gotten their money by someone having
    bought the FPGA?...).

    For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
    cost is in the noise.

    For a hobby? Well...


    If the compiler is kept smaller, it is faster to recompile from
    source.

    In 1979 I joined a company with a FORTRAN mostly-77- that
    compiled at 10,000 lines of code per second for an IBM-like
    minicomputer (less decimal and string) and did a pretty good job
    of spitting out high performance
    code; on a machine with a 150ns cycle time.

    As did our COBOL compiler (which ran in 50KB). But in both cases,
    the languages were far simpler and much easier to generate efficient
    code than languages like Modula, Pascal, C, et alia.

    Though, within moderate limits, 1M lines would basically be
    enough to fit:
    A basic kernel;
    (this excludes the Linux kernel, which is well over the
    size limit).

    If there were an efficient way to run the device driver stack in
    user-mode without privilege and only the MMI/O pages this driver can
    touch mapped into his VAS. Poof none of the driver stack is in the
    kernel. --IF--

    That's actually quite common and one of the raison d'etre of the
    PCI Express SR-IOV feature. When you can present a virtual
    function to the user directly (mapping the MMIO region into
    the user mode virtual address space) the app had direct access
    to the hardware. Interrupts are the only tricky part, and
    the kernel virtio subsystem, which interfaces with the user
    application via shared memory provides interrupt handling
    to the application.

    An I/OMMU provides memory protection for DMA operations initiated
    by the virtual function ensuring it only accesses the application
    virtual address space.

    Why should device be able to access user VaS outside of the buffer
    the user provided, OH so long ago ??


    Because the device wants to do DMA directly into or from the users
    virtual address space. Bulk transfer, not MMIO accesses.

    OK, I will ask the question in the contrapositive way::
    If the user ask device to read into a buffer, why does the device get
    to see everything of the user's space along with that buffer ?

    The way you write you are assuming the device can write into the
    user's code space when he ask for a read from one of his buffers !?!

    You _could_ give device translations to anything and everything
    in user space, but this seems excessive when the user only wants
    the device to read/write small area inside his VaS.

    OS code already has to manipulate PTE entries or MMU tables so
    the device can write read-only and execute-only pages along with
    removing write-permission on a page with data inbound from a device.

    The OS can't remove the page RW access for a user mode page while an
    IO device is DMA writing the page, if that's what you meant,
    as the DMA-in may be writing to a smaller buffer within a larger page.
    It is perfectly normal for a thread to continue to work in buffer
    bytes adjacent to the one currently involved in an async IO.


    One thing I don't get here is why there would be direct DMA between
    userland and the device (at least for filesystem and similar).

    Zero-copy IO. That has always been available on WinNT provided hardware
    supports it. General byte-buffer IO could always do zero-copy DMA,
    with HW support. For files one can do IO direct to a user buffer with
    certain restrictions, buffers must be file block size and alignment.
    I haven't checked but guessing that if the file block is already in file
    cache it gets copied, otherwise it DMA's directly to/from the user buffer.
    Normally one wants cached file blocks but there are times when one doesn't
    and wants the more optimal direct buffer IO (eg, a video player).

    There is also scatter-gather IO, intended for network cards,
    where the IO is a list of byte sized and aligned virtual buffers.

    This all interacts with DMA and page management because the physical
    page frames that contain the bytes must be pinned in memory for the
    duration of the DMA IO. A single virtual buffer becomes a list of
    physical fragments, so a scatter-gather list becomes a list of lists
    of physical byte buffer fragments, called a Memory Descriptor List (MDL)
    in Windows.
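
A minimal Python sketch of how one virtual buffer turns into a list of physical fragments, and a scatter-gather list into a list of such lists. It is not the Windows MDL layout: the 4 KiB page size, the dict used as a page table and the function names are all assumptions made for illustration, and pinning is reduced to "the page must be present in the dict".

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def build_fragment_list(vaddr, length, page_table, page_size=PAGE_SIZE):
    """Split [vaddr, vaddr + length) into (physical_address, byte_count) pieces.

    page_table maps virtual page number -> physical frame number. A missing
    entry raises KeyError, standing in for "page not resident / not pinned".
    """
    fragments = []
    remaining = length
    while remaining > 0:
        vpn, offset = divmod(vaddr, page_size)
        chunk = min(remaining, page_size - offset)   # stop at the page boundary
        frame = page_table[vpn]
        fragments.append((frame * page_size + offset, chunk))
        vaddr += chunk
        remaining -= chunk
    return fragments


def build_scatter_gather(buffers, page_table):
    """One fragment list per virtual buffer: a list of lists."""
    return [build_fragment_list(v, n, page_table) for v, n in buffers]


# Virtual pages 0, 1, 2 live in unrelated physical frames 7, 3, 4.
page_table = {0: 7, 1: 3, 2: 4}
print(build_fragment_list(4000, 5000, page_table))
print(build_scatter_gather([(100, 50), (4090, 20)], page_table))
```

A 5000-byte buffer starting 96 bytes before a page boundary touches three pages, so it needs three fragments even though it looks contiguous to the application.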

    And then SR-IOV adds virtual machines to the mix, where a guest OS
    physical address becomes a hypervisor guest virtual address,
    and not only are guest buffers in guest user space, but the guest OS
    MDL's are themselves in hypervisor virtual space and require their own
    hypervisor MDL's (lists of lists of lists of fragments).


    Like, say, for a filesystem, it is presumably:
    read syscall from user to OS;
    route this to the corresponding VFS driver;
    Requests spanning multiple blocks being broken up into parts;
    VFS driver checks the block-cache / buffer-cache;
    If found, copy from cache into user-space;
    If not found, send request to the underlying block device;
    Wait for response (and/or reschedule task for later);
    Copy result back into userland.

    Yes, pretty much (there is page management, quota management).
    Except if I request a direct IO it DMA's direct to/from the user buffer,
    if hardware supports that.

    Though, it may make sense that if a request isn't available immediately,
    and there is some sort of DMA mechanism, the OS could block the task and
    then resume it once the data becomes available. For polling IO, doesn't
    likely make much difference as the CPU is basically stuck in a busy loop
    either way until the IO finishes.

    Yes, that's DMA resource management. Basically each system has a certain
    number of scatter-gather IO mappers, now implemented by the IOMMU page
    table. Each IO queues a request for its mappers, and the DMA resource
    manager doles out a set of IO mapping registers, which may be less than
    you requested, in which case you break up your IO into multiple requests.
    Then you program the scatter-gather map using info from the IO's MDL,
    pass the mapped IO space addresses to the device, and Bob's your uncle.
    When the IO completes, your driver tears down its IO map and releases
    the mapping registers to the next waiting IO.
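
The allocate, map, transfer, release cycle described above can be sketched in Python roughly as follows. `MapRegisterPool` and `transfer` are hypothetical names, one mapping register is assumed to cover one page, and a real driver would queue and wait for registers rather than raise when none are free.

```python
class MapRegisterPool:
    """Toy pool of IO mapping registers shared by all in-flight IOs."""

    def __init__(self, total):
        self.free = total

    def allocate(self, wanted):
        """Grant up to `wanted` registers; the grant may be smaller than requested."""
        granted = min(wanted, self.free)
        self.free -= granted
        return granted

    def release(self, count):
        self.free += count


def transfer(pages, pool, do_dma):
    """Move `pages` in as many rounds as the granted registers require.

    do_dma is called once per round with the batch of pages mapped for it.
    Returns the number of rounds used.
    """
    rounds = 0
    index = 0
    while index < len(pages):
        granted = pool.allocate(len(pages) - index)
        if granted == 0:
            raise RuntimeError("no map registers free; a real driver would wait")
        batch = pages[index:index + granted]
        do_dma(batch)            # program the map, start the device, wait
        pool.release(granted)    # tear down the map for the next waiting IO
        index += granted
        rounds += 1
    return rounds


pool = MapRegisterPool(total=4)
batches = []
rounds = transfer(list(range(10)), pool, batches.append)
print(rounds, batches)
```

With four registers and a ten-page IO, the request is split into three rounds of four, four and two pages.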

    Though, could make sense for hardware accelerating pixel-copying
    operations for a GUI.

    On Windows the Gui is managed completely differently.
    I'm not familiar enough with the details to comment other than to say
    it is executed as privileged subroutines by the calling thread but in
    super mode, which allows it direct access to the calling virtual space.

  • From Scott Lurndal@21:1/5 to EricP on Sun Dec 17 19:24:09 2023
    EricP <[email protected]> writes:
    BGB-Alt wrote:
    On 12/16/2023 1:25 PM, EricP wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:



    One thing I don't get here is why there would be direct DMA between
    userland and the device (at least for filesystem and similar).

    Zero-copy IO. That has always been available on WinNT provided hardware
    supports it. General byte-buffer IO could always do zero-copy DMA,
    with HW support. For files one can do IO direct to a user buffer with
    certain restrictions, buffers must be file block size and alignment.
    I haven't checked but guessing that if the file block is already in file
    cache it gets copied, otherwise it DMA's directly to/from the user buffer.
    Normally one wants cached file blocks but there are times when one doesn't
    and wants the more optimal direct buffer IO (eg, a video player).

    There is also scatter-gather IO, intended for network cards,
    where the IO is a list of byte sized and aligned virtual buffers.

    This all interacts with DMA and page management because the physical
    page frames that contain the bytes must be pinned in memory for the
    duration of the DMA IO.

    PCI express has an optional feature, PRI (Page Request Interface)
    that allows the hardware to request that a page be 'pinned' just
    for the duration of a DMA operation. The ARM64 server base system
    architecture document requires that the host support PRI. This
    works in conjunction with PCIe ATS (Address Translation Services)
    which allows the endpoint device to ask the host for translations
    and cache them in the endpoint so the endpoint can use physical
    addresses directly. This is usually implemented by the IOMMU
    on the host treating the endpoint as if it had a remote TLB cache.

    A single virtual buffer becomes a list of
    physical fragments, so a scatter-gather list becomes a list of lists
    of physical byte buffer fragments, called a Memory Descriptor List (MDL)
    in Windows.
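
    To illustrate how one virtual buffer turns into a list of physical
    fragments, here is a small sketch. It assumes 4 KiB pages and a
    hypothetical page_table dict (virtual page number -> physical frame
    number); it is not the actual Windows MDL layout.

```python
PAGE_SIZE = 4096  # assumed page size


def build_fragments(vaddr, length, page_table):
    """Return [(physical_address, byte_count), ...] covering the buffer."""
    fragments = []
    while length > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        frame = page_table[vpn]                  # KeyError if not resident
        count = min(length, PAGE_SIZE - offset)  # stop at the page boundary
        fragments.append((frame * PAGE_SIZE + offset, count))
        vaddr += count
        length -= count
    return fragments


# Hypothetical mapping: virtual pages 16 and 17 sit in frames 7 and 3.
page_table = {16: 7, 17: 3}

# One 200-byte buffer that straddles the boundary between pages 16 and 17.
buffers = [(16 * PAGE_SIZE + 4000, 200)]

# A scatter-gather list of virtual buffers becomes a list of lists.
sg_list = [build_fragments(v, n, page_table) for v, n in buffers]
print(sg_list)
```

    The single buffer yields two fragments because its two pages are not
    physically adjacent.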

    And then SR-IOV adds virtual machines to the mix,

    Not necessarily just virtual machines - it's also used
    to expose the virtual function to user mode code in
    a bare metal (or virtualized) operating system.

  • From Terje Mathisen@21:1/5 to Thomas Koenig on Mon Dec 18 12:11:27 2023
    Thomas Koenig wrote:
    BGB <[email protected]> schrieb:
    Modern PC's are orders of magnitude faster, but still don't have
    "instant" compile times by any means.

    Could be faster though, but would likely need languages other than C or
    (especially) C++.

    I assume you never worked with Turbo Pascal.

    I was going to bring up TP but you beat me to it. :-)

    That was amazing. It compiled code so fast that it was never a
    bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
    The first version I ever used, 3.0 (?) compiled from memory to
    memory, so even slow I/O (to floppy disc, at the time) was not
    an issue.

    TP1.0 was an executable which in ~37KB managed to fit an IDE,
    compiler/linker/loader/debugger and RTL, and if you abstained from
    getting human readable error messages you could save about 1.5KB.

    This was made possible by using a streamlined one-pass compiler. It
    didn't do much optimization, but when the alternative was BASIC, the
    generated code was still extremely fast by comparison.

    That compiler had zero optimization, it was a pure pattern match->emit
    code engine that would reload the same variable from RAM on every
    statement, but as you said, still far faster than the alternatives.
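
    A toy sketch of what a pure "pattern match -> emit" code generator looks
    like. This is not Turbo Pascal's actual code or output: the statement
    form, the function names and the pseudo-8086 mnemonics are assumptions
    chosen only to show the reload-on-every-statement behaviour.

```python
def compile_statement(stmt):
    """Translate 'dst = a + b' or 'dst = a - b' into pseudo-8086 lines."""
    dst, expr = [s.strip() for s in stmt.split("=")]
    left, op, right = expr.split()
    opcode = {"+": "ADD", "-": "SUB"}[op]
    return [
        f"MOV AX, [{left}]",       # always load the operand from memory
        f"{opcode} AX, [{right}]",
        f"MOV [{dst}], AX",        # always store the result back
    ]


def compile_program(source):
    """One pass over the source, one fixed template per statement."""
    code = []
    for line in source.splitlines():
        if line.strip():
            code.extend(compile_statement(line))
    return code


listing = compile_program("x = a + b\ny = x + a")
print("\n".join(listing))
```

    In the listing, x is stored and then immediately reloaded by the next
    statement, because nothing is remembered from one statement to the next.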

    When speed was an actual issue I would switch to (inline) assembler,
    even though that was initially just a way to embed machine code directly
    so I had to assemble it in DEBUG.

    There were a few drawbacks. The biggest one was that programming errors
    tended to freeze the machine. Another (not so important) was that,
    if you were one of the lucky people to have an 80x87 coprocessor, the
    generated code did not check for overflow of the coprocessor stack.

    The fp code generated by TP would never overflow the 87 stack afair,
    since it would do single operations and pop the results at once?

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Mon Dec 18 15:43:52 2023
    "Paul A. Clayton" <[email protected]> writes:
    On 12/17/23 2:24 PM, Scott Lurndal wrote:
    EricP <[email protected]> writes:
    [snip zero-copy and scatter-gather I/O]
    This all interacts with DMA and page management because the physical
    page frames that contain the bytes must be pinned in memory for the
    duration of the DMA IO.

    PCI express has an optional feature, PRI (Page Request Interface)
    that allows the hardware to request that a page be 'pinned' just
    for the duration of a DMA operation. The ARM64 server base system
    architecture document requires that the host support PRI. This
    works in conjunction with PCIe ATS (Address Translation Services)
    which allows the endpoint device to ask the host for translations
    and cache them in the endpoint so the endpoint can use physical
    addresses directly. This is usually implemented by the IOMMU
    on the host treating the endpoint as if it had a remote TLB cache.

    Interesting. I had proposed some years ago that rather than
    pinning a physical page for I/O a page be provided when needed
    from a free list (including that the data could be cached/buffered
    with a virtual address tag).

    In most usage cases, the page being DMA'd from/to has other
    unrelated data in it, rather than being fully dedicated to
    a single buffer or set of buffers.

    The PRI is more about making sure the OS makes the page present
    before the DMA operation begins and ensuring that it won't go
    away before the DMA operation ends.

  • From EricP@21:1/5 to Scott Lurndal on Mon Dec 18 13:59:34 2023
    Scott Lurndal wrote:
    EricP <[email protected]> writes:
    BGB-Alt wrote:
    On 12/16/2023 1:25 PM, EricP wrote:
    MitchAlsup wrote:
    Scott Lurndal wrote:

    [email protected] (MitchAlsup) writes:
    Scott Lurndal wrote:


    One thing I don't get here is why there would be direct DMA between
    userland and the device (at least for filesystem and similar).
    Zero-copy IO. That has always been available on WinNT provided hardware
    supports it. General byte-buffer IO could always do zero-copy DMA,
    with HW support. For files one can do IO direct to a user buffer with
    certain restrictions, buffers must be file block size and alignment.
    I haven't checked but guessing that if the file block is already in file
    cache it gets copied, otherwise it DMA's directly to/from the user buffer.
    Normally one wants cached file blocks but there are times when one doesn't
    and wants the more optimal direct buffer IO (eg, a video player).

    There is also scatter-gather IO, intended for network cards,
    where the IO is a list of byte sized and aligned virtual buffers.

    This all interacts with DMA and page management because the physical
    page frames that contain the bytes must be pinned in memory for the
    duration of the DMA IO.

    PCI express has an optional feature, PRI (Page Request Interface)
    that allows the hardware to request that a page be 'pinned' just
    for the duration of a DMA operation. The ARM64 server base system
    architecture document requires that the host support PRI. This
    works in conjunction with PCIe ATS (Address Translation Services)
    which allows the endpoint device to ask the host for translations
    and cache them in the endpoint so the endpoint can use physical
    addresses directly. This is usually implemented by the IOMMU
    on the host treating the endpoint as if it had a remote TLB cache.

    I don't know how one would make use of that on Windows as it completely
    separates the IO off so that the OS can switch to a different process
    address space while the DMA takes place. The data structures to support
    paging might not be easily accessible which would introduce long latency
    in the middle of a DMA - which is exactly why it doesn't do this.
    (I don't think Linux allows paging inside the OS or drivers either.)

    On Windows you can have paging while managing a device if you put
    the driver code in either a privileged user or super mode thread,
    and then you deal with any timing issues.
    The old floppy driver worked this way - as an OS thread.
    But that was a very slow device and used programmed IO not DMA.

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Dec 19 14:28:10 2023
    "Paul A. Clayton" <[email protected]> writes:
    On 12/18/23 12:39 PM, MitchAlsup wrote:
    [snip page pinning for DMA]
    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    Out of curiosity, what happens when an I/O device tries to DMA to
    a page which the OS thinks is pinned.

    The I/O device simply pushes data to the physical address. It's
    the responsibility of the operating software to ensure the
    physical address given to the device (either via ATS where the
    device hosts the "tlb" or via the IOMMU) is correct and legal.

    If the IOMMU translation tables mark the page as absent, an error response
    will be returned to the device. If ATS was used, and the
    host didn't invalidate the translation at the host, the
    device will DMA to the specified physical address regardless
    of whether it is the correct page.
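
    The two cases above can be modelled with a toy sketch. The Iommu and
    Device classes and the page numbers are hypothetical; real hardware
    signals the error through PCIe completions rather than exceptions.

```python
class Iommu:
    def __init__(self):
        self.table = {}  # I/O virtual page -> physical frame

    def translate(self, page):
        # A page marked absent produces an error response to the device.
        if page not in self.table:
            raise LookupError("translation fault")
        return self.table[page]


class Device:
    def __init__(self, iommu, use_ats):
        self.iommu = iommu
        self.use_ats = use_ats
        self.atc = {}  # device-side translation cache, filled via ATS

    def dma_target(self, page):
        if self.use_ats:
            if page not in self.atc:
                self.atc[page] = self.iommu.translate(page)
            return self.atc[page]        # no host check on a cache hit
        return self.iommu.translate(page)

    def invalidate(self, page):
        self.atc.pop(page, None)


def outcome(device, page):
    try:
        return device.dma_target(page)
    except LookupError:
        return "error response"


iommu = Iommu()
iommu.table[5] = 42
plain, ats = Device(iommu, use_ats=False), Device(iommu, use_ats=True)
before = (outcome(plain, 5), outcome(ats, 5))

del iommu.table[5]   # host marks the page absent but sends no invalidation
after_unmap = (outcome(plain, 5), outcome(ats, 5))

ats.invalidate(5)    # host remembers to invalidate the device's cache
after_invalidate = outcome(ats, 5)
print(before, after_unmap, after_invalidate)
```

    After the unmap, the device without ATS gets an error, while the ATS
    device still targets frame 42 until its cached translation is
    invalidated.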

  • From Scott Lurndal@21:1/5 to Paul A. Clayton on Tue Jan 2 23:39:36 2024
    "Paul A. Clayton" <[email protected]> writes:
    On 12/18/23 10:40 PM, MitchAlsup wrote:
    Paul A. Clayton wrote:

    On 12/18/23 12:39 PM, MitchAlsup wrote:
    [snip page pinning for DMA]
    Guest OS can pin a guest physical page, but HyperVisor decides
    if the page is present or absent in memory.

    Out of curiosity, what happens when an I/O device tries to DMA to
    a page which the OS thinks is pinned. I would *guess* that a DMA
    operation that fails for an unvirtualized I/O device merely
    presents an error.

    If the page fault occurs in the level 1 table, Guest OS gets a
    device page fault exception, if it happens in the level 2 table
    HyperVisor gets a device page fault exception.

    If the device can recover from page faults, the proper supervisor
    "does OS stuff" and then signals the device to proceed with the
    still pending device request. The "does OS stuff" does for the I/O
    device pretty much what the proper supervisor does with a
    CPU page fault--with all the nuances and idiosyncrasies (or more.)
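
    A small sketch of the fault routing just described, assuming two
    hypothetical lookup tables: stage1 (guest virtual page -> guest physical
    page, owned by the Guest OS) and stage2 (guest physical page -> host
    frame, owned by the HyperVisor). The names and page numbers are made up
    for illustration.

```python
def translate(guest_va_page, stage1, stage2):
    """Return ('ok', host_frame) or ('fault', who_handles_it)."""
    if guest_va_page not in stage1:
        return ("fault", "guest OS")      # miss in the level 1 table
    guest_pa_page = stage1[guest_va_page]
    if guest_pa_page not in stage2:
        return ("fault", "hypervisor")    # miss in the level 2 table
    return ("ok", stage2[guest_pa_page])


stage1 = {1: 10, 2: 11}   # the guest believes pages 1 and 2 are both mapped
stage2 = {10: 100}        # the hypervisor has only backed guest page 10

results = [translate(p, stage1, stage2) for p in (1, 2, 3)]
print(results)
```

    Page 2 is the interesting case: the guest considers it present, yet the
    device access still faults, and the fault goes to the hypervisor.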

    If the HV encounters a device that cannot handle a page fault for
    a page that it decided not to allocate but the OS did (knowing
    that that specific device could not handle page faults), what
    error status is sent to the OS? The HV cannot simply pass along a
    "page fault" error because the OS _knows_ that the page was
    allocated; that would break pure virtualization and potentially
    seriously confuse the OS if virtualization was not considered as a
    possibility (e.g., the OS might assume the device had either a
    transient or persistent error that caused the wrong error type to
    be returned, confirm it as persistent after the second encounter,
    and mark the device as broken).

    If the HV is allowing direct access to the device, and allowing
    the device to use physical addresses via cached translations,
    then the device must support both PCIe ATS and PRI. The former
    handles the translations and the latter requests that a page
    be "pinned" for a subsequent DMA operation.

    The HV controls the IOMMU which provides both the ATS and PRI interfaces
    to the device. So the HV can invalidate a translation held in the
    device (for ATS) or refuse to pin a page (or unpin a page).
