When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.
But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA-based
system, even if it required multiple FPGAs on a custom circuit board, for
not huge amounts of money. Whether this is worthwhile, I cannot say.
Anyway, has all development stopped? Or is their "sweat equity" model
still going on?
Inquiring minds want to know.
On 12/15/2023 11:48 AM, George Neuner wrote:
On Wed, 13 Dec 2023 08:25:39 -0800, Stephen Fuld
<[email protected]d> wrote:
When we last heard from the merry band of Millers, they were looking for
substantial funding from a VC or similar. I suppose that if they had
gotten it, we would have heard, so I guess they haven't.
But I think there are things they could do to move forward even without
a large investment. For example, they could develop an FPGA based
system, even if it required multiple FPGAs on a custom circuit board for
not huge amounts of money. Whether this is worthwhile, I cannot say.
Anyway, has all development stopped? Or is their "sweat equity" model
still going on?
Inquiring minds want to know.
There was a post, ostensibly from Ivan, in their web forum just a few
days ago. No news though - just an acknowledgement of another user's
post.
Last I heard, the next (current?) round of financing was - at least in
part - to be used for FPGA "proof of concept" implementations.
Problem is the Mill really is a SoC, and (to me at least) the design
appears to be so complex that it would require a large, top-of-line
(read "expensive") FPGA to fit all the functionality.
Yeah, the lower end isn't cheap, and the upper end is absurd...
For FPGAs over $1k, it almost makes more sense to ignore that they exist
(this also appears to be around the cutoff point for the free version of
Vivado; but one would have thought Xilinx would have already gotten their
money from someone having bought the FPGA?...).
[email protected] (MitchAlsup) writes:
In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.
As did our COBOL compiler (which ran in 50KB).
On 2023-12-16 0:39, Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
[snip]
In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.
As did our COBOL compiler (which ran in 50KB).
Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.
Niklas Holsti <[email protected]d> writes:
On 2023-12-16 0:39, Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
[snip]
In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.
As did our COBOL compiler (which ran in 50KB).
Are you both sure that those numbers are really lines per *second*? They
seem improbably high, and compilation speeds in those years used to be
stated in lines per *minute*.
Especially given that 10Klines/s is probably around 500KB/s which has
to be read from disk and probably a similar amount that has to be
written to disk. What were the I/O throughputs available at the time?
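A quick back-of-envelope check of the two readings of that claim. The 150 ns cycle time and the 10,000 lines figure come from the posts above; the 50 bytes per source line is only an assumption chosen to match the ~500KB/s estimate, and the sketch computes a cycle budget per line without assuming anything about instructions per cycle.

```python
CYCLE_TIME_NS = 150      # from the post: 150 ns cycle time
LINES = 10_000           # claimed lines compiled per unit of time
BYTES_PER_LINE = 50      # assumption: chosen to match the ~500KB/s estimate


def budget(lines_per_second: float) -> tuple[float, float]:
    """Return (machine cycles available per source line, source bytes read per second)."""
    cycles_per_second = 1e9 / CYCLE_TIME_NS
    return cycles_per_second / lines_per_second, lines_per_second * BYTES_PER_LINE


per_second = budget(LINES)        # reading the claim as lines per second
per_minute = budget(LINES / 60)   # reading the claim as lines per minute

print(f"per second: {per_second[0]:.0f} cycles/line, {per_second[1] / 1000:.0f} KB/s")
print(f"per minute: {per_minute[0]:.0f} cycles/line, {per_minute[1] / 1000:.1f} KB/s")
```

Under these assumptions the per-second reading leaves roughly 667 cycles per source line, while the per-minute reading leaves about 40,000 cycles per line and needs only about 8 KB/s of source input.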
Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
BGB wrote:
For FPGA's over $1k, almost makes more sense to ignore that they
exist (also this appears to be around the cutoff point for the
free version of Vivado as well; but one would have thought Xilinx
would have already gotten their money by someone having bought the
FPGA?...).
For anyone serious, a verif engineer can cost $500-1000/day. The FPGA
cost is in the noise.
For a hobby? Well...
If the compiler is kept smaller, it is faster to recompile from
source.
In 1979 I joined a company with a FORTRAN mostly-77- that compiled
at 10,000 lines of code per second for an IBM-like minicomputer
(less decimal and string) and did a pretty good job of spitting out
high performance
code; on a machine with a 150ns cycle time.
As did our COBOL compiler (which ran in 50KB). But in both cases,
the languages were far simpler and much easier to generate efficient
code than languages like Modula, Pascal, C, et alia.
Though, within moderate limits, 1M lines would basically be enough
to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size
limit).
If there were an efficient way to run the device driver stack in user-mode
without privilege and only the MMI/O pages this driver can touch mapped
into his VAS. Poof none of the driver stack is in the kernel. --IF--
That's actually quite common and one of the raisons d'etre of the
PCI Express SR-IOV feature. When you can present a virtual
function to the user directly (mapping the MMIO region into
the user mode virtual address space) the app has direct access
to the hardware. Interrupts are the only tricky part, and
the kernel virtio subsystem, which interfaces with the user
application via shared memory, provides interrupt handling
to the application.
An I/OMMU provides memory protection for DMA operations initiated
by the virtual function ensuring it only accesses the application
virtual address space.
Why should the device be able to access user VaS outside of the buffer
the user provided, OH so long ago ??
Because the device wants to do DMA directly into or from the user's
virtual address space. Bulk transfer, not MMIO accesses.
OK, I will ask the question in the contrapositive way::
If the user asks the device to read into a buffer, why does the device get
to see everything of the user's space along with that buffer ?
The way you write you are assuming the device can write into the
user's code space when he asks for a read from one of his buffers !?!
You _could_ give the device translations to anything and everything
in user space, but this seems excessive when the user only wants
the device to read/write a small area inside his VaS.
OS code already has to manipulate PTE entries or MMU tables so
the device can write read-only and execute-only pages along with
removing write-permission on a page with data inbound from a device.
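The narrower alternative described above, giving the device translations only for the pages that hold the buffer, can be sketched as a toy model. Everything here is hypothetical (the `ToyIommu` class, its method names, the 4 KiB page size); it models no real IOMMU interface and only shows which addresses become device-visible.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def pages_for_buffer(vaddr: int, length: int) -> set[int]:
    """Virtual page numbers touched by the byte range [vaddr, vaddr + length)."""
    first = vaddr // PAGE_SIZE
    last = (vaddr + length - 1) // PAGE_SIZE
    return set(range(first, last + 1))


class ToyIommu:
    """Hypothetical per-device translation set; not a real IOMMU API."""

    def __init__(self) -> None:
        self.mapped: set[int] = set()

    def map_buffer(self, vaddr: int, length: int) -> None:
        # Grant the device only the pages that contain the buffer.
        self.mapped |= pages_for_buffer(vaddr, length)

    def unmap_buffer(self, vaddr: int, length: int) -> None:
        # Revoke those pages again once the transfer has completed.
        self.mapped -= pages_for_buffer(vaddr, length)

    def device_may_access(self, vaddr: int) -> bool:
        # Protection is page-granular: any byte of a mapped page is reachable.
        return vaddr // PAGE_SIZE in self.mapped


iommu = ToyIommu()
iommu.map_buffer(0x1100, 0x200)          # a 512-byte buffer inside page 1
print(iommu.device_may_access(0x1100))   # inside the buffer
print(iommu.device_may_access(0x1FFF))   # same page, outside the buffer
print(iommu.device_may_access(0x2000))   # next page, never mapped
```

Even in this narrow scheme the device can reach bytes that share a page with the buffer, since the mapping granularity is the page rather than the buffer.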
On 12/16/2023 12:04 PM, moi wrote:
On 16/12/2023 07:22, Niklas Holsti wrote:
On 2023-12-16 0:39, Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
[snip]
In 1979 I joined a company with a FORTRAN mostly-77- that compiled at
10,000 lines of code per second for an IBM-like minicomputer (less decimal
and string) and did a pretty good job of spitting out high performance
code; on a machine with a 150ns cycle time.
As did our COBOL compiler (which ran in 50KB).
Are you both sure that those numbers are really lines per *second*?
They seem improbably high, and compilation speeds in those years used
to be stated in lines per *minute*.
Almost certainly per minute.
I worked on a compiler in 1975 that ran on the most powerful ICL 1900.
It achieved 20K cards per minute and was considered to be very fast.
Lines per minute seems to make sense.
Modern PCs are orders of magnitude faster, but still don't have
"instant" compile times by any means.
Could be faster though, but would likely need languages other than C or
(especially) C++.
Also, it would be nice to have a basically usable OS and core software
stack in under 1M lines.
Say, by not trying to be everything to everyone, and limiting how much
is allowed in the core OS (or is allowed within the build process for
the core OS).
Though, within moderate limits, 1M lines would basically be enough to fit:
A basic kernel;
(this excludes the Linux kernel, which is well over the size limit).
A (moderate sized) C compiler;
(but not GCC, which is also well over this size limit).
A shell+utils comparable to BusyBox;
Various core OS libraries and similar, etc.
For this, will assume an at least nominally POSIX like environment.
Programs that run on the OS would not be counted in the line-count budget.
On 12/16/2023 1:25 PM, EricP wrote:
The OS can't remove the page RW access for a user mode page while an
IO device is DMA writing the page, if that's what you meant,
as the DMA-in may be writing to a smaller buffer within a larger page.
It is perfectly normal for a thread to continue to work in buffer
bytes adjacent to the one currently involved in an async IO.
One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).
Like, say, for a filesystem, it is presumably:
read syscall from user to OS;
route this to the corresponding VFS driver;
Requests spanning multiple blocks being broken up into parts;
VFS driver checks the block-cache / buffer-cache;
If found, copy from cache into user-space;
If not found, send request to the underlying block device;
Wait for response (and/or reschedule task for later);
Copy result back into userland.
Though, it may make sense that if a request isn't available immediately,
and there is some sort of DMA mechanism, the OS could block the task and
then resume it once the data becomes available. For polling IO, it likely
doesn't make much difference, as the CPU is basically stuck in a busy loop
either way until the IO finishes.
Though, it could make sense for hardware-accelerating pixel-copying
operations for a GUI.
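The copy-through-cache read path listed above can be sketched roughly as follows. This is a toy model, not any real kernel's VFS: `BlockDevice`, `ToyVfs`, and the 512-byte block size are hypothetical names and values chosen for illustration, and blocking or rescheduling while waiting on the device is left out.

```python
BLOCK_SIZE = 512  # assumption: illustrative block size


class BlockDevice:
    """Hypothetical stand-in for a disk that only hands out whole blocks."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.reads = 0  # counts requests that actually reached the device

    def read_block(self, block_no: int) -> bytes:
        self.reads += 1
        return self.data[block_no * BLOCK_SIZE:(block_no + 1) * BLOCK_SIZE]


class ToyVfs:
    """Toy read path: split by block, check the cache, else ask the device."""

    def __init__(self, dev: BlockDevice) -> None:
        self.dev = dev
        self.cache: dict[int, bytes] = {}  # block number -> cached block

    def read(self, offset: int, length: int) -> bytes:
        out = bytearray()
        end = offset + length
        while offset < end:
            # A request spanning multiple blocks is handled one block at a time.
            block_no, start = divmod(offset, BLOCK_SIZE)
            block = self.cache.get(block_no)
            if block is None:
                # Cache miss: send the request to the underlying block device.
                block = self.dev.read_block(block_no)
                self.cache[block_no] = block
            take = min(BLOCK_SIZE - start, end - offset)
            chunk = block[start:start + take]
            if not chunk:
                break  # read past the end of the device
            out += chunk  # the copy from the cache into the caller's buffer
            offset += len(chunk)
        return bytes(out)


disk = BlockDevice(bytes(range(256)) * 8)   # 2048 bytes, i.e. four blocks
vfs = ToyVfs(disk)
first = vfs.read(500, 30)    # spans blocks 0 and 1, both fetched from the device
second = vfs.read(500, 30)   # served entirely from the cache
print(disk.reads)
```

The extra copy out of the cache is the cost that direct (zero-copy) DMA into a user buffer avoids, in exchange for giving up caching of those blocks.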
BGB-Alt wrote:
On 12/16/2023 1:25 PM, EricP wrote:
MitchAlsup wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
Scott Lurndal wrote:
One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).
Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions, buffers must be file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).
There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.
This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.
A single virtual buffer becomes a list of
physical fragments, so a scatter-gather list becomes a list of lists
of physical byte buffer fragments, called a Memory Descriptor List (MDL)
in Windows.
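As a rough illustration of how one virtual buffer turns into a list of physical fragments, and a scatter-gather list into a list of such lists, here is a small sketch. It is not the actual Windows MDL layout; the page size, the dictionary page table, and the function names are assumptions made up for the example, and pinning of the pages is not modeled.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def fragments(vaddr: int, length: int, page_table: dict[int, int]) -> list[tuple[int, int]]:
    """Split a virtual byte range into (physical address, byte count) fragments.

    page_table maps virtual page number -> physical frame number (hypothetical).
    """
    frags: list[tuple[int, int]] = []
    while length > 0:
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        count = min(PAGE_SIZE - offset, length)
        paddr = page_table[vpn] * PAGE_SIZE + offset
        if frags and frags[-1][0] + frags[-1][1] == paddr:
            # Physically contiguous with the previous fragment: extend it.
            frags[-1] = (frags[-1][0], frags[-1][1] + count)
        else:
            frags.append((paddr, count))
        vaddr += count
        length -= count
    return frags


def scatter_gather(buffers: list[tuple[int, int]], page_table: dict[int, int]) -> list[list[tuple[int, int]]]:
    """A scatter-gather request becomes one fragment list per virtual buffer."""
    return [fragments(vaddr, length, page_table) for vaddr, length in buffers]


table = {0: 7, 1: 3, 2: 4}   # virtual pages 1 and 2 happen to be physically adjacent
single = fragments(4000, 5000, table)
print(single)
print(scatter_gather([(4000, 5000), (100, 16)], table))
```

In this example the 5000-byte buffer crosses three virtual pages but collapses to two physical fragments because two of the frames are adjacent.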
And then SR-IOV adds virtual machines to the mix,
BGB <[email protected]> wrote:
Modern PC's are orders of magnitude faster, but still don't have
"instant" compile times by any means.
Could be faster though, but would likely need languages other than C or
(especially) C++.
I assume you never worked with Turbo Pascal.
That was amazing. It compiled code so fast that it was never a
bother to wait for it, even on an 8088 IBM PC running at 4.7 MHz.
The first version I ever used, 3.0 (?) compiled from memory to
memory, so even slow I/O (to floppy disc, at the time) was not
an issue.
This was made possible by using a streamlined one-pass compiler. It
didn't do much optimization, but when the alternative was BASIC, the
generated code was still extremely fast by comparison.
There were a few drawbacks. The biggest one was that programming errors
tended to freeze the machine. Another (not so important) was that,
if you were one of the lucky people to have an 80x87 coprocessor, the
generated code did not check for overflow of the coprocessor stack.
On 12/17/23 2:24 PM, Scott Lurndal wrote:
EricP <[email protected]> writes:
[snip zero-copy and scatter-gather I/O]
This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.
PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.
Interesting. I had proposed some years ago that rather than
pinning a physical page for I/O a page be provided when needed
from a free list (including that the data could be cached/buffered
with a virtual address tag).
EricP <[email protected]> writes:
BGB-Alt wrote:
On 12/16/2023 1:25 PM, EricP wrote:
MitchAlsup wrote:
Scott Lurndal wrote:
[email protected] (MitchAlsup) writes:
Scott Lurndal wrote:
One thing I don't get here is why there would be direct DMA between
userland and the device (at least for filesystem and similar).
Zero-copy IO. That has always been available on WinNT provided hardware
supports it. General byte-buffer IO could always do zero-copy DMA,
with HW support. For files one can do IO direct to a user buffer with
certain restrictions, buffers must be file block size and alignment.
I haven't checked but guessing that if the file block is already in file
cache it gets copied, otherwise it DMA's directly to/from the user buffer.
Normally one wants cached file blocks but there are times when one doesn't
and wants the more optimal direct buffer IO (eg, a video player).
There is also scatter-gather IO, intended for network cards,
where the IO is a list of byte sized and aligned virtual buffers.
This all interacts with DMA and page management because the physical
page frames that contain the bytes must be pinned in memory for the
duration of the DMA IO.
PCI express has an optional feature, PRI (Page Request Interface)
that allows the hardware to request that a page be 'pinned' just
for the duration of a DMA operation. The ARM64 server base system
architecture document requires that the host support PRI. This
works in conjunction with PCIe ATS (Address Translation Services)
which allows the endpoint device to ask the host for translations
and cache them in the endpoint so the endpoint can use physical
addresses directly. This is usually implemented by the IOMMU
on the host treating the endpoint as if it had a remote TLB cache.
On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]
Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.
Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned.
On 12/18/23 10:40 PM, MitchAlsup wrote:
Paul A. Clayton wrote:
On 12/18/23 12:39 PM, MitchAlsup wrote:
[snip page pinning for DMA]
Guest OS can pin a guest physical page, but HyperVisor decides
if the page is present or absent in memory.
Out of curiosity, what happens when an I/O device tries to DMA to
a page which the OS thinks is pinned. I would *guess* that a DMA
operation that fails for an unvirtualized I/O device merely
presents an error.
If the page fault occurs in the level 1 table, Guest OS gets a
device page fault exception, if it happens in the level 2 table
HyperVisor gets a device page fault exception.
If the device can recover from page faults, the proper supervisor
"does OS stuff" and then signals the device to proceed with the
still pending device request. The "does OS stuff" does for the I/O
device pretty much what the proper supervisor does with a
CPU page fault--with all the nuances and idiosyncrasies (or more.)
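The routing rule described there (a miss in the level 1 table goes to the Guest OS, a miss in the level 2 table goes to the HyperVisor) can be sketched as a toy two-stage lookup. The dictionary tables, the function name, and the result tuples are all hypothetical; real device translation walks multi-level page tables and reports faults through hardware-specific queues.

```python
def translate_device_access(guest_virtual_page, level1, level2):
    """Toy two-stage translation for a device access.

    level1: guest virtual page -> guest physical page (owned by the Guest OS)
    level2: guest physical page -> host physical page (owned by the HyperVisor)
    Returns ("ok", host_page) or ("fault", who_handles_it).
    """
    guest_physical_page = level1.get(guest_virtual_page)
    if guest_physical_page is None:
        # Miss in the level 1 table: the Guest OS gets the device page fault.
        return ("fault", "guest_os")
    host_page = level2.get(guest_physical_page)
    if host_page is None:
        # Miss in the level 2 table: the HyperVisor gets the device page fault.
        return ("fault", "hypervisor")
    return ("ok", host_page)


level1 = {10: 100, 11: 101}   # page 12 is not mapped by the Guest OS
level2 = {100: 5000}          # guest physical page 101 is absent in the host

print(translate_device_access(10, level1, level2))
print(translate_device_access(11, level1, level2))
print(translate_device_access(12, level1, level2))
```

The second case is the one the follow-up question is about: the Guest OS believes page 11 is present (and possibly pinned), yet the access still faults at the level the HyperVisor controls.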
If the HV encounters a device that cannot handle a page fault for
a page that it decided not to allocate but the OS did (knowing
that that specific device could not handle page faults), what
error status is sent to the OS? The HV cannot simply pass along a
"page fault" error because the OS _knows_ that the page was
allocated; that would break pure virtualization and potentially
seriously confuse the OS if virtualization was not considered as a
possibility (e.g., the OS might assume the device had either a
transient or persistent error that caused the wrong error type to
be returned, confirm it as persistent after the second encounter,
and mark the device as broken).