On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
Just before Google Groups got spammed to death; I wrote::
--------------------------------------------------------
MitchAlsup
Nov 1, 2022, 5:53:02 PM
In a thread called "Arguments for a Sane Instruction Set Architecture"
Aug 7, 2017, 6:53:09 PM I wrote::
-----------------------------------------------------------------------
Looking back over my 40-odd year career in computer architecture,
I thought I would list out the typical errors I and others have
made with respect to architecting computers. This is going to be
a bit long, so bear with me:
When the Instruction Set architecture is Sane, there is support
for:
A) negating operands prior to an arithmetic calculation.
Not seen many doing this, and might not be workable in the general case.
Might make sense for FPU ops like FADD/FMUL.
Maybe 'ADD'. Though, "-(A+B)" is the only case that can't be expressed
with traditional ADD/SUB/RSUB.
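As a rough illustration of the point above (Python used only as notation; the function name is made up, and the mnemonics are the traditional ADD/SUB/RSUB mentioned in the text), the four sign combinations of a two-operand add map onto existing instructions as follows, with only the doubly-negated case left over:

```python
def select_op(neg_a: bool, neg_b: bool) -> str:
    """Pick a traditional opcode for (+/-A) + (+/-B).

    Returns the mnemonic, or "ADD+NEG" when no single ADD/SUB/RSUB
    instruction can express the result.
    """
    table = {
        (False, False): "ADD",      #  A + B
        (False, True):  "SUB",      #  A - B
        (True,  False): "RSUB",     #  B - A
        (True,  True):  "ADD+NEG",  # -(A + B): needs an extra negate
    }
    return table[(neg_a, neg_b)]
```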
B) providing constants from the instruction stream;
..where constant can be an immediate, a displacement, or both.
Probably true.
My ISA allows for Immediate or Displacement to be extended, but doesn't currently allow (in the base ISA) any instructions that can encode both
an immediate and displacement.
At present:
Baseline allows Imm33s/Disp33s via a 64-bit encoding;
There is optional support for Imm57s, which in XG2 is now extended to
Imm64.
There are special cases that allow immediate encodings for many
instructions that would otherwise lack an immediate encoding.
C) exact floating point arithmetics that get the Inexact flag
..correctly unmolested.
Dunno. I suspect the whole global FPU status/control register thing
should probably be rethought somehow.
But, off-hand, don't know of a clearly better alternative.
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.
Likely expensive...
Granted, "glorified branch with some twiddling" is probably a little too
far in the other direction. Interrupt and syscall overhead is fairly
high when the handler needs to manually save and restore all the
registers each time.
A fast, but more expensive, option would be to have multiple copies of
the register file which is then bank-switched on an interrupt.
One possibility here could be, rather than being hard-wired to specific modes, there are 4 assignable register banks controlled by 2 status
register bits.
Then, say:
0: User Task 1
1: User Task 2
2: Reserved for Kernel / Syscall Task;
3: Reserved for interrupts.
Possibly along with instructions to move between the banked registers
and the currently active register file.
Though, likely cost would be that it would require putting the GPR
register file in Block-RAM and possibly needing to increase pipeline
length.
In an OS, the syscall and interrupt bank would likely be assigned
statically, and the others could be assigned dynamically by the
scheduler (though, as-is, would likely increase task-switch overhead vs
the current mechanism).
This situation could potentially be "better" if there were 8 dynamic
banks, with the scheduler potentially able to be clever and reuse banks
if they haven't been evicted and the same process is run again (but
could otherwise reassign them round-robin or similar).
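A minimal sketch of the bank-reuse idea, as a toy software model rather than a hardware design. The class and method names are hypothetical, and the policy (reuse a bank if the task still owns it, otherwise fill an empty bank, otherwise evict round-robin) is just one reading of the paragraph above:

```python
class BankAllocator:
    """Toy model of a scheduler handing out register banks to tasks.

    Assumption: a task whose bank has not been reassigned can resume
    without reloading registers; otherwise a bank is taken round-robin
    and the previous owner's registers would have to be spilled.
    """

    def __init__(self, num_banks: int = 8):
        self.owner = [None] * num_banks   # task id currently held in each bank
        self.next_victim = 0              # round-robin eviction pointer

    def acquire(self, task_id):
        """Return (bank, hit); hit is True when no save/restore is needed."""
        if task_id in self.owner:
            return self.owner.index(task_id), True
        if None in self.owner:
            bank = self.owner.index(None)  # prefer a bank nobody is using
        else:
            bank = self.next_victim        # evict round-robin
            self.next_victim = (self.next_victim + 1) % len(self.owner)
        self.owner[bank] = task_id
        return bank, False
```

In this model a "hit" stands for a task switch that only needs to flip the bank-select bits, while a miss stands for the existing save/restore path.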
Though, can note that as-is, in my case, in some programs, system call overhead is high enough that all this could be worth looking into (Say:
Quake 3 manages to spend nearly 3% of the clock-cycle budget in the
SYSCALL ISR; mostly saving/restoring registers).
E) Exception control transfer can transfer control directly to a
..user privilege thread without taking an excursion through the
..Operating System.
? Putting the scheduler in hardware?...
Could make sense for a microcontroller, but less so for a conventional
OS as pretty much the only things handling interrupts are likely to be supervisor-mode drivers.
F) upon arrival at an exception handler, no state needs to be saved,
..and the "cause" of the exception is immediately available to the
..Exception handler.
G) Atomicity over a multiplicity of instructions and over a
..multiplicity of memory locations--without losing the
..illusion of real atomicity.
Memory consistency is hard...
H) Elementary Transcendental functions are first class citizens of
..the instruction set, and at least faithfully accurate and perform
..at the same speeds as SQRT and DIV.
.... Yeah...
In my case, they don't exist, and FDIV and FSQRT are basically boat
anchors.
Well, I guess it could be possible to support them in the ISA if they
were all boat anchors.
Say:
FSIN Rm, Rn
Raises a TRAPFPU exception, whereupon the exception handler decodes the instruction and performs the FSIN operation.
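A sketch of what such a trap-and-emulate handler could look like, written as a Python model. The 32-bit instruction layout and the opcode value below are invented for illustration only and do not correspond to any real encoding; registers are modelled as raw 64-bit integers holding IEEE double bit patterns:

```python
import math
import struct

# Hypothetical encoding, for this sketch only:
#   bits 31..16 = opcode, bits 15..8 = Rm (source), bits 7..0 = Rn (dest)
OP_FSIN = 0xF510

def bits_to_double(bits: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def double_to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def trap_fpu_handler(instr_word: int, regs: list) -> None:
    """Decode the faulting instruction and emulate it in software."""
    opcode = (instr_word >> 16) & 0xFFFF
    rm = (instr_word >> 8) & 0xFF
    rn = instr_word & 0xFF
    if opcode != OP_FSIN:
        raise NotImplementedError(f"unhandled opcode {opcode:#x}")
    regs[rn] = double_to_bits(math.sin(bits_to_double(regs[rm])))
```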
I) The "system programming model" is inherently:
..1) Virtual Machine
..2) Hypervisor + Supervisor
..3) multiprocessor, multithreaded
If the system-mode architecture is low-level enough, the difference
between normal OS functionality and emulation starts to break down.
Like, in both cases one has:
Software page table walking;
Needing to keep track of a virtual model of the TLB;
J) Simple applications can run with as little as 1 page of Memory
..Mapping overhead. An application like 'cat' can be run with
..a total allocated page count of 6: {MMU, Code, Data, BSS, Stack,
..and Register Files}
Hmm.
I guess one could make a case for a position-independent version of an "a.out" like format, focused on low-footprint binaries.
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
The weak model is ideal for me. I know how to program for it
and it's more efficient
and sometimes use cases do not care if they encounter "stale" data.
"Chris M. Thomasson" <[email protected]> writes:
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
There is no guarantee of staleness, only a lack of stronger ordering guarantees.
The weak model is ideal for me. I know how to program for it
And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.
and it's more efficient
That depends on the hardware.
Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.
That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
By contrast, one can design hardware for strong ordering such that the slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
and sometimes use cases do not care if they encounter "stale" data.
Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.
- anton
On 7/25/24 6:07 PM, MitchAlsup1 wrote:
On Thu, 25 Jul 2024 20:09:06 +0000, BGB wrote:
[snip]
On 7/24/2024 3:37 PM, MitchAlsup1 wrote:
[snip]
D) exception and interrupt control transfer should take no more
..than 1 cache line read followed by 4 cache line reads to the
..same page in DRAM/L3/L2 that are dependent on the first cache
..line read. Control transfer back to the suspended thread should
..be no longer than the control transfer to the exception handler.
A fast, but more expensive, option would be to have multiple copies of
the register file which is then bank-switched on an interrupt.
Under My 66000 a low end implementation can choose the write back cache
version, while the GBOoO implementation can choose the bank switcher.
In both cases, the same model is presented to executing SW.
I do not know at what port count a "3D register file" (temporal
banking where extra storage "hides" under the wires) makes sense.
I suspect the 3-read, 1-write register file of a low end My 66000 implementation would have the overhead be too great unless lower
overhead context switching was extremely important.
On 7/26/2024 12:00 PM, Anton Ertl wrote:
"Chris M. Thomasson" <[email protected]> writes:
and it's more efficient
That depends on the hardware.
Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
TSO requires more significant hardware complexity though.
Seems like it would be harder to debug the hardware since:
There is more that has to go on in the hardware for TSO to work;
Software will have higher expectations that it actually work.
On 7/26/2024 10:00 AM, Anton Ertl wrote:
"Chris M. Thomasson" <[email protected]> writes:
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
There is no guarantee of staleness, only a lack of stronger ordering
guarantees.
The weak model is ideal for me. I know how to program for it
And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.
and it's more efficient
That depends on the hardware.
Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware. That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
and sometimes use cases do not care if they encounter "stale" data.
Great. Unless these "sometimes" cases are more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual communication between cores, the weak model
is still slower than a well-implemented strong model.
A strong model? You mean I don't have to use any memory barriers at all?
Tell that to SPARC in RMO mode... How strong? Even the x86 requires a
membar when a store followed by a load to another location shall be
respected wrt order. Store-Load. #StoreLoad over on SPARC. ;^)
If you can force everything to be #StoreLoad (*) and make it faster than
a handcrafted algo on a very weak memory system, well, hats off! I
thought it was easier for a HW guy to implement weak consistency? At the
cost of the increased complexity wrt programming the sucker! ;^)
(*) Not just #StoreLoad for full consistency, you would need :
MEMBAR #StoreLoad | #LoadStore | #StoreStore | #LoadLoad
right?
Otherwise, stuff isn't going to fit into the FPGAs.
Something like TSO is a lot of complexity for not much gain.
Contrast, floating point and precise exceptions are a lot more relevant
to software.
However... There is "special" mutex logic that actually requires a #StoreLoad! Peterson's algorithm for example. Iirc, it needs a #StoreLoad because it depends on a store followed by a load to another location to hold true. This is a bit different than other locking algorithms...
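To illustrate why that store-then-load pattern needs a #StoreLoad, here is a small model (a sketch, not a description of any real CPU). It enumerates every execution of a simplified entry protocol, "store my flag, then load the other flag", on a toy machine where each core has a store buffer. Peterson's `turn` variable is omitted for brevity, and all names are made up. If both cores can read 0, both would enter the critical section:

```python
def _set(tup, i, v):
    """Return a copy of tuple tup with element i replaced by v."""
    return tup[:i] + (v,) + tup[i + 1:]

def reachable_outcomes(fence: bool):
    """Enumerate (r0, r1) results of the litmus test

        T0: flag0 = 1; r0 = flag1        T1: flag1 = 1; r1 = flag0

    With fence=True a core may not load until its own store buffer has
    drained, which models a #StoreLoad barrier between store and load.
    """
    outcomes = set()
    # State: memory (flag0, flag1), per-thread pc (0=store, 1=load, 2=done),
    # per-thread "store still buffered" bit, per-thread loaded value.
    start = ((0, 0), (0, 0), (False, False), (None, None))
    stack, seen = [start], {start}
    while stack:
        mem, pc, buf, regs = stack.pop()
        if pc == (2, 2):
            outcomes.add(regs)
            continue
        successors = []
        for t in (0, 1):
            other = 1 - t
            if pc[t] == 0:
                # The store goes into the local buffer only.
                successors.append((mem, _set(pc, t, 1), _set(buf, t, True), regs))
            elif pc[t] == 1 and not (fence and buf[t]):
                # The other flag is never in our own buffer, so read memory.
                successors.append((mem, _set(pc, t, 2), buf, _set(regs, t, mem[other])))
            if buf[t]:
                # The buffer drains to memory at some arbitrary later time.
                successors.append((_set(mem, t, 1), pc, _set(buf, t, False), regs))
        for s in successors:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return outcomes
```

In this model, `(0, 0) in reachable_outcomes(False)` holds (both loads can pass the buffered stores), while with `fence=True` the `(0, 0)` outcome disappears.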
Then there are more "exotic" methods such as so-called asymmetric mutexes. They can have fast paths and slow paths, so to speak. It's almost getting into the realm of RCU here... A fast path can be memory barrier free. The slow path can make things consistent with the use of so called "remote" memory barriers. It's funny that Windows seems to have one:
https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
;^)
The slow path is meant to not be frequently used, hence the term asymmetric. On par with read/write logic... :^)
On 7/30/2024 12:59 PM, jseigh wrote:
The folly library hazard pointers use that on windows, membarrier() system call on linux (something else on older linuxes), to get rid of the expensive store/load memory barrier in hazard pointer loads.
I need to check that out; thanks for the heads up. Fwiw, remember that old thread on comp.programming.threads where you first'ish published your ideas on RCU+SMR? I need to see if the folly library references your work. Also, remember when some paper from SUN or something was trying to claim your atomic_ptr logic? Iirc, we talked about it back on comp.arch, a long time ago...
I remember you issued a "challenge like" post over on comp.programming.threads wrt detecting quiescent periods. Iirc, I was the first one to comment wrt a possible hackish solution using timing wrt kernel time. ;^)
Something like 0.7 nsecs w/o membar vs 7.7 w/ membar. The term I've seen being used now is asymmetric memory barrier.
Big time! This is bringing back a lot of memories Joe. :^) Thanks.
Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).
BGB <[email protected]> writes:
Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the
rasterizer module. It has its own internal caches that need to be
flushed, and not flushing caches (between this module and CPU) when
trying to "transfer" control over things like the framebuffer or
Z-buffer, can result in obvious graphical issues (and,
texture-corruption doesn't necessarily look good either).
The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.
- anton
[email protected] (Anton Ertl) writes:
BGB <[email protected]> writes:
Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).
The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:
[email protected] (Anton Ertl) writes:
BGB <[email protected]> writes:
Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).
The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
So, what definition does ARM apply to 'allocate' ??
[email protected] (MitchAlsup1) writes:
On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:
[email protected] (Anton Ertl) writes:
BGB <[email protected]> writes:
Some amount of the cases where consistency issues have come up in my
case have to do with RAM-backed hardware devices, like the rasterizer
module. It has its own internal caches that need to be flushed, and not
flushing caches (between this module and CPU) when trying to "transfer"
control over things like the framebuffer or Z-buffer, can result in
obvious graphical issues (and, texture-corruption doesn't necessarily
look good either).
The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the way to
go.
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
"no allocate" for CPU initiated stores/loads would be equivalent
to write-through-but-do-not-allocate-a-line-for-it (e.g.,
non-temporal stores/loads). There are instructions for
individual N/T accesses, but with the region attributes it
can be applied to normal loads/stores for a whole page
or set of pages.
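The effect of an "allocate" versus "no allocate" hint can be sketched with a tiny cache model. This is a generic fully-associative LRU simulation with made-up names and sizes, not a model of ARM's actual attribute semantics. It shows that one-pass streaming traffic only evicts a reused working set if it is allowed to allocate lines:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny fully-associative LRU cache model, one entry per line address."""

    def __init__(self, num_lines: int):
        self.num_lines = num_lines
        self.lines = OrderedDict()

    def access(self, addr: int, allocate: bool = True) -> bool:
        """Return True on a hit. On a miss, install the line only if allocate."""
        if addr in self.lines:
            self.lines.move_to_end(addr)        # mark most recently used
            return True
        if allocate:
            if len(self.lines) >= self.num_lines:
                self.lines.popitem(last=False)  # evict least recently used
            self.lines[addr] = True
        return False

def hot_set_hits(stream_allocates: bool) -> int:
    """Count hits on a small hot set after a streaming pass over other data."""
    cache = LRUCache(num_lines=8)
    hot = range(4)                      # small working set that gets reused
    for a in hot:
        cache.access(a)                 # warm up
    for a in range(1000, 1064):         # one-pass data, e.g. a large DMA buffer
        cache.access(a, allocate=stream_allocates)
    return sum(cache.access(a) for a in hot)
```

With allocation enabled for the streaming pass, none of the four hot lines survive; with allocation disabled, all four still hit.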
On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
On Thu, 1 Aug 2024 17:39:24 +0000, Scott Lurndal wrote:
[email protected] (Anton Ertl) writes:
BGB <[email protected]> writes:
Some amount of the cases where consistency issues have come up
in my case have to do with RAM-backed hardware devices, like the
rasterizer module. It has its own internal caches that need to
be flushed, and not flushing caches (between this module and
CPU) when trying to "transfer" control over things like the
framebuffer or Z-buffer, can result in obvious graphical issues
(and, texture-corruption doesn't necessarily look good either).
The approach taken on AMD64 CPUs is to have different memory types
(and associated memory type range registers). Plain DRAM is
write-back cached, but there is also write-through and uncacheable
memory. For a frame buffer that is read by some hardware that can
access the memory independently, write-through seems to be the
way to go.
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read
allocate', 'write allocate' as well as having optionally
multiple coherency domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
Thanks for the explanation.
In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
Meanwhile, cores (or other devices) can access it directly
from LLC as if it were from DRAM except at lower latency.
On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
Thanks for the explanation.
In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
When using memmove() or memset() data is moved on page sized
boundaries over the "bus".
[email protected] (MitchAlsup1) writes:
On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
Thanks for the explanation.
In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.
When using memmove() or memset() data is moved on page sized
boundaries over the "bus".
IME the majority of memset calls are for relatively small
(less than a page) regions.
On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read allocate',
'write allocate' as well as having optionally multiple coherency
domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
Thanks for the explanation.
In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.
Fair enough. But after thinking about this for a while, does the
process performing the file copy even know it is doing a file
copy ?? for example::
cat ../mydir/myfile > ../yourdir/yourfile
Which kind of applications know they are doing Input that will
not be used rather presently ??
On Fri, 2 Aug 2024 14:05:25 +0000, Scott Lurndal wrote:
[email protected] (MitchAlsup1) writes:
On Thu, 1 Aug 2024 20:34:28 +0000, Scott Lurndal wrote:
In addition, ARM64 CPUs include allocation hints in
the memory type such as 'read allocate', 'transient read
allocate', 'write allocate' as well as having optionally
multiple coherency domains (inner and outer sharable).
Sorry, I don't understand the word 'allocate' ?!?
"allocate a cache line".
Example would be a DMA request with the 'read allocate' hint
is allowed to be allocated in LLC instead of being stored in
DRAM.
Used when software expects the DMA data to be used immediately.
Thanks for the explanation.
In my case LLC is simply the front end for DRAM so a device
write will spew data into LLC where it will wait to be written.
I'm not sure that's a good idea. Large DMAs are common
(e.g. reading pages of data in a single I/O) and the data
from the DMA is not always used by the CPU. Evicting LLC lines to
accommodate a file copy, for example, seems less than optimal.
Fair enough. But after thinking about this for a while, does the
process performing the file copy even know it is doing a file
copy ?? for example::
cat ../mydir/myfile > ../yourdir/yourfile
Which kind of applications know they are doing Input that will
not be used rather presently ??
It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.
MitchAlsup1 wrote:
It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.
I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect that
the savings are not worth the effort.
"Stephen Fuld" <[email protected]d> writes:
MitchAlsup1 wrote:
It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.
I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect
that the savings are not worth the effort.
It would be more logical, I think, to simply build the functionality
into the controller (when the source and destination are devices
attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
that sort of functionality was available on some SCSI controllers.
For the case where devices are on multiple controllers, PCI express peer-to-peer would be the appropriate solution. There's no need
for the CPU and cache complex to be involved at all.
Shades of channel programs...
Scott Lurndal wrote:
"Stephen Fuld" <[email protected]d> writes:
MitchAlsup1 wrote:
It seems to me that a file copy application would understand
that writing of DRAM is irrelevant when the true destination
is another sector on another disk, and any means to connect
those does is more than sufficient.
I suppose you could create a mechanism that fed the data from the
"read" DMA directly to the "write" DMA, thus bypassing not only the
cache, but saving DRAM bandwidth as well. This would help on
copies, and perhaps things like defrag and backup. But I suspect
that the savings are not worth the effort.
It would be more logical, I think, to simply build the functionality
into the controller (when the source and destination are devices
attached to that controller (e.g. SATA, SAS or NVMe)). IIRC,
that sort of functionality was available on some SCSI controllers.
For the case where devices are on multiple controllers, PCI express
peer-to-peer would be the appropriate solution. There's no need
for the CPU and cache complex to be involved at all.
Yes, thank you. The PCI Express option was the kind of thing I was
thinking of. Since it is more general than the "in controller option",
if you implement it at the PCI level, then you don't need the
controller option.
But even though the savings are real, given the limited use case for
the feature, I question if it is worth the trouble.
Shades of channel programs...
Not nearly as flexible as channel programs, nor with their overhead.