So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
- anton
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no checking required.
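For contrast with a single-instruction move, here is a minimal sketch of the one check every software memmove has to make: choosing a copy direction so that overlapping bytes are read before they are overwritten. The name `sketch_memmove` is hypothetical, and this is not how the glibc variants mentioned above are written; those add many size- and alignment-specific paths, which is where the code size goes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-at-a-time memmove: pick the copy direction so that
   overlapping source bytes are read before they are overwritten. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        /* Destination starts below source: copy ascending. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else if ((uintptr_t)d > (uintptr_t)s) {
        /* Destination starts above source: copy descending. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    /* d == s: nothing to do. */
    return dst;
}
```

Calling `sketch_memmove(buf + 2, buf, 4)` on an overlapping buffer takes the descending path; swapping the two pointer arguments takes the ascending one.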
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.
Presumably interruptible and resumable ...
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
So you have a VAX-style “first part done” processor status bit? And you use architectural registers to save/restore the state of an instruction in progress at the time of an interrupt?
So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.
So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.
What is different about MM compared to `rep movsb` that you can
confidently state that it will always be optimal?
Stefan
On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?
On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?
On Thu, 13 Mar 2025 02:34:11 -0000 (UTC)
Lawrence D'Oliveiro <[email protected]d> wrote:
On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?
According to my understanding, no and no.
Mitch has an instruction that saves architectural+microarchitectural
context in memory, and any interrupt or exception has to use it.
The architectural part of the saved buffer is documented. The
microarchitectural part, apart from its size, not so much.
That is, according to my understanding. Take it with the amount of salt
you find appropriate.
On 2025-03-13 1:48 a.m., EricP wrote:
Lawrence D'Oliveiro wrote:
On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:
On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:
On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:
My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.
Presumably interruptible and resumable ...
Yep; but also include able to take exceptions.
So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?
A safe buffer move doesn't need a FPD flag (VAX) or direction (x86)
as long as (a) you don't specify the order bytes are actually moved and
(b) you only specify that at the end the length register will be 0
and the buffer address values are unspecified.
If it executes in the background with its own local copy of registers it
does not need to save state. It might need a means to suspend or cancel
the operation though.
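The point about not needing an FPD or direction flag can be illustrated with a small sketch in which all progress lives in the three operand "registers". The struct and function names are hypothetical, and this says nothing about how My 66000, VAX, or x86 actually implement it; it only shows that the copy direction can be re-derived from the registers on every resume, and that the length register ends at 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register state for a resumable move instruction. */
struct move_regs {
    unsigned char *dst;
    const unsigned char *src;
    size_t len;              /* bytes still to move; 0 when finished */
};

/* Move at most `budget` bytes, then return as if interrupted.
   All progress lives in *r, so calling again resumes the move. */
void move_step(struct move_regs *r, size_t budget)
{
    size_t n = r->len < budget ? r->len : budget;

    if ((uintptr_t)r->dst <= (uintptr_t)r->src) {
        /* Ascending: consume from the front, advance both pointers.
           Both move by n, so dst <= src still holds on resume. */
        for (size_t i = 0; i < n; i++)
            r->dst[i] = r->src[i];
        r->dst += n;
        r->src += n;
    } else {
        /* Descending: consume from the back, pointers stay put. */
        for (size_t i = 0; i < n; i++)
            r->dst[r->len - 1 - i] = r->src[r->len - 1 - i];
    }
    r->len -= n;
}
```

A caller models interrupts by looping `while (r.len > 0) move_step(&r, 2);`. The final pointer values differ between the two paths, which is consistent with leaving them unspecified.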
So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.
Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.
What is different about MM compared to `rep movsb` that you can
confidently state that it will always be optimal?
Stefan
What is different about MM compared to `rep movsb`
MM does not modify the pointers. MM keeps its current index,
thus the compiler can use the Rf pointer multiple times.
that you can confidently state that it will always be optimal?
Compared to the explosion in memmove() subroutine, yes.
Are you suggesting that what prevents Intel from making `rep movsb` optimal
is the fact that it modifies its pointers?
I have no experience implementing such an instruction, but I find it odd
that such a "cosmetic detail" would have such a profound impact on the
performance of an instruction. Can't they just "macroexpand" it during
decoding into two instructions (one which copies the bytes without
modifying the pointers, and then one which just adjusts the pointers)?
Stefan
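The "macroexpand" idea can be sketched as a software model with two steps: a copy that only reads the registers, then a separate step that applies the architectural register updates. The struct and function names are hypothetical, the model assumes DF = 0 (ascending copy), and it ignores interrupts and faults in the middle of the copy, which real hardware has to handle.

```c
#include <stddef.h>

/* Hypothetical model of the registers used by `rep movsb` (DF = 0). */
struct movs_regs {
    unsigned char *rdi;        /* destination pointer */
    const unsigned char *rsi;  /* source pointer */
    size_t rcx;                /* byte count */
};

/* Step 1: copy rcx bytes in ascending order; registers are only read. */
static void copy_no_update(const struct movs_regs *r)
{
    for (size_t i = 0; i < r->rcx; i++)
        r->rdi[i] = r->rsi[i];
}

/* Step 2: apply the register side effects after the copy is done. */
static void adjust_regs(struct movs_regs *r)
{
    r->rdi += r->rcx;
    r->rsi += r->rcx;
    r->rcx = 0;
}

/* The whole instruction as the two "macroexpanded" steps. */
void rep_movsb_model(struct movs_regs *r)
{
    copy_no_update(r);
    adjust_regs(r);
}
```

In this model the pointer update is indeed cheap and separable; whatever makes the real instruction hard has to come from elsewhere.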
[...]
On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
What is different about MM compared to `rep movsb`
But they never really "tried all that hard" to make them
continuously Optimal.
And they have "So Many" extra burdens,
such as when from is MMI/O space access and to is cache coherent, and
all sorts of other self imposed problems. Using MTRRs one can switch
the kind of memory to and from point in the middle of a REP MOVs.
All of which do nothing to make optimality easier.
My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
All of which do nothing to make optimality easier.
So, at a certain point in time, designers punt. If all competing
parties punt, nobody is put asunder.
I have no experience implementing such an instruction, but I find
it odd that such a "cosmetic detail" would have such a profound
impact on the performance of an instruction. Can't they just
"macroexpand" it during decoding into two instructions (one which
copies the bytes without modifying the pointers, and then one which
just adjusts the pointers)?
My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}
You may choose differently.
Stefan
MitchAlsup1 [2025-03-13 19:35:33] wrote:
[...]
[...]On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
What is different about MM compared to `rep movsb`
But they never really "tried all that hard" to make them
continuously Optimal.
But is there a reason to presume an implementer of My 66000 would have
the luxury of putting more effort into making MM "optimal" than Intel
put into making `rep movsb`?
And they have "So Many" extra burdens,
Ah, now you seem to be getting to the kind of answer I was looking for.
such as when from is MMI/O space access and to is cache coherent, and
all sorts of other self imposed problems. Using MTRRs one can switch
the kind of memory to and from point in the middle of a REP MOVs.
All of which do nothing to make optimality easier.
How does MM avoid those complexities?
My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
How does it know?
Is it because the ISA just says "don't do that" (I guess MM would then
signal an error if it happens?), or is there some underlying difference
in the way the semantics/cacheability of memory pages is specified which
makes it impossible to specify a memory range to MM where the semantics
change partway?
My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}
How do compilers get into the picture? I thought they were basically
ignorant of such subtleties of memory caching, as controlled by MTRRs.
Stefan
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:
How does MM avoid those complexities?
Compiler only produces MM and MS where the memory is known to be
contiguous, and My 66000 universal address space is 64-bits in width
for each kind of address space, so no MM can cross such a boundary
(unless the GuestOS is aiming a gun at its feet mucking with the
PTEs).
My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
How does it know?
4 × 64-bit PASs
1 × 64-bit VAS
The compiler uses MM for copying one chunk of virtually contiguous
memory to another chunk of vcm.
Compiler would not do this if there is any non-contiguousness. So, from
VAS, the access is well defined and compactly described.
During the performance of MM, a change in address space can fault
the performance allowing somebody more privileged to investigate.
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
On Wed, 12 Mar 2025 16:46:36 GMT
[email protected] (Anton Ertl) wrote:
Idiots from corporate IT blocked http://al.howardknight.net/
I feel with you. In my workplace, Usenet is blocked (probably
unintentionally). I have to post from home.
So, link to google groups
Sorry, I cannot provide that service. Trying to access
groups.google.com tells me:
|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.
I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.
For me it works fine without login. But not without JS.
For those who are willing to use JS, the link: https://groups.google.com/g/comp.arch/c/ULvFgEM_ZSY/m/ysPySToGAwAJ
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
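To make "straddles the boundary" concrete, here is a minimal sketch of a check that reports whether a copy range crosses from one memory type into another. Everything in it is an assumption for illustration: the table layout, the two-value `mem_type`, and the "first matching entry wins" rule are hypothetical and much simpler than real MTRR semantics.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum mem_type { MEM_UNCACHEABLE, MEM_WRITE_BACK };

/* Hypothetical range table entry covering [base, base + size). */
struct type_range {
    uint64_t base;
    uint64_t size;
    enum mem_type type;
};

/* Memory type at addr: first matching entry wins (an assumption of this
   sketch), with a default type for uncovered addresses. */
static enum mem_type type_at(const struct type_range *t, size_t n,
                             uint64_t addr, enum mem_type dflt)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= t[i].base && addr - t[i].base < t[i].size)
            return t[i].type;
    return dflt;
}

/* True if the memory type changes somewhere inside [addr, addr + len).
   Assumes len > 0 and that addr + len does not wrap around. */
bool copy_straddles(const struct type_range *t, size_t n,
                    uint64_t addr, uint64_t len, enum mem_type dflt)
{
    enum mem_type first = type_at(t, n, addr, dflt);

    /* The type can only change at a range start or end, so it is enough
       to test the edges that fall strictly inside the copy. */
    for (size_t i = 0; i < n; i++) {
        uint64_t edges[2] = { t[i].base, t[i].base + t[i].size };
        for (int k = 0; k < 2; k++) {
            uint64_t b = edges[k];
            if (b > addr && b - addr < len &&
                type_at(t, n, b, dflt) != first)
                return true;
        }
    }
    return false;
}
```

With a write-back range at 0x0000..0x0FFF followed by an uncacheable range at 0x1000..0x1FFF, a 0x200-byte copy starting at 0x0F00 straddles, while a 0x100-byte copy from the same address does not.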
Scott Lurndal [2025-03-13 21:42:25] wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Why/when would this happen in practice?
On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:
MitchAlsup1 [2025-03-13 19:35:33] wrote:
[...]
[...]On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
What is different about MM compared to `rep movsb`
But they never really "tried all that hard" to make them
continuously Optimal.
But is there a reason to presume an implementer of My 66000 would have
the luxury of putting more effort into making MM "optimal" than Intel
put into making `rep movsb`?
In one place we worked, there was a life-sized plastic turtle (1 ft
and 2 pounds). The engineer who made the least amount of forward
progress every week was assigned the turtle at the corner of his/her
cubicle.
We found this "motivating".
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.
Stefan Monnier <[email protected]> writes:
Scott Lurndal [2025-03-13 21:42:25] wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Why/when would this happen in practice?
Nobody said it was a good idea.
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
I think, Mitch had something less mundane in mind.
The REP MOV straddles the boundary between two MTRRs.
Why/when would this happen in practice?
Nobody said it was a good idea.
But if it doesn't happen in normal cases, then it shouldn't be
significant to performance. So is the problem that just detecting the
occurrence of this situation is already too costly to make `rep movsb`
fast?
[ Of course, I still haven't understood either why it technically can
happen in amd64 but not in My 66000. ]
Stefan
[ Of course, I still haven't understood either why it technically can
happen in amd64 but not in My 66000. ]
The cartesian product is smaller, more amenable to buffering and
caching, with more easily discovered (or eliminated) boundaries.
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is
MMI/O space access and to is cache coherent, and all sorts of
other self imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer,
Most systems I work with have an 'allocate' attribute on
inbound DMA operations that will allocate in a specified
cache level (typically LLC).
Most DMAs are far more than a hundred bytes, and the application
can be doing something else while the DMA is in process.
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.
Only by incompetent programmers.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Crossing boundary that way can typically be predicted far in
advance, so not really big problem.
I think, Mitch had something less mundane in mind.
I was just trying to illustrate why optimal REP-MOVS is more difficult
than a SW person might initially guesstimate.
One side might be a byte array down PCIe tree in config space,
while the destination is a line access only. Yeah, just try to
do this optimally.
Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
Yeah, just try doing that with MTRRs and system MMUs in the
way. ...
Michael S wrote:
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other
self imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer, so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Crossing boundary that way can typically be predicted far in
advance, so not really big problem.
I think, Mitch had something less mundane in mind.
Yeah, I read it as some other core modifying the relevant MTRRs in
the middle of the ongoing block move.
The solution seems somewhat obvious, i.e., any modification of an MTRR
which is involved in the move will cause a hw interrupt. Upon
restarting the remainder of the move, the new MTRR rules apply?
The alternative would be to specify that any block move is atomic as
seen from the MTRR rules, i.e., the update(s) only apply after the move
has finished?
Terje
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
I think, Mitch had something less mundane in mind.
On Thu, 13 Mar 2025 23:27:16 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
[snip]
For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer,
Most systems I work with have an 'allocate' attribute on
inbound DMA operations that will allocate in a specified
cache level (typically LLC).
Most DMAs are far more than a hundred bytes, and the application
can be doing something else while the DMA is in process.
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.
Only by incompetent programmers.
It has nothing to do with competence of programmers and everything to
do with modern computers having more cores than their users need.
This applies not only to client system but to at least 3/4th of the
servers as well.
On Fri, 14 Mar 2025 00:18:18 +0000
[email protected] (MitchAlsup1) wrote:
Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
Yeah, just try doing that with MTRRs and system MMUs in the
way. ...
Outside of graphics drivers for exotic multi-GPU setups, I don't see it
happening for the reasons not related to HW.
On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.
This case is pretty useful in practice.
Although mostly done with DMA controllers in these modern times
to offload from the CPU.
For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
On Thu, 13 Mar 2025 23:25:23 +0000, Scott Lurndal wrote:
Stefan Monnier <[email protected]> writes:
Scott Lurndal [2025-03-13 21:42:25] wrote:
Michael S <[email protected]> writes:
On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:
Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.
How exactly?
The REP MOV straddles the boundary between two MTRRs.
Why/when would this happen in practice?
Nobody said it was a good idea.
I can envision an attack strategy using this to "confuse"
someone in the higher privilege levels of the "system"
On Thu, 13 Mar 2025 22:16:19 +0000, Michael S wrote:
Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.
In any case, that's not what most uses of memcpy() or memmove(), or
rep movsb with their synchronous interfaces are about.
Michael S <[email protected]> writes:
For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.
The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.
As for the wrong level: The DMA engine transfers the data to the CPU
chip in any case: it contains all caches and the DRAM controller. It
might put the data in, e.g., L3 cache, marked dirty, for later
writeback to DRAM, and if a CPU accesses that memory soon, it will
only see the latency and bandwidth limits of L3.
I have certainly read about a project for high-speed network routing
where the network cards deliver the packets to L3, and the routing
software has to process each packet in an average of 70ns; if the
packets were delivered to DRAM, that speed would be impossible.
As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.
Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
It does not happen in practice, so making it fast or "optimal" by
using a prediction is not necessary.
Michael S <[email protected]> writes:
On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:
As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.
Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.
ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.
On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:
As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.
Transfer level speed would be faster with DMA, because CPU typically has
no way to issue Read requests for chunks of data that are bigger than 64
bytes.
On Fri, 14 Mar 2025 14:46:05 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:
As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.
Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.
ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.
Which does not contradict my statement above. 64 bytes are not bigger
than 64 bytes.
On Fri, 14 Mar 2025 14:51:21 +0000, Michael S wrote:
On Fri, 14 Mar 2025 14:46:05 GMT
[email protected] (Scott Lurndal) wrote:
Michael S <[email protected]> writes:
On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:
As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.
Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.
ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.
Which does not contradict my statement above. 64 bytes are not bigger
than 64 bytes.
Just a note:
My 66000 core doing Memmove() MM instruction::
unCacheable DRAM can do page at a time transfers
Cacheable DRAM can do line at a time transfers
over the interconnect.
The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.
Anton Ertl [2025-03-14 13:18:37] wrote:
The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.
Also, is the data transfer separate from the "disk" access? I'd expect
that the NVMe interface lets the CPU say "read block B and DMA it to
DRAM at address X" (after which we get an interrupt), so there is no
opportunity for a `rep movsb` or `MM` instruction to do part of the job.
Stefan
One place I worked, we serialized checkins using a rubber chicken,
which was hung on the cube wall.