Forum: >>> Magnum BBS <<<

Re: rep movsb vs. simpler instructions for memcpy/memmove

From MitchAlsup1@21:1/5 to Anton Ertl on Wed Mar 12 17:44:11 2025

On Wed, 12 Mar 2025 16:46:36 +0000, Anton Ertl wrote:

So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.

Imagine that:: almost a page for memmove entry points.

My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Thu Mar 13 00:03:51 2025

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no checking required.

Presumably interruptible and resumable ...

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 00:49:47 2025

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

From Lawrence D'Oliveiro@21:1/5 to All on Thu Mar 13 02:34:11 2025

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit? And you
use architectural registers to save/restore the state of an instruction in progress at the time of an interrupt?

From EricP@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 01:48:01 2025

Lawrence D'Oliveiro wrote:

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit? And you use architectural registers to save/restore the state of an instruction in progress at the time of an interrupt?

A safe buffer move doesn't need a FPD flag (VAX) or direction (x86)
as long as (a) you don't specify the order bytes are actually moved and
(b) you only specify that at the end the length register will be 0
and the buffer address values are unspecified.
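To illustrate why no saved direction or FPD flag is needed, here is a minimal C sketch of such a move. It is not any real ISA's definition: the `move_regs` struct, `move_step`, and the `budget` parameter (which stands in for an interrupt arriving) are hypothetical names. The only progress state is the register triple itself, and the copy direction is re-derived from it on every entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register triple for a move instruction. */
typedef struct {
    unsigned char       *dst;
    const unsigned char *src;
    size_t               len;  /* remaining bytes; 0 when the move is done */
} move_regs;

/* Move at most `budget` bytes, then return as if interrupted.
   No flag survives between calls: the direction is recomputed from
   the registers each time the move is (re)started. */
static void move_step(move_regs *r, size_t budget)
{
    assert(r != NULL);
    int backward = (uintptr_t)r->dst > (uintptr_t)r->src;
    while (r->len != 0 && budget != 0) {
        if (backward) {
            /* Copy the last remaining byte; the pointers stay put. */
            r->dst[r->len - 1] = r->src[r->len - 1];
        } else {
            /* Copy the first remaining byte; both pointers advance
               equally, so dst <= src still holds on re-entry. */
            *r->dst++ = *r->src++;
        }
        r->len--;
        budget--;
    }
}
```

Calling `move_step` repeatedly until `len` reaches 0 gives the same buffer contents as one uninterrupted `memmove`, whatever budget each call gets. The final pointer values differ between the two directions, which is why the sketch, like the post, leaves them unspecified.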

From Stefan Monnier@21:1/5 to All on Thu Mar 13 09:40:09 2025

So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.

Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

What is different about MM compared to `rep movsb` that you can
confidently state that it will always be optimal?

Stefan

From Michael S@21:1/5 to Stefan Monnier on Thu Mar 13 16:23:47 2025

On Thu, 13 Mar 2025 09:40:09 -0400
Stefan Monnier wrote:

So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.

Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

What is different about MM compared to `rep movsb` that you can
confidently state that it will always be optimal?

Stefan

Paper is different from silicon. Far superior.

From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 16:29:54 2025

On Thu, 13 Mar 2025 02:34:11 -0000 (UTC)
Lawrence D'Oliveiro wrote:

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?

According to my understanding, no and no.
Mitch has instruction that saves architectural+microarchitectural
context in memory and any interrupt or exception has to use it.
Architectural part of saved buffer is documented. Microarchitectural
part, apart from its size, not so much.
That is, according to my understanding. Take it with amount of salt you
find appropriate.

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 16:10:46 2025

On Thu, 13 Mar 2025 2:34:11 +0000, Lawrence D'Oliveiro wrote:

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit?

In effect, yes; however it is a valid bit on a "remaining count"
control register.

And you
use architectural registers to save/restore the state of an instruction
in progress at the time of an interrupt?

Technically, registers are saved and reloaded as if the RF was
4 lines of write back cache--5 lines if you count the thread header.

From MitchAlsup1@21:1/5 to Michael S on Thu Mar 13 16:19:53 2025

On Thu, 13 Mar 2025 14:29:54 +0000, Michael S wrote:

On Thu, 13 Mar 2025 02:34:11 -0000 (UTC)
Lawrence D'Oliveiro wrote:

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?

According to my understanding, no and no.
Mitch has instruction

s/instruction/hardware means/

that saves architectural+microarchitectural
context in memory and any interrupt or exception has to use it.

Technically, you don't have to use it; it happens automatically.
In one instant you are executing thread[k] in core[j], the next
instant you are executing thread[m] in core[j] without SW overhead.
Thread[k] and thread[m] are not related and share no state.

Architectural part of saved buffer is documented. Microarchitectural
part, apart from its size, not so much.
That is, according to my understanding. Take it with amount of salt you
find appropriate.

From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 13 16:15:54 2025

On Thu, 13 Mar 2025 9:10:36 +0000, Robert Finch wrote:

On 2025-03-13 1:48 a.m., EricP wrote:

Lawrence D'Oliveiro wrote:

On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
checking required.

Presumably interruptible and resumable ...

Yep; but also include able to take exceptions.

So you have a VAX-style “first part done” processor status bit? And
you use architectural registers to save/restore the state of an
instruction in progress at the time of an interrupt?

A safe buffer move doesn't need a FPD flag (VAX) or direction (x86)
as long as (a) you don't specify the order bytes are actually moved and
(b) you only specify that at the end the length register will be 0
and the buffer address values are unspecified.

If it executes in the background with its own local copy of registers it
does not need to save state. It might need a means to suspend or cancel
the operation though.

MM is NOT   MOV  (Rt)+,(Rf)+

MM is       MOV  [Rt,µi<<width],[Rf,µi<<width]
            ADD  µi,µi,width

Where µi is not visible to the thread. Thus Rt and Rf are not modified.

Since the pointers are not modified, the whole MOV can be bundled up
and moved (sic) to whatever layer of the cache hierarchy appropriate.
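The contrast can be modelled in C. This is a sketch under assumptions, not the My 66000 definition: `mm_hidden`, `mm_model`, and `budget` are hypothetical names, the model moves one byte per step rather than `width`-sized units, and picking a backward order for overlapping operands is an assumption about how memmove semantics would be preserved. What it shows is that the thread-visible operands (`to`, `from`, `count`) are never written; all progress lives in the hidden index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of state the thread cannot see. */
typedef struct {
    size_t index;  /* plays the role of the hidden index µi */
    bool   valid;  /* true while a move is partially complete */
} mm_hidden;

/* Move up to `budget` bytes, then return; true means the move is done.
   to, from and count are read-only inputs on every call. */
static bool mm_model(unsigned char *to, const unsigned char *from,
                     size_t count, mm_hidden *h, size_t budget)
{
    assert(to != NULL && from != NULL && h != NULL);
    if (!h->valid) {               /* fresh start rather than a resume */
        h->index = 0;
        h->valid = true;
    }
    /* Assumed overlap handling: walk from the top when to > from. */
    bool backward = (uintptr_t)to > (uintptr_t)from;
    while (h->index < count && budget != 0) {
        size_t i = backward ? count - 1 - h->index : h->index;
        to[i] = from[i];
        h->index++;
        budget--;
    }
    if (h->index >= count)
        h->valid = false;          /* finished; hidden state is dead */
    return !h->valid;
}
```

Because the caller's pointers are untouched after the move, the same `from` value can feed a later operation without being saved or recomputed, which is the reuse Mitch describes.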

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 16:24:03 2025

On Thu, 13 Mar 2025 13:40:09 +0000, Stefan Monnier wrote:

So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
still the biggest implementation, but many others are quite a bit
bigger than the 0x113=275 bytes of my ssememmove.

Imagine that:: almost a page for memmove entry points.
My 66000 has MM memmove as an instruction (4-bytes) always
optimal, no checking required.

What is different about MM compared to `rep movsb`

MM does not modify the pointers. MM keeps its current index,
thus the compiler can use the Rf pointer multiple times.

that you can
confidently state that it will always be optimal?

Compared to the explosion in memmove() subroutine, yes.
Compared to a device living on My 66000 interconnect, maybe.
Compared to executing instructions on a core, yes.

Stefan

From Stefan Monnier@21:1/5 to All on Thu Mar 13 12:43:07 2025

What is different about MM compared to `rep movsb`

MM does not modify the pointers. MM keeps its current index,
thus the compiler can use the Rf pointer multiple times.

that you can confidently state that it will always be optimal?

Compared to the explosion in memmove() subroutine, yes.

Are you suggesting that what prevents Intel to make `rep movsb` optimal
is the fact that it modifies its pointers?

I have no experience implementing such an instruction, but I find it odd
that such a "cosmetic detail" would have such a profound impact on the
performance of an instruction. Can't they just "macroexpand" it during
decoding into two instructions (one which copies the bytes without
modifying the pointers, and then one which just adjusts the pointers)?

Stefan
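The two-step expansion proposed above can be written down as a small C model. It is only a sketch: `movs_regs`, `copy_only`, `adjust_only`, and `rep_movsb_expanded` are hypothetical names, the direction flag is assumed clear (forward copy), and the model deliberately ignores interrupts and faults arriving mid-copy, where the real instruction has to expose partially updated registers.

```c
#include <assert.h>
#include <stddef.h>

/* Software stand-in for the registers `rep movsb` uses. */
typedef struct {
    unsigned char       *rdi;  /* destination pointer */
    const unsigned char *rsi;  /* source pointer */
    size_t               rcx;  /* byte count */
} movs_regs;

/* Step 1: copy the bytes without writing any register. */
static void copy_only(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Step 2: apply the architectural register updates in one go. */
static void adjust_only(movs_regs *r)
{
    r->rdi += r->rcx;
    r->rsi += r->rcx;
    r->rcx = 0;
}

/* The "macroexpanded" form: one copy step, then one adjust step. */
static void rep_movsb_expanded(movs_regs *r)
{
    assert(r != NULL);
    copy_only(r->rdi, r->rsi, r->rcx);
    adjust_only(r);
}
```

In the uninterrupted case the final register values match what a forward `rep movsb` leaves behind. The case the model leaves out is exactly the hard one: if the copy stops partway, the adjust step has to reflect how far it got.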

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 19:35:33 2025

On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:

What is different about MM compared to `rep movsb`

MM does not modify the pointers. MM keeps its current index,
thus the compiler can use the Rf pointer multiple times.

that you can confidently state that it will always be optimal?

Compared to the explosion in memmove() subroutine, yes.

Are you suggesting that what prevents Intel to make `rep movsb` optimal
is the fact that it modifies its pointers?

Certainly does not help.

But they never really "tried all that hard" to make them continuously
Optimal.

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems. Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs. All of which do
nothing to make optimality easier.

So, at a certain point in time, designers punt. If all competing
parties punt, nobody is put asunder.

I have no experience implementing such an instruction, but I find it odd
that such a "cosmetic detail" would have such a profound impact on the
performance of an instruction. Can't they just "macroexpand" it during
decoding into two instructions (one which copies the bytes without
modifying the pointers, and then one which just adjusts the pointers)?

My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}

You may choose differently.

Stefan

From Stefan Monnier@21:1/5 to All on Thu Mar 13 15:53:25 2025

MitchAlsup1 [2025-03-13 19:35:33] wrote:
[...]

On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:

What is different about MM compared to `rep movsb`

[...]

But they never really "tried all that hard" to make them
continuously Optimal.

But is there a reason to presume an implementer of My 66000 would have
the luxury of putting more efforts into making MM "optimal" than Intel put
into making `rep movsb`?

And they have "So Many" extra burdens,

Ah, now you seem to be getting to the kind of answer I was looking for.

such as when from is MMI/O space access and to is cache coherent, and
all sorts of other self imposed problems. Using MTRRs one can switch
the kind of memory to and from point in the middle of a REP MOVs.
All of which do nothing to make optimality easier.

How does MM avoid those complexities?

My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).

How does it know? Is it because the ISA just says "don't do that" (I
guess MM would then signal an error if it happens?), or is there some underlying difference to the way the semantics/cachability of memory
pages is specified which makes it impossible to specify a memory range
to MM where the semantics changes partways?

My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}

How do compilers get into the picture? I thought they were basically
ignorant of such subtleties of memory caching, as controlled by MTRRs.

Stefan

From Michael S@21:1/5 to [email protected] on Thu Mar 13 22:55:16 2025

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

All of which do nothing to make optimality easier.

So, at a certain point in time, designers punt. If all competing
parties punt, nobody is put asunder.

I have no experience implementing such an instruction, but I find
it odd that such a "cosmetic detail" would have such a profound
impact on the performance of an instruction. Can't they just
"macroexpand" it during decoding into two instructions (one which
copies the bytes without modifying the pointers, and then one which
just adjusts the pointers)?

My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).
My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}

How high are you aiming?
How many bytes per clock when source and destination do not overlap
and both reside in L1D$ ?
How many bytes when one side in L1D$ and another in L2$?

If the answer is less than 50 in the first case and less than 30 in the
2nd case then you are aiming uninterestingly low.

You may choose differently.

Stefan

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 20:59:26 2025

On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:

MitchAlsup1 [2025-03-13 19:35:33] wrote:
[...]

On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:

What is different about MM compared to `rep movsb`

[...]

But they never really "tried all that hard" to make them
continuously Optimal.

But is there a reason to presume an implementer of My 66000 would have
the luxury of putting more efforts into making MM "optimal" than Intel put
into making `rep movsb`?

In one place we worked, there was a life-sized plastic turtle (1-ft
and 2 pounds). And the engineer who made the least amount of forward
progress every week was assigned the turtle at the corner of his/her
cubicle.

We found this "motivating"

And then there would be "me" assessing their accomplishments and
making fast MM and MS a priority goal to "him".

And they have "So Many" extra burdens,

Ah, now you seem to be getting to the kind of answer I was looking for.

such as when from is MMI/O space access and to is cache coherent, and
all sorts of other self imposed problems. Using MTRRs one can switch
the kind of memory to and from point in the middle of a REP MOVs.
All of which do nothing to make optimality easier.

How does MM avoid those complexities?

Compiler only produces MM and MS where the memory is known to be
contiguous, and My 66000 universal address space is 64-bits in width
for each kind of address space, so no MM can cross such a boundary
(unless the GuestOS is aiming a gun at its feet mucking with the PTEs).

My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).

How does it know?

4 × 64-bit PASs
1 × 64-bit VAS

And how the compiler selects using MM or MS.
So, the compiler and Guest OS have to both make different mistakes that
are then not caught by Hypervisor translation tables.

Is it because the ISA just says "don't do that" (I
guess MM would then signal an error if it happens?), or is there some underlying difference to the way the semantics/cachability of memory
pages is specified which makes it impossible to specify a memory range
to MM where the semantics changes partways?

My compilers don't create such problems for HW to solve. {That is;
the truly horrific x86 optimality problems don't exist.}

How do compilers get into the picture? I thought they were basically
ignorant of such subtleties of memory caching, as controlled by MTRRs.

The compiler uses MM for copying one chunk of virtually contiguous
memory to another chunk of vcm.

Compiler would not do this if there is any non-contiguousness. So, from
VAS, the access is well defined and compactly described.

During the performance of MM, a change in address space can fault
the performance allowing somebody more privileged to investigate.

Stefan

From Scott Lurndal@21:1/5 to Michael S on Thu Mar 13 21:42:25 2025

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.
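A small C sketch of what "straddling" means for a copy operand. The range table here is a deliberately simplified stand-in: real MTRRs are described by base/mask pairs and fixed ranges, which this does not model, and `mem_range`, `run_in_one_range`, `straddles`, and the type codes are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified memory-type range covering [base, base + size). */
typedef struct {
    uint64_t base;
    uint64_t size;
    int      type;  /* hypothetical code, e.g. 0 = write-back, 1 = uncacheable */
} mem_range;

/* Return how many bytes starting at addr (at most len) lie inside a
   single range, and report that range's type through type_out.
   Returns 0 and type -1 if addr is not covered by any range. */
static uint64_t run_in_one_range(const mem_range *tab, size_t n,
                                 uint64_t addr, uint64_t len, int *type_out)
{
    assert(tab != NULL && type_out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].base && addr - tab[i].base < tab[i].size) {
            uint64_t room = tab[i].size - (addr - tab[i].base);
            *type_out = tab[i].type;
            return len < room ? len : room;
        }
    }
    *type_out = -1;
    return 0;
}

/* An operand straddles a boundary when its first single-type run is
   shorter than the whole operand. */
static int straddles(const mem_range *tab, size_t n,
                     uint64_t addr, uint64_t len)
{
    int type;
    return run_in_one_range(tab, n, addr, len, &type) < len;
}
```

With two adjacent 4 KiB ranges of different types, a 0x200-byte operand starting at 0x0F00 gets a first run of 0x100 bytes and therefore straddles; the same check has to be made for the source and the destination independently.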

From Stefan Monnier@21:1/5 to All on Thu Mar 13 17:44:53 2025

MitchAlsup1 [2025-03-13 20:59:26] wrote:

On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:

How does MM avoid those complexities?

Compiler only produces MM and MS where the memory is known to be
contiguous, and My 66000 universal address space is 64-bits in width
for each kind of address space, so no MM can cross such a boundary
(unless the GuestOS is aiming a gun at its feet mucking with the
PTEs).

What is `MS`?
Isn't it also the case in `rep movsb`?
At least assuming an OS like Linux?

My 66000 happens to know that memory space changes will not happen
in the middle of these kinds of things (including vectorized Loops).

How does it know?

4 × 64-bit PASs
1 × 64-bit VAS

Sorry, I think you went too fast, you lost me here. I'm just a poor
compiler guy with a side-interest in computer architecture.

Presumably the needs that MTRRs satisfy can also be satisfied in My
66000, so I guess what I'm missing here is how My 66000's solution is
different from the amd64/i386 one and how that ends up providing MM with
a guarantee that it doesn't need to care?

It seems to me, there might still be cases where a My 66000 system might
want to copy bytes between a network card buffer and DRAM, so while
I don't expect the cachability of either source or destination to change
in the middle of an MM operation (and I similarly would be fine with
a `rep movsb` that becomes slow if this ever happens), I do expect MM operations to transfer data between areas that don't have the
same cachability.

The compiler uses MM for copying one chunk of virtually contiguous
memory to another chunk of vcm.
Compiler would not do this if there is any non-contiguousness. So, from
VAS, the access is well defined and compactly described.

In which way does this not also apply to `rep movsb`?

During the performance of MM, a change in address space can fault
the performance allowing somebody more privileged to investigate.

What do you mean by "change in address space"?

Stefan

From Stefan Monnier@21:1/5 to All on Thu Mar 13 17:48:48 2025

Scott Lurndal [2025-03-13 21:42:25] wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Stefan

From MitchAlsup1@21:1/5 to Michael S on Thu Mar 13 22:07:08 2025

On Thu, 13 Mar 2025 21:06:02 +0000, Michael S wrote:

On Wed, 12 Mar 2025 16:46:36 GMT
Anton Ertl wrote:

-------------

Idiots from corporate IT blocked http://al.howardknight.net/

I feel with you. In my workplace, Usenet is blocked (probably
unintentionally). I have to post from home.

So, link to google groups

Sorry, I cannot provide that service. Trying to access
groups.google.com tells me:

|Couldn’t sign you in
|
|The browser you’re using doesn’t support JavaScript, or has JavaScript
|turned off.
|
|To keep your Google Account secure, try signing in on a browser that
|has JavaScript turned on.

I certainly won't turn on JavaScript for Google, and apparently Google
wants me to log in to a Google account to access groups.google.com. I
don't have a Google account and I don't want one.

For me it works fine without login. But not without JS.
For those who are willing to use JS, the link: https://groups.google.com/g/comp.arch/c/ULvFgEM_ZSY/m/ysPySToGAwAJ

Prior to the attack 9 months ago:: Google Groups was happy
to use my AOL.email.address. I have not tried recently.

From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 00:16:19 2025

On Thu, 13 Mar 2025 21:42:25 GMT
Scott Lurndal wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to a few hundred bytes it would be slower. For a few thousand bytes
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
I think, Mitch had something less mundane in mind.

From Scott Lurndal@21:1/5 to Stefan Monnier on Thu Mar 13 23:25:23 2025

Stefan Monnier writes:

Scott Lurndal [2025-03-13 21:42:25] wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Nobody said it was a good idea.

From Scott Lurndal@21:1/5 to [email protected] on Thu Mar 13 23:24:03 2025

MitchAlsup1 writes:

On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:

MitchAlsup1 [2025-03-13 19:35:33] wrote:
[...]

On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:

What is different about MM compared to `rep movsb`

[...]

But they never really "tried all that hard" to make them
continuously Optimal.

But is there a reason to presume an implementer of My 66000 would have
the luxury of putting more efforts into making MM "optimal" than Intel put
into making `rep movsb`?

In one place we worked, there was a life-sized plastic turtle (1-ft
and 2 pounds). And the engineer who made the least amount of forward
progress every week was assigned the turtle at the corner of his/her
cubicle.

One place I worked, we serialized checkins using a rubber chicken,
which was hung on the cube wall.

We found this "motivating"

We found the checkin chicken humorous.

From Scott Lurndal@21:1/5 to Michael S on Thu Mar 13 23:27:16 2025

Michael S writes:

On Thu, 13 Mar 2025 21:42:25 GMT
Scott Lurndal wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to a few hundred bytes it would be slower. For a few thousand bytes
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,

Most systems I work with have an 'allocate' attribute on
inbound DMA operations that will allocate in a specified
cache level (typically LLC).

Most DMAs are far more than a hundred bytes, and the application
can be doing something else while the DMA is in process.

so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.

Only by incompetent programmers.

From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 14 00:19:41 2025

On Thu, 13 Mar 2025 23:25:23 +0000, Scott Lurndal wrote:

Stefan Monnier writes:

Scott Lurndal [2025-03-13 21:42:25] wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Nobody said it was a good idea.

I can envision an attack strategy using this to "confuse"
someone in the higher privilege levels of the "system"

From MitchAlsup1@21:1/5 to Michael S on Fri Mar 14 00:18:18 2025

On Thu, 13 Mar 2025 22:16:19 +0000, Michael S wrote:

On Thu, 13 Mar 2025 21:42:25 GMT
Scott Lurndal wrote:

Michael S writes:

On Thu, 13 Mar 2025 19:35:33 +0000
MitchAlsup1 wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to a few hundred bytes it would be slower. For a few thousand bytes
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
I think, Mitch had something less mundane in mind.

I was just trying to illustrate why optimal REP-MOVS is more difficult
than a SW person might initially guesstimate.

One side might be a byte array down PCIe tree in config space,
while the destination is a line access only. Yeah, just try to
do this optimally.

Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
Yeah, just try doing that with MTRRs and system MMUs in the
way. ...

From Stefan Monnier@21:1/5 to All on Thu Mar 13 21:08:38 2025

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Nobody said it was a good idea.

But if it doesn't happen in normal cases, then it shouldn't be
significant to performance. So is the problem that just detecting the occurrence of this situation is already too costly to make `rep
movsb` fast?

[ Of course, I still haven't understood either why it technically can
happen in amd64 but not in My 66000. ]

Stefan

From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Mar 14 01:30:06 2025

On Fri, 14 Mar 2025 1:08:38 +0000, Stefan Monnier wrote:

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Nobody said it was a good idea.

But if it doesn't happen in normal cases, then it shouldn't be
significant to performance. So is the problem that just detecting the occurrence of this situation is already too costly to make `rep
movsb` fast?

A camel's back is only so strong.

Conjecture that there are 14 different kinds of memory on both
source and destination. So we need a 14×14 check on where we are
every cycle, or every time a boundary could be crossed.

Now, µCode (or HW sequencer) needs to check certain things at
certain boundaries, and switch optimal[DRAM,DRAM] to
optimal[Streaming-store, PCIe-config-space] on a cycle's notice,
while adjusting its memory model from "causal" to strongly ordered.
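
A minimal sketch in C of what such a boundary check involves, under stated assumptions: the three memory kinds, the `mem_range` table, and the helper names `strategy_pairs`, `find_range`, and `uniform_run` are all hypothetical and only illustrate the idea; real microcode or a hardware sequencer would not look like this. With 14 kinds on each side there are 14×14 = 196 (source, destination) combinations, and a copy has to be split wherever either side leaves its current range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory kinds; the post conjectures 14 of them. */
enum mem_kind { MK_WRITEBACK, MK_UNCACHEABLE, MK_WRITE_COMBINING, MK_COUNT };

/* One contiguous address range with a single memory kind (assumed layout). */
struct mem_range {
    uint64_t base;      /* first byte of the range  */
    uint64_t limit;     /* one past the last byte   */
    enum mem_kind kind;
};

/* Number of (source kind, destination kind) pairs a copy engine must handle. */
static unsigned strategy_pairs(unsigned kinds)
{
    return kinds * kinds;
}

/* Find the range containing addr; returns NULL if addr is not covered. */
static const struct mem_range *
find_range(const struct mem_range *tab, size_t n, uint64_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= tab[i].base && addr < tab[i].limit)
            return &tab[i];
    return NULL;
}

/* Bytes that can be copied starting at (src, dst) before either side
 * leaves its current range, capped at len. Returns 0 if either address
 * is not covered by the table. */
static uint64_t
uniform_run(const struct mem_range *tab, size_t n,
            uint64_t src, uint64_t dst, uint64_t len)
{
    assert(tab != NULL);
    const struct mem_range *s = find_range(tab, n, src);
    const struct mem_range *d = find_range(tab, n, dst);
    if (s == NULL || d == NULL)
        return 0;
    uint64_t run = len;
    if (s->limit - src < run)
        run = s->limit - src;
    if (d->limit - dst < run)
        run = d->limit - dst;
    return run;
}
```

A copy loop would call `uniform_run` repeatedly, pick a strategy for the (source kind, destination kind) pair of each run, and advance by the returned length. The sketch says nothing about how cheaply hardware can do the equivalent check, which is the point under debate here.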

[ Of course, I still haven't understood either why it technically can
happen in amd64 but not in My 66000. ]

The cartesian product is smaller, more amenable to buffering and
caching, with more easily discovered (or eliminated) boundaries.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu Mar 13 22:20:16 2025

[ Of course, I still haven't understood either why it technically can
happen in amd64 but not in My 66000. ]

The cartesian product is smaller, more amenable to buffering and
caching, with more easily discovered (or eliminated) boundaries.

These are the parts I can guess. But the part I don't get is what makes
the factors of your cartesian product smaller, what makes your CPU
more amenable to buffering and caching, and what makes those boundaries
easier to discover or eliminate in My 66000 than in amd64.

From what I have gathered so far, the difference in optimizability
between `MM` and `rep movsb` is not due to the semantics of the
instruction, but in the rest of the CPU.

I guess part of my question is: would an `MM` instruction added to, say,
RISC-V or ARM be as easy to optimize as for My 66000 or would it be more
like for the amd64?

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 12:10:23 2025

On Thu, 13 Mar 2025 23:27:16 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:

And they have "So Many" extra burdens, such as when from is
MMI/O space access and to is cache coherent, and all sorts of
other self imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer,

Most systems I work with have an 'allocate' attribute on
inbound DMA operations that will allocate in a specified
cache level (typically LLC).

Most DMAs are far more than a hundred bytes, and the application
can be doing something else while the DMA is in process.

so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.

Only by incompetent programmers.

It has nothing to do with competence of programmers and everything to
do with modern computers having more cores than their users need.
This applies not only to client system but to at least 3/4th of the
servers as well.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to [email protected] on Fri Mar 14 12:06:12 2025

On Fri, 14 Mar 2025 00:18:18 +0000
[email protected] (MitchAlsup1) wrote:

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in
advance, so not really big problem.
I think, Mitch had something less mundane in mind.

I was just trying to illustrate why optimal REP-MOVS is more difficult
than a SW person might initially guesstimate.

This particular SW person thinks that availability of microcode +
microtraps simplifies handling of corner cases correctly in a way that
does not affect the speed of common cases. The only really hard part is
how to reduce startup overhead.

One side might be a byte array down PCIe tree in config space,
while the destination is a line access only. Yeah, just try to
do this optimally.

There is no need to do this particular case optimally.

Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
Yeah, just try doing that with MTRRs and system MMUs in the
way. ...

Outside of graphics drivers for exotic multi-GPU setups, I don't see it
happening, for reasons not related to HW. System software is not
structured in a way that makes it possible.
And multi-GPU setups were all the rage 20 years ago, much less so today.
That is, today people use multiple GPUs more than ever, but they are
used to run independent jobs rather than to co-operate.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Terje Mathisen on Fri Mar 14 14:00:37 2025

On Fri, 14 Mar 2025 12:52:02 +0100
Terje Mathisen <[email protected]> wrote:

Michael S wrote:

On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other
self imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer, so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in
advance, so not really big problem.
I think, Mitch had something less mundane in mind.

Yeah, I read it as some other core modifying the relevant MTRRs in
the middle of the ongoing block move.

MTRRs are not in memory. They are MSRs, each HW thread has its own set.
So, AFAIK, modification by other core/thread is not possible.

The solution seems somewhat obvious, i.e. any modification of an MTRR
which is involved in the move will cause a hw interrupt. Upon
restarting the remainder of the move, the new MTRR rules apply?

The alternative would be to specify that any block move is atomic as
seen from the MTRR rules, i.e. the update(s) only apply after the move
has finished?

Terje

Maybe, for some other complications. For MTRRs I see no need.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Michael S on Fri Mar 14 12:52:02 2025

Michael S wrote:

On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in advance,
so not really big problem.
I think, Mitch had something less mundane in mind.

Yeah, I read it as some other core modifying the relevant MTRRs in the
middle of the ongoing block move.

The solution seems somewhat obvious, i.e. any modification of an MTRR
which is involved in the move will cause a hw interrupt. Upon restarting
the remainder of the move, the new MTRR rules apply?

The alternative would be to specify that any block move is atomic as
seen from the MTRR rules, i.e. the update(s) only apply after the move
has finished?

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Dan Cross@21:1/5 to [email protected] on Fri Mar 14 11:15:45 2025

In article <[email protected]>,
Michael S <[email protected]> wrote:

On Thu, 13 Mar 2025 23:27:16 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

[snip]
For up to few hundreds bytes it would be slower. For few thousands
byte it could be faster at transfer level, but data ends up in the
wrong place in the memory hierarchy, too far away from the ultimate
consumer,

Most systems I work with have an 'allocate' attribute on
inbound DMA operations that will allocate in a specified
cache level (typically LLC).

Most DMAs are far more than a hundred bytes, and the application
can be doing something else while the DMA is in process.

so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just
uselessly wasted in idle loop.

Only by incompetent programmers.

It has nothing to do with competence of programmers and everything to
do with modern computers having more cores than their users need.
This applies not only to client system but to at least 3/4th of the
servers as well.

Define "need" though. Many users are running programs that do
make use of those resources, and giving them up for IO would be
a poor tradeoff. After all, it takes a lot of juice to have
rounded corners on a window where someone's watching a 1080p
video of a cat chasing a paper airplane.

One might argue that that is the thing the user does not need to
do, but that's the user's prerogative, not ours.

- Dan C.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Fri Mar 14 12:25:32 2025

Michael S <[email protected]> writes:

On Fri, 14 Mar 2025 00:18:18 +0000
[email protected] (MitchAlsup1) wrote:

Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).
Yeah, just try doing that with MTRRs and system MMUs in the
way. ...

Outside of graphics drivers for exotic multi-GPU setups, I don't see it
happening for the reasons not related to HW.

For a few years now, there has been some buzz (maybe it's just a marketing
feature) about loading textures etc. directly from the SSD into the
graphics memory. However, the programs that do this know that they
are doing this and don't do it with a synchronous instruction like rep
movsb or MM (I expect that MM is also synchronous).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to Michael S on Fri Mar 14 13:18:37 2025

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 21:42:25 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:

And they have "So Many" extra burdens, such as when from is MMI/O
space access and to is cache coherent, and all sorts of other self
imposed problems.

This case is pretty useful in practice.

Although mostly done with DMA controllers in these modern times
to offload from the CPU.

For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.

The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.

As for the wrong level: The DMA engine transfers the data to the CPU
chip in any case: it contains all caches and the DRAM controller. It
might put the data in, e.g., L3 cache, marked dirty, for later
writeback to DRAM, and if a CPU accesses that memory soon, it will
only see the latency and bandwidth limits of L3.

I have certainly read about a project for high-speed network routing
where the network cards deliver the packets to L3, and the routing
software has to process each packet in an average of 70ns; if the
packets were delivered to DRAM, that speed would be impossible.
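
For a sense of scale on that 70ns figure, here is a small arithmetic sketch in C. The post does not state the link speed or packet size, so the 10 Gbit/s link and minimum-size 64-byte Ethernet frames used in the note after the code are an assumption, chosen because they land close to 70ns; the helper names `frame_time_ps` and `packets_per_second` are made up for illustration.

```c
#include <stdint.h>

/* Per-frame overhead on an Ethernet wire: 7-byte preamble,
 * 1-byte start-of-frame delimiter, 12-byte inter-frame gap. */
enum { WIRE_OVERHEAD_BYTES = 20 };

/* Time one frame occupies the wire, in picoseconds, on a link of
 * link_gbps Gbit/s. One bit at 1 Gbit/s takes 1000 ps. */
static uint64_t frame_time_ps(uint32_t frame_bytes, uint32_t link_gbps)
{
    uint64_t bits = (uint64_t)(frame_bytes + WIRE_OVERHEAD_BYTES) * 8;
    return bits * 1000 / link_gbps;
}

/* Frames per second at line rate (rounded down). */
static uint64_t packets_per_second(uint32_t frame_bytes, uint32_t link_gbps)
{
    return 1000000000000ULL / frame_time_ps(frame_bytes, link_gbps);
}
```

Under those assumptions `frame_time_ps(64, 10)` gives 67200 ps, i.e. about 67ns per packet, or roughly 14.9 million packets per second. A budget of that size leaves room for very few DRAM accesses per packet, which is consistent with the argument above.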

As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.

In any case, that's not what most uses of memcpy() or memmove(), or
rep movsb with their synchronous interfaces are about.
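
To illustrate what "synchronous interface" means here, a minimal sketch in C of a memcpy-style wrapper around `rep movsb`, assuming GCC or Clang inline assembly on x86-64; the name `copy_rep_movsb` is hypothetical, and this is not how glibc or any particular library implements memcpy(). The call does not return until every byte has been copied, unlike an asynchronous DMA request.

```c
#include <stddef.h>
#include <string.h>

/* Forward byte copy with memcpy semantics (buffers must not overlap).
 * Returns dst only after the whole copy has completed. */
static void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && defined(__x86_64__)
    void *d = dst;
    const void *s = src;
    /* rep movsb copies RCX bytes from [RSI] to [RDI], updating all
     * three registers. The calling convention guarantees the direction
     * flag is clear, so the copy runs upward. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(s), "+c"(n)
                     :
                     : "memory");
    return dst;
#else
    /* Fallback so the sketch still compiles on other targets. */
    return memcpy(dst, src, n);
#endif
}
```

A memmove-style variant would additionally have to handle overlapping buffers, for example by copying backward when the destination starts inside the source; that is left out of this sketch.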

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Crossing boundary that way can typically be predicted far in advance,
so not really big problem.

It does not happen in practice, so making it fast or "optimal" by
using a prediction is not necessary.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <[email protected]>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 14:35:13 2025

[email protected] (MitchAlsup1) writes:

On Thu, 13 Mar 2025 23:25:23 +0000, Scott Lurndal wrote:

Stefan Monnier <[email protected]> writes:

Scott Lurndal [2025-03-13 21:42:25] wrote:

Michael S <[email protected]> writes:

On Thu, 13 Mar 2025 19:35:33 +0000
[email protected] (MitchAlsup1) wrote:

Using MTRRs one can switch the kind of memory
to and from point in the middle of a REP MOVs.

How exactly?

The REP MOV straddles the boundary between two MTRRs.

Why/when would this happen in practice?

Nobody said it was a good idea.

I can envision an attack strategy using this to "confuse"
someone in the higher privilege levels of the "system"

Generally the MTRRs control cacheability, not privilege.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 14:38:19 2025

[email protected] (MitchAlsup1) writes:

On Thu, 13 Mar 2025 22:16:19 +0000, Michael S wrote:

Since MM is available in the interconnect protocol, one could
imagine one PCIe device transferring a page to another PCIe
device without the data stream ever touching DRAM (or L3).

That's called peer-to-peer PCI and isn't exactly uncommon
in higher-end systems.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Anton Ertl on Fri Mar 14 16:20:09 2025

On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:

As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.

Transfer level speed would be faster with DMA, because the CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.
OTOH, DMA resides on the device itself and uses as big a transfer unit as
appropriate, up to a maximum of 4 KB.
In theory, "rep movsb" can generate bigger (than 64B) read transfers,
but I don't believe that by now the state of the art is that advanced.
Besides, on all PCI buses, but especially so on PCIe, write transfers
(DMA is doing a Write transfer in this case) utilize the bus
significantly better than read transfers. The difference is most
pronounced for small transfers, but on something like 4-lane PCIe Gen4
the difference can be quite big even when Read transactions use the
maximal transfer size.
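
A rough way to see the effect of transfer unit size, as a minimal sketch in C. The helper names `transactions` and `payload_efficiency_pct` are hypothetical, and the fixed per-transaction overhead passed in is an assumed round number for illustration, not a figure from the PCIe specification; the model also ignores read completions, flow control, and the configured maximum payload size.

```c
#include <stdint.h>

/* Number of transactions needed to move total bytes when each
 * transaction carries at most chunk bytes (chunk must be nonzero). */
static uint64_t transactions(uint64_t total, uint64_t chunk)
{
    return (total + chunk - 1) / chunk;
}

/* Share of bus bytes that is payload, in percent (rounded down), when
 * every transaction carries chunk payload bytes plus overhead bytes of
 * framing and header. */
static unsigned payload_efficiency_pct(uint64_t chunk, uint64_t overhead)
{
    return (unsigned)(100 * chunk / (chunk + overhead));
}
```

Moving 4 KB takes `transactions(4096, 64)` = 64 requests in 64-byte units but a single request at 4 KB. With an assumed 24 bytes of overhead per transaction, `payload_efficiency_pct(64, 24)` is 72 while `payload_efficiency_pct(4096, 24)` is 99.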

In any case, that's not what most uses of memcpy() or memmove(), or
rep movsb with their synchronous interfaces are about.

Agreed.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Anton Ertl on Fri Mar 14 14:44:22 2025

[email protected] (Anton Ertl) writes:

Michael S <[email protected]> writes:

For up to few hundreds bytes it would be slower. For few thousands byte
it could be faster at transfer level, but data ends up in the wrong
place in the memory hierarchy, too far away from the ultimate consumer,
so still slower from the "full job done" perspective.
And CPU time that you "saved" by offload is almost always just uselessly
wasted in idle loop.

The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.

As for the wrong level: The DMA engine transfers the data to the CPU
chip in any case: it contains all caches and the DRAM controller. It
might put the data in, e.g., L3 cache, marked dirty, for later
writeback to DRAM, and if a CPU accesses that memory soon, it will
only see the latency and bandwidth limits of L3.

Indeed. The ARM AXI bus, for example, supports allocation hints.

I have certainly read about a project for high-speed network routing
where the network cards deliver the packets to L3, and the routing
software has to process each packet in an average of 70ns; if the
packets were delivered to DRAM, that speed would be impossible.

BTDT. Further deponent sayeth not.

As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.

Delivering to SRAM will be more efficient and lower latency.

Crossing boundary that way can typically be predicted far in advance,
so not really big problem.

It does not happen in practice, so making it fast or "optimal" by
using a prediction is not necessary.

Indeed. One generally will not access MMIO space with discrete
(e.g. 64-bit) transactions for bulk data movement.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 16:51:21 2025

On Fri, 14 Mar 2025 14:46:05 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:

As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.

Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.

ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.

Which does not contradict my statement above. 64 bytes are not bigger
than 64 bytes.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Michael S on Fri Mar 14 14:46:05 2025

Michael S <[email protected]> writes:

On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:

As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.

Transfer level speed would be faster with DMA, because CPU typically has
no way to issue Read requests for chunks of data that are bigger than 64
bytes.

ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Fri Mar 14 15:39:27 2025

On Fri, 14 Mar 2025 14:51:21 +0000, Michael S wrote:

On Fri, 14 Mar 2025 14:46:05 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:

As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.

Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.

ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.

Which does not contradict my statement above. 64 bytes are not bigger
than 64 bytes.

Just a note:

My 66000 core doing Memmove() MM instruction::
unCacheable DRAM can do page at a time transfers
Cacheable DRAM can do line at a time transfers
over the interconnect.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 17:38:06 2025

[email protected] (MitchAlsup1) writes:

On Fri, 14 Mar 2025 14:51:21 +0000, Michael S wrote:

On Fri, 14 Mar 2025 14:46:05 GMT
[email protected] (Scott Lurndal) wrote:

Michael S <[email protected]> writes:

On Fri, 14 Mar 2025 13:18:37 GMT
[email protected] (Anton Ertl) wrote:

As for the "transfer level speed", I would not know why delivering
to DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as
slow as the other variants.

Transfer level speed would be faster with DMA, because CPU typically
has no way to issue Read requests for chunks of data that are bigger
than 64 bytes.

ARM has load and store instructions that load/store 64 byte chunks
of data. These are primarily aimed at accelerators used to offload
certain computations.

Which does not contradict my statement above. 64 bytes are not bigger
than 64 bytes.

Just a note:

My 66000 core doing Memmove() MM instruction::
unCacheable DRAM can do page at a time transfers
Cacheable DRAM can do line at a time transfers
over the interconnect.

The ARM instructions noted above are single-copy-atomic, by design.

ARM chose 64 to match the cache-line size - the loads and stores must
be cache-line aligned.

For networking applications, 128-byte cache lines are more suitable
to support the smallest IP packets (64-bytes + ethernet + IP + TCP headers).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Fri Mar 14 16:37:08 2025

Anton Ertl [2025-03-14 13:18:37] wrote:

The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.

Also, is the data transfer separate from the "disk" access? I'd expect
that the NVMe interface lets the CPU say "read block B and DMA it to
DRAM at address X" (after which we get an interrupt), so there is no
opportunity for a `rep movsb` or `MM` instruction to do part of the job.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Mar 14 22:23:03 2025

On Fri, 14 Mar 2025 20:37:08 +0000, Stefan Monnier wrote:

Anton Ertl [2025-03-14 13:18:37] wrote:

The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVME SSD. AFAIK this is
usually done in larger block sizes, because of the overhead of setting
up the DMA, and is usually done in an asynchronous way.

Also, is the data transfer separate from the "disk" access? I'd expect
that the NVMe interface lets the CPU say "read block B and DMA it to
DRAM at address X" (after which we get an interrupt), so there is no
opportunity for a `rep movsb` or `MM` instruction to do part of the job.

Correct, REP-MOVS or MM is simply the DMA device that happens to reside
in a core rather than down the PCIe tree on a device. $3.00 devices have
them, why not $3,000.00 CPU chips ??

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to Lurndal on Sun Mar 16 14:27:00 2025

In article <nOJAP.403742$[email protected]>, [email protected] (Scott Lurndal) wrote:

One place I worked, we serialized checkins using a rubber chicken,
which was hung on the cube wall.

The VAX my employers started with, well before my time, could handle
everyone editing and checking in/out, but could handle a maximum of two
instances of the product in its test harness running simultaneously.
There were two small flags, one of which you had to have if you launched
the test harness.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
