• Re: rep movsb vs. simpler instructions for memcpy/memmove

    From MitchAlsup1@21:1/5 to Anton Ertl on Wed Mar 12 17:44:11 2025
    On Wed, 12 Mar 2025 16:46:36 +0000, Anton Ertl wrote:

    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.

    Imagine that:: almost a page for memmove entry points.

    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.


    - anton

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Mar 13 00:03:51 2025
    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no checking required.

    Presumably interruptible and resumable ...

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 00:49:47 2025
    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
    checking required.

    Presumably interruptible and resumable ...

    Yep; but it is also able to take exceptions.

  • From Lawrence D'Oliveiro@21:1/5 to All on Thu Mar 13 02:34:11 2025
    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
    checking required.

    Presumably interruptible and resumable ...

    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit? And you
    use architectural registers to save/restore the state of an instruction in progress at the time of an interrupt?

  • From EricP@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 01:48:01 2025
    Lawrence D'Oliveiro wrote:
    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
    checking required.
    Presumably interruptible and resumable ...
    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit? And you use architectural registers to save/restore the state of an instruction in progress at the time of an interrupt?

    A safe buffer move doesn't need a FPD flag (VAX) or direction (x86)
    as long as (a) you don't specify the order bytes are actually moved and
    (b) you only specify that at the end the length register will be 0
    and the buffer address values are unspecified.
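
One way to picture the point above is a small C model of a move "instruction" whose only state is its three registers. This is a sketch under assumptions, not any real ISA: the `move_regs` struct, the `move_step` function, and the `budget` parameter (standing in for "an interrupt arrives after this many bytes") are all hypothetical names. Re-executing the step with whatever the registers hold simply continues the move, so no separate "first part done" flag is needed; at the end the length is 0 and the pointer values depend on which order the model happened to choose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register view: the only state the "instruction" owns. */
typedef struct {
    unsigned char *dst;
    const unsigned char *src;
    size_t len;
} move_regs;

/* Move at most `budget` bytes, then return as if interrupted.
 * Returns 1 once the whole move is complete (len == 0), else 0.
 * The copy order is an internal choice; callers may rely only on len. */
static int move_step(move_regs *r, size_t budget)
{
    size_t n = r->len < budget ? r->len : budget;

    if ((uintptr_t)r->dst <= (uintptr_t)r->src) {
        /* Destination below source: copy low to high, advance both pointers. */
        for (size_t i = 0; i < n; i++)
            r->dst[i] = r->src[i];
        r->dst += n;
        r->src += n;
    } else {
        /* Destination above source: copy the tail first.
         * The pointers stay put; only the length shrinks. */
        for (size_t i = 0; i < n; i++)
            r->dst[r->len - 1 - i] = r->src[r->len - 1 - i];
    }
    r->len -= n;
    return r->len == 0;
}
```

The direction test gives the same answer on every resume (both pointers advance together in one case, neither moves in the other), which is why the model can be interrupted at any byte and restarted from the registers alone.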

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 09:40:09 2025
    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.
    Imagine that:: almost a page for memmove entry points.
    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.

    What is different about MM compared to `rep movsb` that you can
    confidently state that it will always be optimal?


    Stefan

  • From Michael S@21:1/5 to Stefan Monnier on Thu Mar 13 16:23:47 2025
    On Thu, 13 Mar 2025 09:40:09 -0400
    Stefan Monnier <[email protected]> wrote:

    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.
    Imagine that:: almost a page for memmove entry points.
    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.

    What is different about MM compared to `rep movsb` that you can
    confidently state that it will always be optimal?


    Stefan

    Paper is different from silicon. Far superior.

  • From Michael S@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 16:29:54 2025
    On Thu, 13 Mar 2025 02:34:11 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.

    Presumably interruptible and resumable ...

    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit? And
    you use architectural registers to save/restore the state of an
    instruction in progress at the time of an interrupt?

    According to my understanding, no and no.
    Mitch has an instruction that saves the architectural+microarchitectural
    context in memory, and any interrupt or exception has to use it.
    The architectural part of the saved buffer is documented; the
    microarchitectural part, apart from its size, not so much.
    That is, according to my understanding. Take it with whatever amount of
    salt you find appropriate.

  • From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Mar 13 16:10:46 2025
    On Thu, 13 Mar 2025 2:34:11 +0000, Lawrence D'Oliveiro wrote:

    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
    checking required.

    Presumably interruptible and resumable ...

    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit?

    In effect, yes; however it is a valid bit on a "remaining count"
    control register.

    And you
    use architectural registers to save/restore the state of an instruction
    in progress at the time of an interrupt?

    Technically, registers are saved and reloaded as if the RF was
    4 lines of write back cache--5 lines if you count the thread header.

  • From MitchAlsup1@21:1/5 to Michael S on Thu Mar 13 16:19:53 2025
    On Thu, 13 Mar 2025 14:29:54 +0000, Michael S wrote:

    On Thu, 13 Mar 2025 02:34:11 -0000 (UTC)
    Lawrence D'Oliveiro <[email protected]d> wrote:

    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.

    Presumably interruptible and resumable ...

    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit? And
    you use architectural registers to save/restore the state of an
    instruction in progress at the time of an interrupt?

    According to my understanding, no and no.
    Mitch has instruction

    s/instruction/hardware means/

    that saves architectural+microarchitectural
    context in memory and any interrupt or exception has to use it.

    Technically, you don't have to use it; it happens automatically.
    In one instant you are executing thread[k] in core[j], the next
    instant you are executing thread[m] in core[j] without SW overhead.
    Thread[k] and thread[m] are not related and share no state.

    Architectural part of saved buffer is documented. Microarchitectural
    part, apart from its size, not so much.
    That is, according to my understanding. Take it with amount of salt you
    find appropriate.

  • From MitchAlsup1@21:1/5 to Robert Finch on Thu Mar 13 16:15:54 2025
    On Thu, 13 Mar 2025 9:10:36 +0000, Robert Finch wrote:

    On 2025-03-13 1:48 a.m., EricP wrote:
    Lawrence D'Oliveiro wrote:
    On Thu, 13 Mar 2025 00:49:47 +0000, MitchAlsup1 wrote:

    On Thu, 13 Mar 2025 0:03:51 +0000, Lawrence D'Oliveiro wrote:

    On Wed, 12 Mar 2025 17:44:11 +0000, MitchAlsup1 wrote:

    My 66000 has MM memmove as an instruction (4-bytes) always optimal, no
    checking required.
    Presumably interruptible and resumable ...
    Yep; but also include able to take exceptions.

    So you have a VAX-style “first part done” processor status bit? And
    you use architectural registers to save/restore the state of an
    instruction in progress at the time of an interrupt?

    A safe buffer move doesn't need a FPD flag (VAX) or direction (x86)
    as long as (a) you don't specify the order bytes are actually moved and
    (b) you only specify that at the end the length register will be 0
    and the buffer address values are unspecified.


    If it executes in the background with its own local copy of registers it
    does not need to save state. It might need a means to suspend or cancel
    the operation though.

    MM is NOT MOV (Rt)+,(Rf)+

    MM is MOV [Rt,µi<<width],[Rf,µi<<width]
    ADD µi,µi,width

    Where µi is not visible to the thread. Thus Rt and Rf are not modified.

    Since the pointers are not modified, the whole MOV can be bundled up
    and moved (sic) to whatever layer of the cache hierarchy appropriate.
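
As a rough illustration of the description above, here is a C model in which the progress of the move lives in hidden state rather than in the pointer registers. Everything here is an assumption made for the sketch: `mm_ctl` models a hidden "remaining count" with a valid bit, `mm_step` and `budget` model an interruptible execution, the element width is fixed at one byte, and the choice of copy direction for overlapping buffers is this sketch's own, not something the post specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hidden state, not visible to the thread:
 * a remaining count plus a valid bit meaning "a move is in progress". */
typedef struct {
    size_t remaining;
    int valid;
} mm_ctl;

/* Model of one (possibly interrupted) execution of the move.
 * rt, rf and count are passed by value, so the caller's "registers"
 * are never modified. Returns 1 when the move has finished, else 0. */
static int mm_step(unsigned char *rt, const unsigned char *rf,
                   size_t count, mm_ctl *ctl, size_t budget)
{
    if (!ctl->valid) {              /* first execution: start a new move */
        ctl->remaining = count;
        ctl->valid = 1;
    }

    size_t n = ctl->remaining < budget ? ctl->remaining : budget;
    int forward = (uintptr_t)rt <= (uintptr_t)rf;  /* assumed overlap rule */

    for (size_t k = 0; k < n; k++) {
        /* The hidden index is derived from the remaining count. */
        size_t i = forward ? count - ctl->remaining : ctl->remaining - 1;
        rt[i] = rf[i];
        ctl->remaining--;
    }

    if (ctl->remaining == 0) {
        ctl->valid = 0;             /* done: hidden state becomes invalid */
        return 1;
    }
    return 0;
}
```

Because `rt` and `rf` keep their values, code after the move can reuse the same pointers without recomputing them, which is the property the post contrasts with an auto-incrementing `MOV (Rt)+,(Rf)+`.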

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 16:24:03 2025
    On Thu, 13 Mar 2025 13:40:09 +0000, Stefan Monnier wrote:

    So __memmove_ssse3 is no longer that big ("only" 3901 bytes); it's
    still the biggest implementation, but many others are quite a bit
    bigger than the 0x113=275 bytes of my ssememmove.
    Imagine that:: almost a page for memmove entry points.
    My 66000 has MM memmove as an instruction (4-bytes) always
    optimal, no checking required.

    What is different about MM compared to `rep movsb`


    MM does not modify the pointers. MM keeps its current index,
    thus the compiler can use the Rf pointer multiple times.

    that you can
    confidently state that it will always be optimal?

    Compared to the explosion in memmove() subroutine, yes.
    Compared to a device living on My 66000 interconnect, maybe.
    Compared to executing instructions on a core, yes.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 12:43:07 2025
    What is different about MM compared to `rep movsb`
    MM does not modify the pointers. MM keeps its current index,
    thus the compiler can use the Rf pointer multiple times.
    that you can confidently state that it will always be optimal?
    Compared to the explosion in memmove() subroutine, yes.

    Are you suggesting that what prevents Intel from making `rep movsb`
    optimal is the fact that it modifies its pointers?

    I have no experience implementing such an instruction, but I find it odd
    that such a "cosmetic detail" would have such a profound impact on the
    performance of an instruction. Can't they just "macroexpand" it during
    decoding into two instructions (one which copies the bytes without
    modifying the pointers, and then one which just adjusts the pointers)?
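
The two-step split asked about here can be written down as a plain C model. This is only a sketch of the idea, not a claim about how any decoder works: the function names are hypothetical, the "registers" are ordinary variables, and the direction flag is assumed clear (forward copy).

```c
#include <assert.h>
#include <stddef.h>

/* Step 1: forward byte copy that leaves its operands untouched. */
static void copy_bytes_forward(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Step 2: leave the registers as a completed forward `rep movsb` would:
 * both pointers advanced by the count, count register zero. */
static void adjust_pointers(unsigned char **rdi, const unsigned char **rsi,
                            size_t *rcx)
{
    *rdi += *rcx;
    *rsi += *rcx;
    *rcx = 0;
}

/* The "macroexpanded" pair, run back to back. */
static void rep_movsb_model(unsigned char **rdi, const unsigned char **rsi,
                            size_t *rcx)
{
    copy_bytes_forward(*rdi, *rsi, *rcx);
    adjust_pointers(rdi, rsi, rcx);
}
```

What the model leaves out is the hard part: if the copy is interrupted or faults partway, the architectural registers must already reflect the partial progress, so the adjustment cannot simply wait until the end.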


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 19:35:33 2025
    On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:

    What is different about MM compared to `rep movsb`
    MM does not modify the pointers. MM keeps its current index,
    thus the compiler can use the Rf pointer multiple times.
    that you can confidently state that it will always be optimal?
    Compared to the explosion in memmove() subroutine, yes.

    Are you suggesting that what prevents Intel to make `rep movsb` optimal
    is the fact that it modifies its pointers?

    Certainly does not help.

    But they never really "tried all that hard" to make them continuously
    Optimal.

    And they have "So Many" extra burdens, such as when the source is an
    MMI/O space access and the destination is cache coherent, and all sorts
    of other self-imposed problems. Using MTRRs, one can switch the kind of
    memory that "to" and "from" point at in the middle of a REP MOVS. All of
    which does nothing to make optimality easier.

    So, at a certain point in time, designers punt. If all competing
    parties punt, nobody is put asunder.

    I have no experience implementing such an instruction, but I find it odd
    that such a "cosmetic detail" would have such a profound impact on the
    performance of an instruction. Can't they just "macroexpand" it during
    decoding into two instructions (one which copies the bytes without
    modifying the pointers, and then one which just adjusts the pointers)?

    My 66000 happens to know that memory space changes will not happen
    in the middle of these kinds of things (including vectorized Loops).
    My compilers don't create such problems for HW to solve. {That is;
    the truly horrific x86 optimality problems don't exist.}

    You may choose differently.

    Stefan

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 15:53:25 2025
    MitchAlsup1 [2025-03-13 19:35:33] wrote:
    [...]
    On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
    What is different about MM compared to `rep movsb`
    [...]
    But they never really "tried all that hard" to make them
    continuously Optimal.

    But is there a reason to presume an implementer of My 66000 would have
    the luxury of putting more efforts into making MM "optimal" than Intel put
    into making `rep movsb`?

    And they have "So Many" extra burdens,

    Ah, now you seem to be getting to the kind of answer I was looking for.

    such as when from is MMI/O space access and to is cache coherent, and
    all sorts of other self imposed problems. Using MTRRs one can switch
    the kind of memory to and from point in the middle of a REP MOVs.
    All of which do nothing to make optimality easier.

    How does MM avoid those complexities?

    My 66000 happens to know that memory space changes will not happen
    in the middle of these kinds of things (including vectorized Loops).

    How does it know? Is it because the ISA just says "don't do that" (I
    guess MM would then signal an error if it happens?), or is there some
    underlying difference in the way the semantics/cachability of memory
    pages is specified which makes it impossible to specify a memory range
    to MM where the semantics change partway?

    My compilers don't create such problems for HW to solve. {That is;
    the truly horrific x86 optimality problems don't exist.}

    How do compilers get into the picture? I thought they were basically
    ignorant of such subtleties of memory caching, as controlled by MTRRs.


    Stefan

  • From Michael S@21:1/5 to [email protected] on Thu Mar 13 22:55:16 2025
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    All of which do nothing to make optimality easier.

    So, at a certain point in time, designers punt. If all competing
    parties punt, nobody is put asunder.

    I have no experience implementing such an instruction, but I find
    it odd that such a "cosmetic detail" would have such an profound
    impact on the performance of an instruction. Can't they just
    "macroexpand" it during decoding into two instructions (one which
    copies the bytes without modifying the pointers, and then one which
    just adjusts the pointers)?

    My 66000 happens to know that memory space changes will not happen
    in the middle of these kinds of things (including vectorized Loops).
    My compilers don't create such problems for HW to solve. {That is;
    the truly horrific x86 optimality problems don't exist.}


    How high are you aiming?
    How many bytes per clock when source and destination do not overlap
    and both reside in L1D$ ?
    How many bytes when one side in L1D$ and another in L2$?

    If the answer is less than 50 in the first case and less than 30 in the
    2nd case then you are aiming uninterestingly low.

    You may choose differently.

    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Mar 13 20:59:26 2025
    On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-13 19:35:33] wrote:
    [...]
    On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
    What is different about MM compared to `rep movsb`
    [...]
    But they never really "tried all that hard" to make them
    continuously Optimal.

    But is there a reason to presume an implementer of My 66000 would have
    the luxury of putting more efforts into making MM "optimal" than Intel
    put into making `rep movsb`?

    In one place we worked, there was a life-sized plastic turtle (1 ft
    and 2 pounds). The engineer who made the least amount of forward
    progress every week was assigned the turtle at the corner of his/her
    cubicle.

    We found this "motivating"

    And then there would be "me" assessing their accomplishments and
    making fast MM and MS a priority goal to "him".

    And they have "So Many" extra burdens,

    Ah, now you seem to be getting to the kind of answer I was looking for.

    such as when from is MMI/O space access and to is cache coherent, and
    all sorts of other self imposed problems. Using MTRRs one can switch
    the kind of memory to and from point in the middle of a REP MOVs.
    All of which do nothing to make optimality easier.

    How does MM avoid those complexities?

    Compiler only produces MM and MS where the memory is known to be
    contiguous, and My 66000 universal address space is 64-bits in width
    for each kind of address space, so no MM can cross such a boundary
    (unless the GuestOS is aiming a gun at its feet mucking with the PTEs).

    My 66000 happens to know that memory space changes will not happen
    in the middle of these kinds of things (including vectorized Loops).

    How does it know?

    4 × 64-bit PASs
    1 × 64-bit VAS

    And how the compiler selects using MM or MS.
    So, the compiler and Guest OS both have to make different mistakes that
    are then not caught by Hypervisor translation tables.

    Is it because the ISA just says "don't do that" (I
    guess MM would then signal an error if it happens?), or is there some
    underlying difference to the way the semantics/cachability of memory
    pages is specified which makes it impossible to specify a memory range
    to MM where the semantics changes partways?

    My compilers don't create such problems for HW to solve. {That is;
    the truly horrific x86 optimality problems don't exist.}

    How do compilers get into the picture? I thought they were basically
    ignorant of such subtleties of memory caching, as controlled by MTRRs.

    The compiler uses MM for copying one chunk of virtually contiguous
    memory to another chunk of vcm.

    Compiler would not do this if there is any non-contiguousness. So, from
    VAS, the access is well defined and compactly described.

    During the performance of MM, a change in address space can fault
    the performance allowing somebody more privileged to investigate.



    Stefan

  • From Scott Lurndal@21:1/5 to Michael S on Thu Mar 13 21:42:25 2025
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 17:44:53 2025
    MitchAlsup1 [2025-03-13 20:59:26] wrote:
    On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:
    How does MM avoid those complexities?

    Compiler only produces MM and MS where the memory is known to be
    contiguous, and My 66000 universal address space is 64-bits in width
    for each kind of address space, so no MM can cross such a boundary
    (unless the GuestOS is aiming a gun at its feet mucking with the
    PTEs).

    What is `MS`?
    Isn't it also the case in `rep movsb`?
    At least assuming an OS like Linux?

    My 66000 happens to know that memory space changes will not happen
    in the middle of these kinds of things (including vectorized Loops).
    How does it know?
    4 × 64-bit PASs
    1 × 64-bit VAS

    Sorry, I think you went too fast, you lost me here. I'm just a poor
    compiler guy with a side-interest in computer architecture.

    Presumably the needs that MTRRs satisfy can also be satisfied in My
    66000, so I guess what I'm missing here is how My 66000's solution is
    different from the amd64/i386 one and how that ends up providing MM with
    a guarantee that it doesn't need to care?

    It seems to me, there might still be cases where a My 66000 system might
    want to copy bytes between a network card buffer and DRAM, so while
    I don't expect the cachability of either source or destination to change
    in the middle of an MM operation (and I similarly would be fine with
    a `rep movsb` that becomes slow if this ever happens), I do expect MM
    operations to transfer data between areas that don't have the
    same cachability.

    The compiler uses MM for copying one chunk of virtually contiguous
    memory to another chunk of vcm.
    Compiler would not do this if there is any non-contiguousness. So, from
    VAS, the access is well defined and compactly described.

    In which way does this not also apply to `rep movsb`?

    During the performance of MM, a change in address space can fault
    the performance allowing somebody more privileged to investigate.

    What do you mean by "change in address space"?


    Stefan

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 17:48:48 2025
    Scott Lurndal [2025-03-13 21:42:25] wrote:
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:
    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.
    How exactly?
    The REP MOV straddles the boundary between two MTRRs.

    Why/when would this happen in practice?


    Stefan

  • From MitchAlsup1@21:1/5 to Michael S on Thu Mar 13 22:07:08 2025
    On Thu, 13 Mar 2025 21:06:02 +0000, Michael S wrote:

    On Wed, 12 Mar 2025 16:46:36 GMT
    [email protected] (Anton Ertl) wrote:
    -------------

    Idiots from corporate IT blocked http://al.howardknight.net/

    I feel with you. In my workplace, Usenet is blocked (probably
    unintentionally). I have to post from home.

    So, link to google groups

    Sorry, I cannot provide that service. Trying to access
    groups.google.com tells me:

    |Couldn’t sign you in
    |
    |The browser you’re using doesn’t support JavaScript, or has JavaScript
    |turned off.
    |
    |To keep your Google Account secure, try signing in on a browser that
    |has JavaScript turned on.

    I certainly won't turn on JavaScript for Google, and apparently Google
    wants me to log in to a Google account to access groups.google.com. I
    don't have a Google account and I don't want one.


    For me it works fine without login. But not without JS.
    For those who are willing to use JS, the link: https://groups.google.com/g/comp.arch/c/ULvFgEM_ZSY/m/ysPySToGAwAJ

    Prior to the attack 9 months ago:: Google Groups was happy
    to use my AOL.email.address. I have not tried recently.

  • From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 00:16:19 2025
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to a few hundred bytes it would be slower. For a few thousand
    bytes it could be faster at the transfer level, but the data ends up in
    the wrong place in the memory hierarchy, too far away from the ultimate
    consumer, so it is still slower from the "full job done" perspective.
    And the CPU time that you "saved" by offloading is almost always just
    uselessly wasted in an idle loop.


    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in advance,
    so not really big problem.
    I think, Mitch had something less mundane in mind.
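
To make "predicted far in advance" concrete, here is a small C sketch that, given a table of address ranges and their memory types, reports how many bytes of a copy can proceed before the memory type changes. The table layout, the `mem_type` values, and the function name are all hypothetical; this is not the real MTRR register format, only an illustration that the crossing point can be computed from the start address and length before any byte is moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory types, for illustration only. */
typedef enum { MT_UNCACHED, MT_WRITE_BACK } mem_type;

/* Hypothetical range descriptor: [base, limit) has one memory type. */
typedef struct {
    uintptr_t base;
    uintptr_t limit;   /* one past the last byte of the range */
    mem_type type;
} mem_range;

/* Number of bytes, starting at addr and capped at len, that stay inside
 * the range containing addr. Returns 0 if addr is in no listed range. */
static size_t run_within_type(const mem_range *tab, size_t n,
                              uintptr_t addr, size_t len)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= tab[i].base && addr < tab[i].limit) {
            size_t room = (size_t)(tab[i].limit - addr);
            return len < room ? len : room;
        }
    }
    return 0;
}
```

A copy loop built on this would move `run_within_type(...)` bytes with the strategy suited to the current type, then look up the next range and continue, so a straddling copy becomes a short sequence of uniform copies.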

  • From Scott Lurndal@21:1/5 to Stefan Monnier on Thu Mar 13 23:25:23 2025
    Stefan Monnier <[email protected]> writes:
    Scott Lurndal [2025-03-13 21:42:25] wrote:
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:
    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.
    How exactly?
    The REP MOV straddles the boundary between two MTRRs.

    Why/when would this happen in practice?

    Nobody said it was a good idea.

  • From Scott Lurndal@21:1/5 to [email protected] on Thu Mar 13 23:24:03 2025
    [email protected] (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 19:53:25 +0000, Stefan Monnier wrote:

    MitchAlsup1 [2025-03-13 19:35:33] wrote:
    [...]
    On Thu, 13 Mar 2025 16:43:07 +0000, Stefan Monnier wrote:
    What is different about MM compared to `rep movsb`
    [...]
    But they never really "tried all that hard" to make them
    continuously Optimal.

    But is there a reason to presume an implementer of My 66000 would have
    the luxury of putting more efforts into making MM "optimal" than Intel
    put
    into making `rep movsb`?

    In one place we worked, there was a life sized plastic turtle (1-ft
    and 2 pounds). Any the engineer who made the least amount of forward
    progress every week was assigned the turtle at the corner of his/her
    cubicle.

    One place I worked, we serialized checkins using a rubber chicken,
    which was hung on the cube wall.


    We found this "motivating"

    We found the checkin chicken humorous.

  • From Scott Lurndal@21:1/5 to Michael S on Thu Mar 13 23:27:16 2025
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands byte
    it could be faster at transfer level, but data ends up in the wrong
    place in the memory hierarchy, too far away from the ultimate consumer,

    Most systems I work with have an 'allocate' attribute on
    inbound DMA operations that will allocate in a specified
    cache level (typically LLC).

    Most DMAs are far more than a hundred bytes, and the application
    can be doing something else while the DMA is in progress.


    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just uselessly
    wasted in idle loop.

    Only by incompetent programmers.

  • From MitchAlsup1@21:1/5 to Scott Lurndal on Fri Mar 14 00:19:41 2025
    On Thu, 13 Mar 2025 23:25:23 +0000, Scott Lurndal wrote:

    Stefan Monnier <[email protected]> writes:
    Scott Lurndal [2025-03-13 21:42:25] wrote:
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:
    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.
    How exactly?
    The REP MOV straddles the boundary between two MTRRs.

    Why/when would this happen in practice?

    Nobody said it was a good idea.

    I can envision an attack strategy using this to "confuse"
    someone in the higher privilege levels of the "system"

  • From MitchAlsup1@21:1/5 to Michael S on Fri Mar 14 00:18:18 2025
    On Thu, 13 Mar 2025 22:16:19 +0000, Michael S wrote:

    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands byte
    it could be faster at transfer level, but data ends up in the wrong
    place in the memory hierarchy, too far away from the ultimate consumer,
    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.


    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in advance,
    so not really big problem.
    I think, Mitch had something less mundane in mind.

    I was just trying to illustrate why optimal REP-MOVS is more difficult
    than a SW person might initially guestimate.

    One side might be a byte array down PCIe tree in config space,
    while the destination is a line access only. Yeah, just try to
    do this optimally.

    Since MM is available in the interconnect protocol, one could
    imagine one PCIe device transferring a page to another PCIe
    device without the data stream ever touching DRAM (or L3).
    Yeah, just try doing that with MTRRs and system MMUs in the
    way. ...

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 21:08:38 2025
    The REP MOV straddles the boundary between two MTRRs.
    Why/when would this happen in practice?
    Nobody said it was a good idea.

    But if it doesn't happen in normal cases, then it shouldn't be
    significant to performance. So is the problem that just detecting the
    occurrence of this situation is already too costly to make `rep movsb`
    fast?

    [ Of course, I still haven't understood either why it technically can
    happen in amd64 but not in My 66000. ]


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Mar 14 01:30:06 2025
    On Fri, 14 Mar 2025 1:08:38 +0000, Stefan Monnier wrote:

    The REP MOV straddles the boundary between two MTRRs.
    Why/when would this happen in practice?
    Nobody said it was a good idea.

    But if it doesn't happen in normal cases, then it shouldn't be
    significant to performance. So is the problem that just detecting the
    occurrence of this situation is already too costly to make `rep movsb`
    fast?

    A camel's back is only so strong.

    Conjecture that there are 14-different kinds of memory on both
    source and destination. So we need a 14×14 check on where we are
    every cycle, or every time a boundary could be crossed.

    Now, µCode (or HW sequencer) needs to check certain things at
    certain boundaries, and switch optimal[DRAM,DRAM] to
    optimal[Streaming-store, PCIe-config-space] on a cycle's notice,
    while adjusting its memory model from "causal" to strongly ordered.
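
    A minimal sketch of that bookkeeping in C, as a software analogy only:
    the three memory kinds, the strategy names and the region map are
    hypothetical and are not taken from My 66000 or from any x86 microcode.
    It splits one copy into segments over which the (source kind,
    destination kind) pair stays constant and looks up a strategy for each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory kinds; real hardware may distinguish many more. */
enum mem_kind { KIND_CACHEABLE, KIND_UNCACHEABLE, KIND_MMIO, KIND_COUNT };

/* Hypothetical copy strategies a sequencer could switch between. */
enum strategy { STRAT_LINE, STRAT_STREAMING, STRAT_ORDERED_BYTES };

/* One entry per (source kind, destination kind): the cartesian product. */
static const enum strategy table[KIND_COUNT][KIND_COUNT] = {
    /* source cacheable   */ { STRAT_LINE, STRAT_STREAMING, STRAT_ORDERED_BYTES },
    /* source uncacheable */ { STRAT_STREAMING, STRAT_STREAMING, STRAT_ORDERED_BYTES },
    /* source MMIO        */ { STRAT_ORDERED_BYTES, STRAT_ORDERED_BYTES, STRAT_ORDERED_BYTES },
};

/* A region covers offsets from the previous region's end up to `end`
   (exclusive), measured from the start of the copy. */
struct region { size_t end; enum mem_kind kind; };

/* Return the region containing `off`; the last region catches the rest. */
static const struct region *find_region(const struct region *map, size_t n, size_t off)
{
    size_t i = 0;
    assert(n > 0);
    while (i + 1 < n && off >= map[i].end)
        i++;
    return &map[i];
}

/* Split a copy of `len` bytes into segments with a constant kind pair.
   Stores up to `max` strategies in `out`; returns the segment count. */
static size_t plan_copy(const struct region *src, size_t nsrc,
                        const struct region *dst, size_t ndst,
                        size_t len, enum strategy *out, size_t max)
{
    size_t off = 0, segments = 0;
    while (off < len) {
        const struct region *s = find_region(src, nsrc, off);
        const struct region *d = find_region(dst, ndst, off);
        size_t end = len;
        /* The next boundary on either side ends the current segment. */
        if (s->end > off && s->end < end) end = s->end;
        if (d->end > off && d->end < end) end = d->end;
        if (segments < max)
            out[segments] = table[s->kind][d->kind];
        segments++;
        off = end;
    }
    return segments;
}
```

    With a source map that changes from cacheable to MMIO at offset 4096
    and a uniformly cacheable destination, an 8192-byte copy is planned as
    two segments: line copies first, ordered byte accesses after. With 14
    kinds per side the table would have 196 entries instead of 9.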

    [ Of course, I still haven't understood either why it technically can
    happen in amd64 but not in My 66000. ]

    The cartesian product is smaller, more amenable to buffering and
    caching, with more easily discovered (or eliminated) boundaries.


    Stefan

  • From Stefan Monnier@21:1/5 to All on Thu Mar 13 22:20:16 2025
    [ Of course, I still haven't understood either why it technically can
    happen in amd64 but not in My 66000. ]
    The cartesian product is smaller, more amenable to buffering and
    caching, with more easily discovered (or eliminated) boundaries.

    These are the parts I can guess. But the part I don't get is what makes
    the factors of your cartesian product smaller, what makes your CPU
    more amenable to buffering and caching, and what makes those boundaries
    easier to discover or eliminate in My 66000 than in amd64.

    From what I have gathered so far, the difference in optimizability
    between `MM` and `rep movsb` is not due to the semantics of the
    instruction, but in the rest of the CPU.

    I guess part of my question is: would an `MM` instruction added to, say,
    RISC-V or ARM be as easy to optimize as for My 66000 or would it be more
    like for the amd64?


    Stefan

  • From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 12:10:23 2025
    On Thu, 13 Mar 2025 23:27:16 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is
    MMI/O space access and to is cache coherent, and all sorts of
    other self imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands
    byte it could be faster at transfer level, but data ends up in the
    wrong place in the memory hierarchy, too far away from the ultimate consumer,

    Most systems I work with have an 'allocate' attribute on
    inbound DMA operations that will allocate in a specified
    cache level (typically LLC).

    Most DMAs are far more than a hundred bytes, and the application
    can be doing something else while the DMA is in process.


    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just
    uselessly wasted in idle loop.

    Only by incompetent programmers.

    It has nothing to do with competence of programmers and everything to
    do with modern computers having more cores than their users need.
    This applies not only to client system but to at least 3/4th of the
    servers as well.

  • From Michael S@21:1/5 to [email protected] on Fri Mar 14 12:06:12 2025
    On Fri, 14 Mar 2025 00:18:18 +0000
    [email protected] (MitchAlsup1) wrote:



    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in
    advance, so not really big problem.
    I think, Mitch had something less mundane in mind.

    I was just trying to illustrate why optimal REP-MOVS is more difficult
    than a SW person might initially guestimate.


    This particular SW person thinks that availability of microcode +
    microtraps simplifies handling of corner cases correctly in a way that
    does not affect the speed of common cases. The only really hard part is
    how to reduce startup overhead.
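
    A rough software analogy for that split, sketched in C. Everything
    here is an assumption made for illustration: the 4096-byte granule at
    which attributes may change, the names copy_guarded and copy_slow, and
    the idea of testing up front (hardware with microtraps would detect
    the boundary while the fast path is already running). The common case
    pays for two cheap range checks; only a copy that crosses a granule
    boundary reaches the piecewise handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed granularity at which memory attributes can change. */
#define ATTR_GRANULE 4096u

/* Counts how often the rare path ran, so the split can be observed. */
static unsigned long slow_path_calls;

/* Rare path: copy piece by piece, never crossing a granule boundary on
   either side within one piece. Stands in for a microtrap handler. */
static void copy_slow(unsigned char *dst, const unsigned char *src, size_t n)
{
    slow_path_calls++;
    while (n > 0) {
        size_t to_src = ATTR_GRANULE - (size_t)((uintptr_t)src % ATTR_GRANULE);
        size_t to_dst = ATTR_GRANULE - (size_t)((uintptr_t)dst % ATTR_GRANULE);
        size_t step = n;
        if (to_src < step) step = to_src;
        if (to_dst < step) step = to_dst;
        memcpy(dst, src, step);
        dst += step;
        src += step;
        n -= step;
    }
}

/* True if [p, p+n) spans more than one granule. */
static int crosses_granule(const void *p, size_t n)
{
    uintptr_t a = (uintptr_t)p;
    return n != 0 && (a / ATTR_GRANULE) != ((a + n - 1) / ATTR_GRANULE);
}

/* Copy n bytes between non-overlapping buffers. */
static void copy_guarded(void *dst, const void *src, size_t n)
{
    if (!crosses_granule(dst, n) && !crosses_granule(src, n)) {
        memcpy(dst, src, n);        /* common case: no boundary involved */
        return;
    }
    copy_slow(dst, src, n);         /* corner case: handled out of line */
}
```

    The sketch says nothing about startup overhead, which is the part
    named above as really hard.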

    One side might be a byte array down PCIe tree in config space,
    while the destination is a line access only. Yeah, just try to
    do this optimally.


    There is no need to do this particular case optimally.

    Since MM is available in the interconnect protocol, one could
    imagine one PCIe device transferring a page to another PCIe
    device without the data stream ever touching DRAM (or L3).
    Yeah, just try doing that with MTRRs and system MMUs in the
    way. ...

    Outside of graphics drivers for exotic multi-GPU setups, I don't see it
    happening, for reasons not related to HW: system software is not
    structured in a way that makes it possible.
    And multi-GPU setups were all the rage 20 years ago, much less so today.
    That is, today people use multiple GPUs more than ever, but they are
    used to run independent jobs rather than to co-operate.

  • From Michael S@21:1/5 to Terje Mathisen on Fri Mar 14 14:00:37 2025
    On Fri, 14 Mar 2025 12:52:02 +0100
    Terje Mathisen <[email protected]> wrote:

    Michael S wrote:
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other
    self imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands
    byte it could be faster at transfer level, but data ends up in the
    wrong place in the memory hierarchy, too far away from the ultimate consumer, so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just
    uselessly wasted in idle loop.


    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in
    advance, so not really big problem.
    I think, Mitch had something less mundane in mind.

    Yeah, I read it as some other core modifying the relevant MTRRs in
    the middle of the ongoing block move.



    MTRRs are not in memory. They are MSRs, each HW thread has its own set.
    So, AFAIK, modification by other core/thread is not possible.

    The solution seems somewhat obvious, i.e. any modification of an MTRR
    which is involved in the move will cause a hw interrupt. Upon
    restarting the remainder of the move, the new MTRR rules apply?

    The alternative would be to specify that any block move is atomic as
    seen from the MTRR rules, i.e. the update(s) only apply after the move
    has finished?

    Terje


    Maybe for some other complications. For MTRRs I see no need.

  • From Terje Mathisen@21:1/5 to Michael S on Fri Mar 14 12:52:02 2025
    Michael S wrote:
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands byte
    it could be faster at transfer level, but data ends up in the wrong
    place in the memory hierarchy, too far away from the ultimate consumer,
    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.


    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in advance,
    so not really big problem.
    I think, Mitch had something less mundane in mind.

    Yeah, I read it as some other core modifying the relevant MTRRs in the
    middle of the ongoing block move.

    The solution seems somewhat obvious, i.e. any modification of an MTRR
    which is involved in the move will cause a hw interrupt. Upon restarting
    the remainder of the move, the new MTRR rules apply?

    The alternative would be to specify that any block move is atomic as
    seen from the MTRR rules, i.e. the update(s) only apply after the move
    has finished?

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Dan Cross@21:1/5 to [email protected] on Fri Mar 14 11:15:45 2025
    In article <[email protected]>,
    Michael S <[email protected]> wrote:
    On Thu, 13 Mar 2025 23:27:16 GMT
    [email protected] (Scott Lurndal) wrote:
    Michael S <[email protected]> writes:
    [snip]
    For up to few hundreds bytes it would be slower. For few thousands
    byte it could be faster at transfer level, but data ends up in the
    wrong place in the memory hierarchy, too far away from the ultimate
    consumer,

    Most systems I work with have an 'allocate' attribute on
    inbound DMA operations that will allocate in a specified
    cache level (typically LLC).

    Most DMAs are far more than a hundred bytes, and the application
    can be doing something else while the DMA is in process.

    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just
    uselessly wasted in idle loop.

    Only by incompetent programmers.

    It has nothing to do with competence of programmers and everything to
    do with modern computers having more cores than their users need.
    This applies not only to client system but to at least 3/4th of the
    servers as well.

    Define "need" though. Many users are running programs that do
    make use of those resources, and giving them up for IO would be
    a poor tradeoff. After all, it takes a lot of juice to have
    rounded corners on a window where someone's watching a 1080p
    video of a cat chasing a paper airplane.

    One might argue that that is the thing the user does not need to
    do, but that's the user's prerogative, not ours.

    - Dan C.

  • From Anton Ertl@21:1/5 to Michael S on Fri Mar 14 12:25:32 2025
    Michael S <[email protected]> writes:
    On Fri, 14 Mar 2025 00:18:18 +0000
    [email protected] (MitchAlsup1) wrote:
    Since MM is available in the interconnect protocol, one could
    imagine one PCIe device transferring a page to another PCIe
    device without the data stream ever touching DRAM (or L3).
    Yeah, just try doing that with MTRRs and system MMUs in the
    way. ...

    Outside of graphics drivers for exotic multi-GPU setups, I don't see it
    happening for the reasons not related to HW.

    For a few years now, there has been some buzz (maybe it's just a marketing
    feature) about loading textures etc. directly from the SSD into the
    graphics memory. However, the programs that do this know that they
    are doing this and don't do it with a synchronous instruction like rep
    movsb or MM (I expect that MM is also synchronous).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Fri Mar 14 13:18:37 2025
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 21:42:25 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:


    And they have "So Many" extra burdens, such as when from is MMI/O
    space access and to is cache coherent, and all sorts of other self
    imposed problems.

    This case is pretty useful in practice.

    Although mostly done with DMA controllers in these modern times
    to offload from the CPU.


    For up to few hundreds bytes it would be slower. For few thousands byte
    it could be faster at transfer level, but data ends up in the wrong
    place in the memory hierarchy, too far away from the ultimate consumer,
    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.

    The usual case where "from" is memory-mapped I/O and "to" is
    cache-coherent is when loading from an NVME SSD. AFAIK this is
    usually done in larger block sizes, because of the overhead of setting
    up the DMA, and is usually done in an asynchronous way.

    As for the wrong level: The DMA engine transfers the data to the CPU
    chip in any case: it contains all caches and the DRAM controller. It
    might put the data in, e.g., L3 cache, marked dirty, for later
    writeback to DRAM, and if a CPU accesses that memory soon, it will
    only see the latency and bandwidth limits of L3.

    I have certainly read about a project for high-speed network routing
    where the network cards deliver the packets to L3, and the routing
    software has to process each packet in an average of 70ns; if the
    packets were delivered to DRAM, that speed would be impossible.
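
    For scale, a small calculation in C. The 10 Gbit/s line rate and the
    20 bytes of preamble plus inter-frame gap per frame are assumptions
    chosen for illustration; they are not details of that project.

```c
#include <assert.h>

/* Packets per second that fit a fixed time budget per packet. */
static double packets_per_second(double ns_per_packet)
{
    assert(ns_per_packet > 0.0);
    return 1e9 / ns_per_packet;
}

/* Nanoseconds one frame occupies the wire at a given line rate.
   overhead_bytes models preamble plus inter-frame gap (assumed value). */
static double wire_time_ns(unsigned frame_bytes, unsigned overhead_bytes,
                           double gbit_per_s)
{
    assert(gbit_per_s > 0.0);
    /* bits divided by Gbit/s gives nanoseconds */
    return (frame_bytes + overhead_bytes) * 8.0 / gbit_per_s;
}
```

    A 70ns budget corresponds to roughly 14.3 million packets per second.
    Under the stated assumptions, wire_time_ns(64, 20, 10.0) gives 67.2ns
    per minimum-size frame, which is in the same range.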

    As for the "transfer level speed", I would not know why delivering to
    DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as slow
    as the other variants.

    In any case, that's not what most uses of memcpy() or memmove(), or
    rep movsb with their synchronous interfaces are about.

    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.

    How exactly?

    The REP MOV straddles the boundary between two MTRRs.


    Crossing boundary that way can typically be predicted far in advance,
    so not really big problem.

    It does not happen in practice, so making it fast or "optimal" by
    using a prediction is not necessary.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 14:35:13 2025
    [email protected] (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 23:25:23 +0000, Scott Lurndal wrote:

    Stefan Monnier <[email protected]> writes:
    Scott Lurndal [2025-03-13 21:42:25] wrote:
    Michael S <[email protected]> writes:
    On Thu, 13 Mar 2025 19:35:33 +0000
    [email protected] (MitchAlsup1) wrote:
    Using MTRRs one can switch the kind of memory
    to and from point in the middle of a REP MOVs.
    How exactly?
    The REP MOV straddles the boundary between two MTRRs.

    Why/when would this happen in practice?

    Nobody said it was a good idea.

    I can envision an attack strategy using this to "confuse"
    someone in the higher privilege levels of the "system"

    Generally the MTRRs control the cacheability, not the privilege.

  • From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 14:38:19 2025
    [email protected] (MitchAlsup1) writes:
    On Thu, 13 Mar 2025 22:16:19 +0000, Michael S wrote:



    Since MM is available in the interconnect protocol, one could
    imagine one PCIe device transferring a page to another PCIe
    device without the data stream ever touching DRAM (or L3).

    That's called peer-to-peer PCI and isn't exactly uncommon
    in higher-end systems.

  • From Michael S@21:1/5 to Anton Ertl on Fri Mar 14 16:20:09 2025
    On Fri, 14 Mar 2025 13:18:37 GMT
    [email protected] (Anton Ertl) wrote:


    As for the "transfer level speed", I would not know why delivering to
    DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as slow
    as the other variants.


    Transfer level speed would be faster with DMA, because a CPU typically
    has no way to issue Read requests for chunks of data that are bigger
    than 64 bytes.
    OTOH, DMA resides on the device itself and uses as big a transfer unit
    as appropriate, up to a maximum of 4 KB.
    In theory, "rep movsb" can generate bigger (than 64B) read transfers,
    but I don't believe that the state of the art is that advanced yet.
    Besides, on all PCI buses, but especially so on PCIe, write transfers
    (DMA is doing a Write transfer in this case) utilize the bus
    significantly better than read transfers. The difference is most
    pronounced for small transfers, but on something like 4-lane PCIe Gen4
    the difference can be quite big even when Read transactions use the
    maximal transfer size.
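
    The effect of the transfer unit can be put in numbers with a toy
    model in C. The per-request overhead is a free parameter here, and the
    24 bytes used in the note below are a hypothetical figure, not a value
    from the PCIe specification. The model also ignores latency, the
    number of outstanding requests and the splitting of read completions.

```c
#include <assert.h>

/* Requests needed to move total_bytes when one request carries at most
   max_request bytes (rounds up). */
static unsigned long requests_needed(unsigned long total_bytes,
                                     unsigned long max_request)
{
    assert(max_request > 0);
    return (total_bytes + max_request - 1) / max_request;
}

/* Fraction of the bytes on the link that are payload, for a given
   payload size and an assumed fixed overhead per request. */
static double payload_fraction(unsigned long payload_bytes,
                               unsigned long overhead_bytes)
{
    assert(payload_bytes + overhead_bytes > 0);
    return (double)payload_bytes / (double)(payload_bytes + overhead_bytes);
}
```

    Moving 4 KB takes 64 requests at 64 bytes each, but a single request
    at 4 KB. With an assumed 24 bytes of overhead per request, 64-byte
    payloads make up about 73% of the bytes on the link.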

    In any case, that's not what most uses of memcpy() or memmove(), or
    rep movsb with their synchronous interfaces are about.


    Agreed.

  • From Scott Lurndal@21:1/5 to Anton Ertl on Fri Mar 14 14:44:22 2025
    [email protected] (Anton Ertl) writes:
    Michael S <[email protected]> writes:

    For up to few hundreds bytes it would be slower. For few thousands byte
    it could be faster at transfer level, but data ends up in the wrong
    place in the memory hierarchy, too far away from the ultimate consumer,
    so still slower from the "full job done" perspective.
    And CPU time that you "saved" by offload is almost always just uselessly wasted in idle loop.

    The usual case where "from" is memory-mapped I/O and "to" is
    cache-coherent is when loading from an NVME SSD. AFAIK this is
    usually done in larger block sizes, because of the overhead of setting
    up the DMA, and is usually done in an asynchronous way.

    As for the wrong level: The DMA engine transfers the data to the CPU
    chip in any case: it contains all caches and the DRAM controller. It
    might put the data in, e.g., L3 cache, marked dirty, for later
    writeback to DRAM, and if a CPU accesses that memory soon, it will
    only see the latency and bandwidth limits of L3.

    Indeed. The ARM AXI bus, for example, supports allocation hints.


    I have certainly read about a project for high-speed network routing
    where the network cards deliver the packets to L3, and the routing
    software has to process each packet in an average of 70ns; if the
    packets were delivered to DRAM, that speed would be impossible.

    BTDT. Further deponent sayeth not.


    As for the "transfer level speed", I would not know why delivering to
    DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as slow
    as the other variants.

    Delivering to SRAM will be more efficient and lower latency.


    Crossing boundary that way can typically be predicted far in advance,
    so not really big problem.

    It does not happen in practice, so making it fast or "optimal" by
    using a prediction is not necessary.

    Indeed. One generally will not access MMIO space with discrete
    (e.g. 64-bit) transactions for bulk data movement.

  • From Michael S@21:1/5 to Scott Lurndal on Fri Mar 14 16:51:21 2025
    On Fri, 14 Mar 2025 14:46:05 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Fri, 14 Mar 2025 13:18:37 GMT
    [email protected] (Anton Ertl) wrote:


    As for the "transfer level speed", I would not know why delivering
    to DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as
    slow as the other variants.


    Transfer level speed would be faster with DMA, because CPU typically
    has no way to issue Read requests for chunks of data that are bigger
    than 64 bytes.

    ARM has load and store instructions that load/store 64 byte chunks
    of data. These are primarily aimed at accelerators used to offload
    certain computations.


    Which does not contradict my statement above. 64 bytes are not bigger
    than 64 bytes.

  • From Scott Lurndal@21:1/5 to Michael S on Fri Mar 14 14:46:05 2025
    Michael S <[email protected]> writes:
    On Fri, 14 Mar 2025 13:18:37 GMT
    [email protected] (Anton Ertl) wrote:


    As for the "transfer level speed", I would not know why delivering to
    DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as slow
    as the other variants.


    Transfer level speed would be faster with DMA, because CPU typically has
    no way to issue Read requests for chunks of data that are bigger than 64
    bytes.

    ARM has load and store instructions that load/store 64 byte chunks
    of data. These are primarily aimed at accelerators used to offload
    certain computations.

  • From MitchAlsup1@21:1/5 to Michael S on Fri Mar 14 15:39:27 2025
    On Fri, 14 Mar 2025 14:51:21 +0000, Michael S wrote:

    On Fri, 14 Mar 2025 14:46:05 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Fri, 14 Mar 2025 13:18:37 GMT
    [email protected] (Anton Ertl) wrote:


    As for the "transfer level speed", I would not know why delivering
    to DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as
    slow as the other variants.


    Transfer level speed would be faster with DMA, because CPU typically
    has no way to issue Read requests for chunks of data that are bigger
    than 64 bytes.

    ARM has load and store instructions that load/store 64 byte chunks
    of data. These are primarily aimed at accelerators used to offload
    certain computations.


    Which does not contradict my statement above. 64 bytes are not bigger
    than 64 bytes.

    Just a note:

    My 66000 core doing Memmove() MM instruction::
    unCacheable DRAM can do page at a time transfers
    Cacheable DRAM can do line at a time transfers
    over the interconnect.

  • From Scott Lurndal@21:1/5 to [email protected] on Fri Mar 14 17:38:06 2025
    [email protected] (MitchAlsup1) writes:
    On Fri, 14 Mar 2025 14:51:21 +0000, Michael S wrote:

    On Fri, 14 Mar 2025 14:46:05 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Fri, 14 Mar 2025 13:18:37 GMT
    [email protected] (Anton Ertl) wrote:


    As for the "transfer level speed", I would not know why delivering
    to DRAM should be faster than delivering to L3, L2, or L1. On the
    contrary, it seems to me that delivering to DRAM is at least as
    slow as the other variants.


    Transfer level speed would be faster with DMA, because CPU typically
    has no way to issue Read requests for chunks of data that are bigger
    than 64 bytes.

    ARM has load and store instructions that load/store 64 byte chunks
    of data. These are primarily aimed at accelerators used to offload
    certain computations.


    Which does not contradict my statement above. 64 bytes are not bigger
    than 64 bytes.

    Just a note:

    My 66000 core doing Memmove() MM instruction::
    unCacheable DRAM can do page at a time transfers
    Cacheable DRAM can do line at a time transfers
    over the interconnect.

    The ARM instructions noted above are single-copy-atomic, by design.

    ARM chose 64 to match the cache-line size - the loads and stores must
    be cache-line aligned.

    For networking applications, 128-byte cache lines are more suitable
    to support the smallest IP packets (64-bytes + ethernet + IP + TCP headers).
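
    Taking that parenthesis literally, the arithmetic can be sketched in
    C. The header sizes are assumptions for the simplest case: 14-byte
    Ethernet header without a VLAN tag, 20-byte IPv4 and 20-byte TCP
    headers without options, and a packet buffer that starts on a cache
    line boundary.

```c
#include <assert.h>

/* Assumed minimum header sizes in bytes (no VLAN tag, no options). */
enum { ETH_HDR = 14, IPV4_HDR = 20, TCP_HDR = 20 };

/* Bytes in a buffer holding the payload plus the three headers. */
static unsigned packet_bytes(unsigned payload)
{
    return payload + ETH_HDR + IPV4_HDR + TCP_HDR;
}

/* Cache lines touched by a line-aligned buffer of the given size. */
static unsigned lines_needed(unsigned bytes, unsigned line_size)
{
    assert(line_size > 0);
    return (bytes + line_size - 1) / line_size;
}
```

    A 64-byte payload then occupies 118 bytes, which spans two 64-byte
    lines but fits in a single 128-byte line.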

  • From Stefan Monnier@21:1/5 to All on Fri Mar 14 16:37:08 2025
    Anton Ertl [2025-03-14 13:18:37] wrote:
    The usual case where "from" is memory-mapped I/O and "to" is
    cache-coherent is when loading from an NVME SSD. AFAIK this is
    usually done in larger block sizes, because of the overhead of setting
    up the DMA, and is usually done in an asynchronous way.

    Also, is the data transfer separate from the "disk" access? I'd expect
    that the NVMe interface lets the CPU say "read block B and DMA it to
    DRAM at address X" (after which we get an interrupt), so there is no opportunity for a `rep movsb` or `MM` instruction to do part of the job.


    Stefan

  • From MitchAlsup1@21:1/5 to Stefan Monnier on Fri Mar 14 22:23:03 2025
    On Fri, 14 Mar 2025 20:37:08 +0000, Stefan Monnier wrote:

    Anton Ertl [2025-03-14 13:18:37] wrote:
    The usual case where "from" is memory-mapped I/O and "to" is
    cache-coherent is when loading from an NVME SSD. AFAIK this is
    usually done in larger block sizes, because of the overhead of setting
    up the DMA, and is usually done in an asynchronous way.

    Also, is the data transfer separate from the "disk" access? I'd expect
    that the NVMe interface lets the CPU say "read block B and DMA it to
    DRAM at address X" (after which we get an interrupt), so there is no opportunity for a `rep movsb` or `MM` instruction to do part of the job.

    Correct, REP-MOVS or MM is simply the DMA device that happens to reside
    in a core rather than down the PCIe tree on a device. $3.00 devices have
    them, why not $3,000.00 CPU chips ??


    Stefan

  • From John Dallman@21:1/5 to Lurndal on Sun Mar 16 14:27:00 2025
    In article <nOJAP.403742$[email protected]>, [email protected] (Scott Lurndal) wrote:

    One place I worked, we serialized checkins using a rubber chicken,
    which was hung on the cube wall.

    The VAX my employers started with, well before my time, could handle
    everyone editing and checking in/out, but could handle a maximum of two
    instances of the product in its test harness running simultaneously.
    There were two small flags, and you had to hold one of them if you
    launched the test harness.

    John
