• AMD Cache speed funny

    From Vir Campestris@21:1/5 to All on Tue Jan 30 16:36:17 2024
    I've knocked up a little utility program to try to work out some
    performance figures for my CPU.

    It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
    4MB L3 cache
    2MB L2 cache
    384KB L1 cache

    What I do is to xor a location in memory in an array many times.
    The size of the area I xor over is set by a mask on the store index.
    The words in the store are 64 bit.

    A C++ fragment is this. I can post the whole thing if it would help.

    // Calculate a bit mask for the entire store
    Word mask = storeWordCount - 1;

    Stopwatch s;
    s.start();
    while (1) // until break when mask runs out
    {
        for (size_t index = 0; index < storeWordCount; ++index)
        {
            // read and write a word in store.
            Raw[index & mask] ^= index;
        }
        s.lap(mask); // records the current time

        if (mask == 0) break; // Stop if we've run out of mask

        mask >>= 1; // shrink the mask
    }

    As you can see it starts with a large mask (in fact for a whole GB) and
    halves it as it goes around.

    All looks fine at first. I get about 8GB per second with a large mask,
    at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
    gets smaller. No apparent effect when it gets under the L1 cache size.

    But...
    When the mask is very small (3) it slows to 18GB/s. With 1 it halves
    again, and with zero (so it only operates on the same word over and
    over) it's half again. A fifth of the speed with a large block.

    Something odd is happening here when I hammer the same location (32
    bytes and on down) so that it's slower. Yet this ought to be in the L1
    data cache.

    A late thought was to replace that ^= index with something that reads
    the memory only, or that writes it only, instead of doing a
    read-modify-write cycle. That gives me much faster performance with
    writes than reads. And neither read only, nor write only, show this odd
    slow down with small masks.
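
    The read-only and write-only variants were not posted, so the following is only a guess at their shape. It is a minimal sketch that keeps the same masked index as the fragment above; the function names and the use of std::vector are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint64_t;

// Read-only variant: loads are folded into a local accumulator so the
// compiler cannot discard them. The only loop-carried dependence is through
// a register, not through memory.
Word readOnlySweep(const std::vector<Word>& raw, Word mask) {
    assert(mask < raw.size());
    Word acc = 0;
    for (std::size_t index = 0; index < raw.size(); ++index)
        acc ^= raw[index & mask];
    return acc;
}

// Write-only variant: each store overwrites the word without reading it,
// so no store has to wait for an earlier store to the same address.
void writeOnlySweep(std::vector<Word>& raw, Word mask) {
    assert(mask < raw.size());
    for (std::size_t index = 0; index < raw.size(); ++index)
        raw[index & mask] = index;
}
```

    Neither variant has a load that must wait for the previous store to the same word, which is the property the xor version has and these lack.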

    What am I missing?

    Thanks
    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Vir Campestris on Tue Jan 30 19:38:15 2024
    On Tue, 30 Jan 2024 16:36:17 +0000
    Vir Campestris <[email protected]d> wrote:

    I've knocked up a little utility program to try to work out some
    performance figures for my CPU.

    It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
    4MB L3 cache
    2MB L2 cache
    384kb L1 cache


    That's for the whole chip and it includes L1I caches.
    For individual core and excluding L1I the numbers are:
    4MB L3 cache
    512 KB L2 cache
    32 KB L1D cache
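
    To relate the mask values in the benchmark to these per-core figures, here is a small sketch that is not part of the original posts. It assumes 64-bit words (as stated in the question) and takes the 32 KB / 512 KB / 4 MB sizes from the list above; the enum and function names are made up for illustration.

```cpp
#include <cstdint>

// Per-core cache sizes quoted above, in bytes.
constexpr std::uint64_t kL1dBytes = 32ull * 1024;
constexpr std::uint64_t kL2Bytes  = 512ull * 1024;
constexpr std::uint64_t kL3Bytes  = 4ull * 1024 * 1024;

enum class Level { L1D, L2, L3, Dram };

// Bytes touched by the loop for a given mask: mask + 1 words of 8 bytes each.
constexpr std::uint64_t workingSetBytes(std::uint64_t mask) {
    return (mask + 1) * sizeof(std::uint64_t);
}

// Smallest level whose capacity covers the working set. This ignores
// associativity and anything else competing for the cache.
constexpr Level smallestFittingLevel(std::uint64_t bytes) {
    return bytes <= kL1dBytes ? Level::L1D
         : bytes <= kL2Bytes  ? Level::L2
         : bytes <= kL3Bytes  ? Level::L3
         : Level::Dram;
}
```

    For example, workingSetBytes(4095) is 32768, so mask 4095 is the largest mask whose working set still fits in a 32 KB L1D.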


    First, I'd look at the generated asm.
    If the compiler was doing a good job, then at mask <= 4095 (32 KB) you
    should see slightly less than 1 iteration of the loop per cycle, i.e.
    assuming a 4.2 GHz clock, approximately 30 GB/s.
    Since you see less, it's a sign that the compiler did a less than perfect
    job. Try to help it with manual loop unrolling.
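
    As an illustration of the unrolling suggestion, here is a minimal sketch of a four-way hand-unrolled loop next to the original shape. The function names are hypothetical, the unroll factor of four is an arbitrary choice, and the functions are constexpr (C++14 or later) only so that the equivalence of the two versions can be checked at compile time. Whether this actually helps depends on what the compiler already generates.

```cpp
#include <cstddef>
#include <cstdint>

// Reference loop, same shape as the fragment in the original post.
constexpr void xorSweep(std::uint64_t* raw, std::size_t count, std::uint64_t mask) {
    for (std::size_t index = 0; index < count; ++index)
        raw[index & mask] ^= index;
}

// Hand-unrolled by four. Assumes count is a multiple of 4, which holds for
// the power-of-two store sizes used in the thread.
constexpr void xorSweepUnrolled4(std::uint64_t* raw, std::size_t count, std::uint64_t mask) {
    for (std::size_t index = 0; index < count; index += 4) {
        raw[(index + 0) & mask] ^= index + 0;
        raw[(index + 1) & mask] ^= index + 1;
        raw[(index + 2) & mask] ^= index + 2;
        raw[(index + 3) & mask] ^= index + 3;
    }
}

// Self-check helper: runs 64 iterations over a 16-word store (mask must be
// below 16) and folds the final contents into one value for comparison.
constexpr std::uint64_t foldAfterSweep(bool unrolled, std::uint64_t mask) {
    std::uint64_t store[16] = {};
    if (unrolled) xorSweepUnrolled4(store, 64, mask);
    else xorSweep(store, 64, mask);
    std::uint64_t fold = 0;
    for (std::size_t i = 0; i < 16; ++i) fold = fold * 31 + store[i];
    return fold;
}
```

    Both versions perform the same stores in the same order, so unrolling changes only the loop overhead per xor, not the memory dependences.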

    As to the problem with lower performance at very small masks, it's
    expected. The CPU tries to execute loads speculatively out of order, under
    the assumption that they don't alias with preceding stores. So the actual
    loads run a few loop iterations ahead of the stores. We can't say for sure
    how many iterations ahead, but 7 to 10 iterations sounds like a good guess.
    When your mask=7 (32 bytes) then aliasing starts to happen. On old
    primitive CPUs, like Pentium 4, it causes a massive slowdown, because
    those early loads have to be replayed after a rather significant delay
    of about 20 cycles (the length of the pipeline). Your Zen1+ CPU is much
    smarter: it detects that things are no good and stops the wild speculation.
    So you don't see a huge slowdown. But without speculation every load
    starts only after all stores that preceded it in program order were
    either committed into the L1D cache or had their address checked against
    the speculative load address with no aliasing found. Since you see only
    a mild slowdown, it seems that the latter is done rather effectively and
    your CPU is still able to run loads speculatively, but now only 2 or 3
    steps ahead, which is not enough to get the same performance as before.

  • From Anton Ertl@21:1/5 to Vir Campestris on Tue Jan 30 17:20:59 2024
    Vir Campestris <[email protected]d> writes:
    for (size_t index = 0; index < storeWordCount; ++index)
    {
    // read and write a word in store.
    Raw[index & mask] ^= index;
    }
    ...
    When the mask is very small (3) it slows to 18GB/s. With 1 it halves
    again, and with zero (so it only operates on the same word over and
    over) it's half again. A fifth of the size with a large block.

    Something odd is happening here when I hammer the same location (32
    bytes and on down) so that it's slower. Yet this ought to be in the L1
    data cache.

    A late thought was to replace that ^= index with something that reads
    the memory only, or that writes it only, instead of doing a
    read-modify-write cycle. That gives me much faster performance with
    writes than reads. And neither read only, nor write only, show this odd
    slow down with small masks.

    What am I missing?

    When you do

    raw[0] ^= index;

    in every step you read the result of the previous iteration, xor it,
    and store it again. This means that you have one chain of RMW data
    dependences, with one RMW per iteration. On the Zen2 (which your
    3400G has), this requires 8 cycles (see column H of
    <http://www.complang.tuwien.ac.at/anton/memdep/>). With mask=1, you
    get 2 chains, each with one 8-cycle RMW every second iteration, so you
    need 4 cycles per iteration (see my column C). With mask=3, you get 4
    chains and 2 cycles per iteration. Looking at my results, I would
    expect another doubling with mask=7, but maybe your loop is running
    into resource limits at that point (mine does 4 RMWs per iteration).
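
    The arithmetic above can be written down as a small model. This is only a sketch: the 8-cycle RMW latency is the figure quoted above, while the throughput floor and the clock frequency are parameters the caller has to assume, not measurements from the thread. The function names are made up for illustration.

```cpp
#include <cstdint>

// Latency of one dependent read-modify-write, in cycles (figure quoted above).
constexpr double kRmwLatencyCycles = 8.0;

// The loop visits mask + 1 words round-robin, so there are mask + 1
// independent dependence chains.
constexpr std::uint64_t chainCount(std::uint64_t mask) { return mask + 1; }

// Latency-bound cycles per loop iteration, clamped to floorCycles, an
// assumed best case for the loop once it is no longer latency bound.
constexpr double cyclesPerIteration(std::uint64_t mask, double floorCycles) {
    return kRmwLatencyCycles / static_cast<double>(chainCount(mask)) > floorCycles
               ? kRmwLatencyCycles / static_cast<double>(chainCount(mask))
               : floorCycles;
}

// Bandwidth implied by a cycles-per-iteration figure at 8 bytes per
// iteration; clockGHz is an assumed clock frequency.
constexpr double gigabytesPerSecond(double cyclesPerIter, double clockGHz) {
    return 8.0 * clockGHz / cyclesPerIter;
}
```

    With a floor of one cycle, the model gives 8, 4, 2 and 1 cycles per iteration for masks 0, 1, 3 and 7, which is the halving pattern described in the question.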

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to [email protected] on Tue Jan 30 22:37:05 2024
    On Tue, 30 Jan 2024 20:11:42 +0000
    [email protected] (MitchAlsup1) wrote:

    Vir Campestris wrote:

    As you can see it starts with a large mask (in fact for a whole GB)
    and halves it as it goes around.

    All looks fine at first. I get about 8GB per second with a large
    mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
    as the mask gets smaller. No apparent effect when it gets under the
    L1 cache size.

    The execution window is apparently able to absorb the latency of L3
    miss, and stream L3->L1 accesses.


    That sounds unlikely. L3 latency is too big to be covered by the execution
    window. Much more likely they have adequate HW prefetch from L3 to L2
    and maybe (less likely) even to L1D.

    Anton answered the question regarding small masks.

  • From MitchAlsup1@21:1/5 to Vir Campestris on Tue Jan 30 20:11:42 2024
    Vir Campestris wrote:

    As you can see it starts with a large mask (in fact for a whole GB) and halves it as it goes around.

    All looks fine at first. I get about 8GB per second with a large mask,
    at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as the mask
    gets smaller. No apparent effect when it gets under the L1 cache size.

    The execution window is apparently able to absorb the latency of L3 miss,
    and stream L3->L1 accesses.

    Anton answered the question regarding small masks.

  • From Terje Mathisen@21:1/5 to Vir Campestris on Wed Jan 31 07:59:41 2024
    Mitch, Anton and Michael have already answered, I just want to add that
    we have one additional potential factor:

    Rowhammer protection:

    It is possible that the pattern of re-XORing the same or a small number
    of locations over and over could trigger a pattern detector which was
    designed to mitigate against Rowhammer.

    OTOH, this would much more easily be handled with memory range based
    coalescing of write operations in the last level cache, right?

    I.e. for normal (write combining) memory, it would (afaik) be legal to
    delay the actual writes to RAM for a significant time, long enough to
    merge multiple memory writes.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Anton Ertl@21:1/5 to Terje Mathisen on Wed Jan 31 08:17:13 2024
    Terje Mathisen <[email protected]> writes:
    Rowhammer protection:

    It is possible that the pattern of re-XORing the same or a small number
    of locations over and over could trigger a pattern detector which was
    designed to mitigate against Rowhammer.

    I don't think that memory controller designers have actually
    implemented Rowhammer protection: I would expect that the processor
    manufacturers would have bragged about that if they had. They have
    not. And even RAM manufacturers have stopped mentioning anything
    about Rowhammer in their specs. It seems that all hardware
    manufacturers have decided that Rowhammer is something that will just
    disappear from public knowledge (and therefore from what they have to
    deal with) if they just ignore it long enough. It appears that they
    are right.

    They seem to take the same approach wrt Spectre-family attacks. In
    that case, however, new variants appear all the time, so maybe the
    approach won't work here.

    However, in the present case "the same small number of locations" is
    not hammered, because a small number of memory locations fits into the
    cache in the adjacent access pattern that this test uses, and all
    writes will just be to the cache.

    OTOH, this would much more easily be handled with memory range based
    coalescing of write operations in the last level cache, right?

    We have had write-back caches (at the L2 or L1 level, and certainly at
    the LLC level) since the later 486 years.

    I.e. for normal (write combining) memory

    Normal memory is write-back. AFAIK write combining is for stuff like
    graphics card memory.

    it would (afaik) be legal to
    delay the actual writes to RAM for a significant time, long enough to
    merge multiple memory writes.

    And this is what actually happens, through the magic of write-back
    caches.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Michael S@21:1/5 to Terje Mathisen on Wed Jan 31 13:13:53 2024
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. This is what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

  • From Scott Lurndal@21:1/5 to Michael S on Wed Jan 31 15:04:50 2024
    Michael S <[email protected]> writes:
    On Wed, 31 Jan 2024 07:59:41 +0100
    Terje Mathisen <[email protected]> wrote:


    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

    ARMv8 has a control bit that can be set to allow EL0 access
    to the DC system instructions. By default it is a privileged
    instruction. It is up to the operating software to enable
    it for user-mode code.

  • From Anton Ertl@21:1/5 to Michael S on Wed Jan 31 17:17:21 2024
    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. This is what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices. An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol. This
    would turn CLFLUSH into a noop (at least as far as writing to DRAM is
    concerned, the ordering constraints may still be relevant), so there
    is a way to fix this mistake (if it is one).

    However, AFAIK this is insufficient for fixing Rowhammer. Caches have
    relatively limited associativity, up to something like 16-way
    set-associativity, so if you write to the same set 17 times, you are
    guaranteed to miss the cache. With 3 levels of cache you may need 49
    accesses (probably less), but I expect that the resulting DRAM
    accesses to a cache line are still not rare enough that Rowhammer
    cannot happen.
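
    The counting argument above can be sketched as follows. This is an illustration under simplifying assumptions: it uses plain modulo set indexing, whereas real last-level caches often hash or slice the address, and it reads the 49 as 3 levels of 16 ways plus one. The function names and the example geometry in the note below are hypothetical.

```cpp
#include <cstdint>

// Set index under plain modulo indexing (an assumption, see the lead-in).
constexpr std::uint64_t setIndex(std::uint64_t addr, std::uint64_t lineBytes,
                                 std::uint64_t numSets) {
    return (addr / lineBytes) % numSets;
}

// n-th address that maps to the same set as base: step by one full way,
// i.e. lineBytes * numSets bytes.
constexpr std::uint64_t conflictingAddress(std::uint64_t base, std::uint64_t n,
                                           std::uint64_t lineBytes,
                                           std::uint64_t numSets) {
    return base + n * lineBytes * numSets;
}

// Number of distinct same-set lines after which a miss is guaranteed:
// ways + 1 for one level, and the pessimistic ways * levels + 1 bound for
// several levels of equal associativity.
constexpr std::uint64_t accessesToGuaranteeMiss(std::uint64_t ways,
                                                std::uint64_t levels) {
    return ways * levels + 1;
}
```

    With 64-byte lines and 1024 sets (an example geometry, not the 3400G's), conflictingAddress(0x1000, n, 64, 1024) yields addresses 64 KB apart that all land in the same set as 0x1000.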

    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go. With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.
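
    One way to picture the counting idea is the sketch below. It is not taken from the paper or from any real controller: the class name, the threshold parameter and the assumption that row r has physical neighbours r - 1 and r + 1 are all hypothetical, and the last of these is exactly the knowledge a real memory controller may not have.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Counts activations per row and asks for the neighbours to be refreshed
// once a row has been activated `threshold` times within a refresh window.
class ActivationCounter {
public:
    explicit ActivationCounter(std::uint32_t threshold) : threshold_(threshold) {
        assert(threshold > 0);
    }

    // Records one activation of `row`. Returns the rows to refresh, which is
    // empty unless the threshold was just reached. No upper bound check is
    // done on row + 1.
    std::vector<std::uint32_t> activate(std::uint32_t row) {
        std::vector<std::uint32_t> toRefresh;
        if (++counts_[row] >= threshold_) {
            counts_[row] = 0;  // start counting again after the refresh
            if (row > 0) toRefresh.push_back(row - 1);
            toRefresh.push_back(row + 1);
        }
        return toRefresh;
    }

    // To be called at the end of each regular refresh window.
    void resetWindow() { counts_.clear(); }

private:
    std::uint32_t threshold_;
    std::unordered_map<std::uint32_t, std::uint32_t> counts_;
};
```

    A real implementation would need a bounded table rather than an unbounded map, and a threshold derived from the DRAM's measured disturbance limit.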

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup@21:1/5 to Anton Ertl on Wed Jan 31 20:12:15 2024
    Anton Ertl wrote:

    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. This is what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices.

    I have wondered for a while about why device access is not to coherent
    space. If it were so, then no CLFLUSH functionality would be needed; I/O can
    just read/write an address and always get the freshest copy. {{Maybe
    not the device itself, but the PCIe Root could translate from device
    access space to memory access space (coherent).}}

    An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol. This
    would turn CLFLUSH into a noop (at least as far as writing to DRAM is
    concerned, the ordering constraints may still be relevant), so there
    is a way to fix this mistake (if it is one).

    However, AFAIK this is insufficient for fixing Rowhammer.

    If L3 (LLC) is not a processor cache but a great big read/write buffer
    for DRAM, then Rowhammering is significantly harder to accomplish.

    Caches have relatively limited associativity, up to something like
    16-way set-associativity, so if you write to the same set 17 times,
    you are guaranteed to miss the cache. With 3 levels of cache you may
    need 49 accesses (probably less), but I expect that the resulting DRAM
    accesses to a cache line are still not rare enough that Rowhammer
    cannot happen.

    Rowhammer happens when you beat on the same cache line multiple times
    {causing a charge sharing problem on the word lines}. Every time you cause
    the DRAM to precharge (deActivate) you lose the count on how many times
    you have to bang on the same word line to disrupt the stored cells.

    So, the trick is to detect the RowHammering and insert refresh commands.

    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go. With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.

    - anton

  • From Michael S@21:1/5 to Anton Ertl on Wed Jan 31 22:49:15 2024
    On Wed, 31 Jan 2024 17:17:21 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. This is what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices. An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol.

    Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
    and that at that time all of Intel's PCI/AGP root hubs had already been
    fully I/O-coherent for several years, I find your theory unlikely.

    Myself, I don't know the original reason, but I do know a use case
    where CLFLUSH, while not strictly necessary, simplifies things greatly
    - entering deep sleep state in which CPU caches are powered down and
    DRAM put in self-refresh mode.

    Of course, this particular use case does not require *non-privileged*
    CLFLUSH, so obviously Intel had a different reason.


    This
    would turn CLFLUSH into a noop (at least as far as writing to DRAM is
    concerned, the ordering constraints may still be relevant), so there
    is a way to fix this mistake (if it is one).

    However, AFAIK this is insufficient for fixing Rowhammer. Caches have
    relatively limited associativity, up to something like 16-way
    set-associativity, so if you write to the same set 17 times, you are
    guaranteed to miss the cache. With 3 levels of cache you may need 49
    accesses (probably less), but I expect that the resulting DRAM
    accesses to a cache line are still not rare enough that Rowhammer
    cannot happen.


    The original RH required a very high hammering rate that certainly can't
    be achieved by playing with the associativity of the L3 cache.

    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    Today we have yet another variant called RowPress that bypasses TRR
    mitigation more reliably than mult-rate RH. I think this one would be
    practically impossible without CLFLUSH, esp. when the system under attack
    carries out other DRAM accesses in parallel with the attacker's code.


    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go.

    IMHO, all these solutions are pure fantasy, because the memory controller
    does not even know which rows are physically adjacent. POC authors
    typically run lengthy tests in order to figure it out.


    With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.

    - anton


    They cared enough to implement the simplest of proposed solutions - TRR.
    Yes, it was quickly found insufficient, but at least there was a
    demonstration of good intentions.

  • From MitchAlsup@21:1/5 to Michael S on Wed Jan 31 23:22:38 2024
    Michael S wrote:

    On Wed, 31 Jan 2024 17:17:21 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. This is what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. It is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices. An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol.

    Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
    and that at that time all Intel's PCI/AGP root hubs were already fully
    I/O-coherent for several years, I find your theory unlikely.

    Myself, I don't know the original reason, but I do know a use case
    where CLFLUSH, while not strictly necessary, simplifies things greatly
    - entering deep sleep state in which CPU caches are powered down and
    DRAM put in self-refresh mode.

    Of course, this particular use case does not require *non-privileged*
    CLFLUSH, so obviously Intel had a different reason.

    There was no assumption that this could result in a side channel or
    attack vector at the time of its non-privileged inclusion. Afterwards
    there was no reason to make it privileged until 2017, and by then the
    ability to do anything about it had vanished.

    Me, personally, I see this as a violation of the principle that the
    cache is there to reduce memory latency and thereby improve performance.

    This
    would turn CLFLUSH into a noop (at least as far as writing to DRAM is
    concerned, the ordering constraints may still be relevant), so there
    is a way to fix this mistake (if it is one).

    However, AFAIK this is insufficient for fixing Rowhammer. Caches have
    relatively limited associativity, up to something like 16-way
    set-associativity, so if you write to the same set 17 times, you are
    guaranteed to miss the cache. With 3 levels of cache you may need 49
    accesses (probably less), but I expect that the resulting DRAM
    accesses to a cache line are still not rare enough that Rowhammer
    cannot happen.
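
    The associativity arithmetic above (17 accesses for one 16-way level,
    49 for three) can be written down as a small sketch. It assumes plain
    modulo set indexing, which real last-level caches often replace with a
    hashed index, so it is only an illustration. The names CacheLevel,
    setIndex, congruentAddress and accessesToGuaranteeMiss are hypothetical,
    and the geometry used in any call is a made-up example.

```cpp
#include <cstdint>

// Hypothetical description of one cache level (illustrative values only).
struct CacheLevel {
    std::uint64_t lineBytes;  // bytes per cache line
    std::uint64_t numSets;    // number of sets
    unsigned ways;            // associativity
};

// Set index under the simplifying assumption of plain modulo indexing.
constexpr std::uint64_t setIndex(const CacheLevel& c, std::uint64_t addr) {
    return (addr / c.lineBytes) % c.numSets;
}

// Address of the k-th further line that maps to the same set as `base`.
constexpr std::uint64_t congruentAddress(const CacheLevel& c,
                                         std::uint64_t base,
                                         std::uint64_t k) {
    return base + k * c.lineBytes * c.numSets;
}

// Following the reasoning in the post: one more distinct same-set line
// than the total number of ways across all levels forces a miss.
constexpr unsigned accessesToGuaranteeMiss(unsigned ways, unsigned levels) {
    return ways * levels + 1;
}
```

    With 16 ways, accessesToGuaranteeMiss(16, 1) gives 17 and
    accessesToGuaranteeMiss(16, 3) gives 49, the two figures quoted above.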


    Original RH required a very high hammering rate that certainly can't be
    achieved by playing with the associativity of the L3 cache.

    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    The problem here is the fact that DRAMs do not use linear decoders, so
    address X and address X+1 do not necessarily share paired word lines.
    The word lines could be as far as ½ the block away from each other.

    The DRAM decoders are faster and smaller when there is a Gray-like code
    imposed on the logical-address to physical-word-line mapping. This also
    happens in SRAM decoders. Going back and looking at the most used logical
    to physical mapping shows that while X and X+1 can (occasionally) be side
    by side, X, X+1 and X+2 should never be 3 word lines in a row.

    Today we have yet another variant called RowPress that bypasses TRR
    mitigation more reliably than multi-rate RH. I think this one would be
    practically impossible without CLFLUSH, especially when the system under
    attack carries out other DRAM accesses in parallel with the attacker's code.

    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go.

    IMHO, all these solutions are pure fantasy, because the memory controller
    does not even know which rows are physically adjacent.

    Different DIMMs and even different DRAMs on the same DIMM may not
    share that correspondence. {There is a lot of bit line and a little
    word line repair done at the tester.}

    POC authors
    typically run lengthy tests in order to figure it out.


    With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.

    - anton


    They cared enough to implement the simplest of proposed solutions - TRR.
    Yes, it was quickly found insufficient, but at least there was a demonstration of good intentions.

  • From [email protected]@21:1/5 to Michael S on Thu Feb 1 09:39:13 2024
    Michael S <[email protected]> wrote:

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.

    For Arm, with its non-coherent data and instruction caches, you need
    some way to flush dcache to the point of unification in order to make
    instruction changes visible. Also, regardless of icache coherence, when
    using non-volatile memory you need an efficient way to flush dcache to
    the point of persistence. You need that in order to make sure that a
    transaction has been written to a log.

    With the latter, you could restrict dcache flushes to pages with a
    particular non-volatile attribute. I don't think there's anything you
    can do about the former, short of simply making i- and d-cache
    coherent. Which is a good idea, but not everyone does it.

    Andrew.

  • From Michael S@21:1/5 to [email protected] on Thu Feb 1 15:36:46 2024
    On Thu, 01 Feb 2024 09:39:13 +0000
    [email protected] wrote:

    Michael S <[email protected]> wrote:

    By now, it seems obvious that making the CLFLUSH instruction
    non-privileged and pretty much unrestricted by memory range/page
    attributes was a mistake, but that mistake can't be fixed without
    breaking things. Considering that CLFLUSH has existed since the very
    early 2000s, it is understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.

    For Arm, with its non-coherent data and instruction caches, you need
    some way to flush dcache to the point of unification in order to make
    instruction changes visible. Also, regardless of icache coherence,
    when using non-volatile memory you need an efficient way to flush
    dcache to the point of persistence. You need that in order to make
    sure that a transaction has been written to a log.

    With the latter, you could restrict dcache flushes to pages with a
    particular non-volatile attribute. I don't think there's anything you
    can do about the former, short of simply making i- and d-cache
    coherent.

    For the latter, a privileged flush instruction sounds sufficient.

    For the former, ARMv8 appears to have a special instruction (or you can
    call it a special variant of the DC instruction) - Clean by virtual
    address to point of unification (DC CVAU). This instruction alone would
    not make an RH attack much easier. The problem is that whether this
    instruction is privileged is controlled by the same bit as for two much
    more dangerous variations of DC (DC CVAC and DC CIVAC).

    Which is a good idea, but not everyone does it.

    Andrew.

    Neoverse N1 had it. I don't know about the rest of Neoverse series.

  • From EricP@21:1/5 to Anton Ertl on Thu Feb 1 09:05:19 2024
    Anton Ertl wrote:
    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples rely
    on CLFLUSH. That's what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to each
    other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction non-privileged
    and pretty much unrestricted by memory range/page attributes was a
    mistake, but that mistake can't be fixed without breaking things.
    Considering that CLFLUSH has existed since the very early 2000s, it is
    understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices. An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol. This
    would turn CLFLUSH into a noop (at least as far as writing to DRAM is
    concerned, the ordering constraints may still be relevant), so there
    is a way to fix this mistake (if it is one).

    The text in the Intel Vol. 1 Architecture manual indicates they viewed all
    these cache control instructions (PREFETCH, CLFLUSH, and CLFLUSHOPT)
    as part of SSE, for use by graphics applications that want to take
    manual control of their caching and minimize cache pollution.

    Note that the non-temporal move instructions MOVNTxx were also part of
    that SSE bunch and could also be used to force a write to DRAM.

  • From EricP@21:1/5 to Michael S on Thu Feb 1 09:20:24 2024
    Michael S wrote:
    On Wed, 31 Jan 2024 17:17:21 GMT
    [email protected] (Anton Ertl) wrote:

    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC examples
    rely on CLFLUSH. That's what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to
    each other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to the
    same cache line. They are not ordered with respect to executions of
    CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction
    non-privileged and pretty much unrestricted by memory range/page
    attributes was a mistake, but that mistake can't be fixed without
    breaking things. Considering that CLFLUSH has existed since the very
    early 2000s, it is understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.
    Ideally caches are fully transparent microarchitecture, then you don't
    need stuff like CLFLUSH. My guess is that CLFLUSH is there for
    getting DRAM up-to-date for DMA from I/O devices. An alternative
    would be to let the memory controller remember which lines are
    modified, and if the I/O device asks for that line, get the up-to-date
    data from the cache line using the cache-consistency protocol.

    Considering that CLFLUSH was introduced by Intel in year 2000 or 2001
    and that at that time all Intel's PCI/AGP root hubs were already fully
    I/O-coherent for several years, I find your theory unlikely.

    Myself, I don't know the original reason, but I do know a use case
    where CLFLUSH, while not strictly necessary, simplifies things greatly
    - entering deep sleep state in which CPU caches are powered down and
    DRAM put in self-refresh mode.

    CLFLUSH wouldn't be useful for that as it flushes for a virtual address.
    It also allows all sorts of reorderings that you don't want to think about
    during a (possibly emergency) cache sync.

    The privileged WBINVD and WBNOINVD instructions are intended for that.
    It sounds like they basically halt the core for the duration of the
    write back of all modified lines.

  • From Michael S@21:1/5 to EricP on Thu Feb 1 16:30:27 2024
    On Thu, 01 Feb 2024 09:05:19 -0500
    EricP <[email protected]> wrote:

    Anton Ertl wrote:
    Michael S <[email protected]> writes:
    I have very little to add to the very good response by Anton.
    That little addition is: most if not all Rowhammer POC
    examples rely on CLFLUSH. That's what the manual says about it:
    "Executions of the CLFLUSH instruction are ordered with respect to
    each other and with respect to writes, locked read-modify-write
    instructions, fence instructions, and executions of CLFLUSHOPT to
    the same cache line. They are not ordered with respect to
    executions of CLFLUSHOPT to different cache lines."

    By now, it seems obvious that making the CLFLUSH instruction
    non-privileged and pretty much unrestricted by memory range/page
    attributes was a mistake, but that mistake can't be fixed without
    breaking things. Considering that CLFLUSH has existed since the very
    early 2000s, it is understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.

    Ideally caches are fully transparent microarchitecture, then you
    don't need stuff like CLFLUSH. My guess is that CLFLUSH is there
    for getting DRAM up-to-date for DMA from I/O devices. An
    alternative would be to let the memory controller remember which
    lines are modified, and if the I/O device asks for that line, get
    the up-to-date data from the cache line using the cache-consistency
    protocol. This would turn CLFLUSH into a noop (at least as far as
    writing to DRAM is concerned, the ordering constraints may still be
    relevant), so there is a way to fix this mistake (if it is one).

    The text in the Intel Vol. 1 Architecture manual indicates they viewed all
    these cache control instructions (PREFETCH, CLFLUSH, and CLFLUSHOPT)
    as part of SSE, for use by graphics applications that want to take
    manual control of their caching and minimize cache pollution.

    Note that the non-temporal move instructions MOVNTxx were also part of
    that SSE bunch and could also be used to force a write to DRAM.


    According to Wikipedia, CLFLUSH was not introduced with SSE.
    It was introduced together with SSE2, but formally is not part of it.
    CLFLUSHOPT came much, much, much later and was likely related to the
    Optane DIMM aspirations of the late 2010s.

  • From [email protected]@21:1/5 to Michael S on Fri Feb 2 10:20:17 2024
    Michael S <[email protected]> wrote:
    On Thu, 01 Feb 2024 09:39:13 +0000
    [email protected] wrote:

    Michael S <[email protected]> wrote:

    By now, it seems obvious that making the CLFLUSH instruction
    non-privileged and pretty much unrestricted by memory range/page
    attributes was a mistake, but that mistake can't be fixed without
    breaking things. Considering that CLFLUSH has existed since the very
    early 2000s, it is understandable.
    IIRC, ARMv8 made the same mistake a decade later. That is less
    understandable.

    For Arm, with its non-coherent data and instruction caches, you need
    some way to flush dcache to the point of unification in order to make
    instruction changes visible. Also, regardless of icache coherence,
    when using non-volatile memory you need an efficient way to flush
    dcache to the point of persistence. You need that in order to make
    sure that a transaction has been written to a log.

    With the latter, you could restrict dcache flushes to pages with a
    particular non-volatile attribute. I don't think there's anything you
    can do about the former, short of simply making i- and d-cache
    coherent.

    For the latter, a privileged flush instruction sounds sufficient.

    Does it? You're trying for high throughput, and a full system call
    wouldn't help with that. And besides, if userspace can ask kernel to
    do something on its behalf, you haven't added any security by making
    it privileged.

    For the former, ARMv8 appears to have a special instruction (or you can
    call it a special variant of the DC instruction) - Clean by virtual
    address to point of unification (DC CVAU). This instruction alone would
    not make an RH attack much easier. The problem is that whether this
    instruction is privileged is controlled by the same bit as for two much
    more dangerous variations of DC (DC CVAC and DC CIVAC).

    Ah, thanks.

    Andrew.

  • From EricP@21:1/5 to MitchAlsup on Fri Feb 2 12:03:41 2024
    MitchAlsup wrote:
    Anton Ertl wrote:


    Rowhammer happens when you beat on the same cache line multiple times
    {causing a charge sharing problem on the word lines. Every time you cause
    the DRAM to precharge (deActivate) you lose the count on how many times
    you have to bang on the same word line to disrupt the stored cells.

    So, the trick is to detect the RowHammering and insert refresh commands.

    It's not just the immediately physically adjacent rows -
    I think I read that the effect falls off for up to +-3 rows away.

    Also it may be data dependent - 0's bleed into adjacent 1's and 1's
    into 0's.

    And the threshold when it triggers has been changing as DRAMs become more
    dense. In 2014 when this was first encountered it took 139K activations.
    By 2020 that was down to 4.8K.

    So figuring out how much a row has been damaged is complicated,
    and the window for detecting it is getting smaller.

  • From Thomas Koenig@21:1/5 to EricP on Fri Feb 2 18:09:14 2024
    EricP <[email protected]> schrieb:

    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    That would look... interesting.

    How are large OR gates actually constructed? I would assume that an
    eight-input OR gate could look something like

    nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

    which would reduce the number of inputs by a factor of 2^3, so
    seven layers of these OR gates would be needed.

    Wiring would be interesting as well...

  • From EricP@21:1/5 to MitchAlsup on Fri Feb 2 12:15:21 2024
    MitchAlsup wrote:
    Michael S wrote:

    Original RH required very high hammering rate that certainly can't be
    achieved by playing with associativity of L3 cache.

    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    The problem here is the fact that DRAMs do not use linear decoders, so
    address X and address X+1 do not necessarily share paired word lines.
    The word lines could be as far as ½ the block away from each other.

    The DRAM decoders are faster and smaller when there is a Gray-like code
    imposed on the logical-address to physical-word-line mapping. This also
    happens in SRAM decoders. Going back and looking at the most used logical
    to physical mapping shows that while X and X+1 can (occasionally) be side
    by side, X, X+1 and X+2 should never be 3 word lines in a row.

    A 16 Gb DRAM with 8 Kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.

    I was wondering if each row could have a "canary" bit,
    a specially weakened bit that always flips early.
    This would also intrinsically handle the cases of effects
    falling off over the +-3 adjacent rows.

    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

  • From MitchAlsup@21:1/5 to EricP on Fri Feb 2 19:34:25 2024
    EricP wrote:

    MitchAlsup wrote:
    Anton Ertl wrote:


    Rowhammer happens when you beat on the same cache line multiple times
    {causing a charge sharing problem on the word lines. Every time you cause
    the DRAM to precharge (deActivate) you lose the count on how many times
    you have to bang on the same word line to disrupt the stored cells.

    So, the trick is to detect the RowHammering and insert refresh commands.

    It's not just the immediately physically adjacent rows -
    I think I read that the effect falls off for up to +-3 rows away.

    My understanding is that RowHammer has to access the same row multiple
    times to disrupt bits in an adjacent row. This sounds like a charge
    sharing problem. A long time ago we found a problem with one
    manufacturer's SRAM: when the same row was hit >6,000 times, there was
    enough charge sharing that the adjacent dynamic word decoder also fired,
    so we had 2 or 3 word lines active at the same time. We encountered this
    when a LD missed the cache and was sent down through NorthBridge,
    SouthBridge, onto another bus, finally out to the device and back, while
    the CPU was continuing to read the ICache every cycle.

    My limited understanding of RowPress is that you should not keep the Row
    open for more than a page of data transfer (about ¼ of the 7.8µs DDR4
    limit). My bet is that this is a leakage issue on the bit line made
    sensitive by the word line.

    Also it may be data dependent - 0's bleed into adjacent 1's and 1's
    into 0's.

    DRAMs are funny like this. Adjacent bit lines store data differently. Even
    bits store 0 as 0 and 1 as 1 while odd cells store 0 as 1 and 1 as 0. They
    do this so the sense amplifier has a differential to sense: either the
    even cell or the odd cell is asserted on the bit line pair, and the sense
    amp then has a differential to sense. One line goes up a little or down a
    little while the other bit line stays where it is.

    And the threshold when it triggers has been changing as DRAMs become more
    dense. In 2014 when this was first encountered it took 139K activations.
    By 2020 that was down to 4.8K.

    So figuring out how much a row has been damaged is complicated,
    and the window for detecting it is getting smaller.

  • From MitchAlsup@21:1/5 to Thomas Koenig on Fri Feb 2 19:18:12 2024
    Thomas Koenig wrote:

    EricP <[email protected]> schrieb:

    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    That would look... interesting.

    How are large OR gates actually constructed? I would assume that an
    eight-input OR gate could look something like

    nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))

    Close, but NANDs come with 4-inputs and NORs come with 3*, so you get
    a 3×4 = 12:1 reduction per pair of stages.

    2985984->248832->20736->1728->144->12->1

    which would reduce the number of inputs by a factor of 2^3, so
    seven layers of these OR gates would be needed.

    6 not 7

    Wiring would be interesting as well...

    That is why we have 10 layers of metal--oh wait DRAMs don't have that
    much metal.....

    (*) NANDs having 4 inputs while NORs only have 3 is a consequence of
    P-channel transistors having lower transconductance and higher body
    effects, and there are differences between planar transistors and
    finFETs here, too.
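
    The layer counts discussed above (seven at 8:1, six at 12:1) can be
    reproduced with a short sketch. It assumes each "level" is one
    NOR+NAND pair that merges up to fanIn signals into one, and it ignores
    wiring and timing entirely. The function name reductionLevels is a
    hypothetical helper, and constexpr loops need C++14 or later.

```cpp
#include <cstdint>

// Number of reduction levels needed to combine `inputs` signals into one
// when each level merges up to `fanIn` signals (fanIn must be >= 2).
constexpr unsigned reductionLevels(std::uint64_t inputs, std::uint64_t fanIn) {
    unsigned levels = 0;
    while (inputs > 1) {
        inputs = (inputs + fanIn - 1) / fanIn;  // ceiling division
        ++levels;
    }
    return levels;
}
```

    For the 2^21 canary bits, reductionLevels(1ULL << 21, 8) yields 7 and
    reductionLevels(1ULL << 21, 12) yields 6, since 12^6 = 2985984 is the
    first power of 12 that covers 2097152 inputs.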

  • From EricP@21:1/5 to MitchAlsup on Fri Feb 2 17:20:51 2024
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    Anton Ertl wrote:


    Rowhammer happens when you beat on the same cache line multiple times
    {causing a charge sharing problem on the word lines. Every time you cause
    the DRAM to precharge (deActivate) you lose the count on how many times
    you have to bang on the same word line to disrupt the stored cells.

    So, the trick is to detect the RowHammering and insert refresh commands.

    It's not just the immediately physically adjacent rows -
    I think I read that the effect falls off for up to +-3 rows away.

    My understanding is that RowHammer has to access the same row multiple
    times to disrupt bits in an adjacent row. This sounds like a charge
    sharing problem.

    Yes, as I understand it, charge migration.
    I had a nice document on the root cause of Rowhammer but I can't seem to
    find it again. This one is a little heavy on the semiconductor physics:

    On DRAM Rowhammer and the Physics of Insecurity, 2020
    https://ieeexplore.ieee.org/iel7/16/9385809/09366976.pdf

    "Experimental evidence points to two mechanisms for the RH disturb,
    namely cell transistor subthreshold leakage and electron injection
    into the p-well of the DRAM array from the hammered cell transistors
    and their subsequent capture by the storage node (SN) junctions [13].

    Regarding the subthreshold leakage, lower cell transistor threshold
    voltages have been shown to correlate with higher susceptibility to RH.
    This is consistent with crosstalk between the switching aggressor wordline
    and the victim wordlines pulling up the latter sufficiently in the
    potential to drain away some of the victim cell’s stored charge [14], [15].

    Regarding the injected electrons from the hammered cell transistors,
    the blame for these has been placed on two different origins.
    The first describes a collapsing inversion layer associated with the
    hammered cell transistor where a population of electrons is injected
    into the p-well as the transistor’s gate turns off [16]. The second
    describes electron injection from charge traps near the silicon/gate
    dielectric interface of the cell select transistor [13], [17].
    Several studies look into techniques for hampering the migration of
    these injected electrons."

    A long time ago we found a problem with one manufacturer's SRAM: when the
    same row was hit >6,000 times, there was enough charge sharing that the
    adjacent dynamic word decoder also fired, so we had 2 or 3 word lines
    active at the same time. We encountered this when a LD missed the cache
    and was sent down through NorthBridge, SouthBridge, onto another bus,
    finally out to the device and back, while the CPU was continuing to read
    the ICache every cycle.

    I think of this as aging: each activation ages the rows up to some distance
    by amounts depending on the distance due to charge migration.

    Originally it was found by activating rows immediately adjacent to the
    victim but then they looked and found it further out to +-4 rows.
    This effect appears to be called the Rowhammer "blast radius".

    This paper is from 2023 but I'm sure I've seen mention of this effect
    before but not called blast radius.

    BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
    https://www.research-collection.ethz.ch/handle/20.500.11850/617284
    https://dramsec.ethz.ch/papers/blaster.pdf

    "In particular, we show for the first time that BLASTER significantly
    reduces the number of necessary activations to the victim-adjacent
    aggressors using other aggressor rows that are up to four rows away
    from the victim."

    My limited understanding of RowPress is that you should not keep the Row
    open for more than a page of data transfer (about ¼ of the 7.8µs DDR4
    limit). My bet is that this is a leakage issue on the bit line made
    sensitive by the word line.

    Yes, from what I read the factors affecting Rowhammer vulnerability are:

    1) DRAM chip temperature, 2) aggressor row active time,
    and 3) victim DRAM cell’s physical location.

    Also it may be data dependent - 0's bleed into adjacent 1's and 1's
    into 0's.

    DRAMs are funny like this. Adjacent bit lines store data differently. Even
    bits store 0 as 0 and 1 as 1 while odd cells store 0 as 1 and 1 as 0. They
    do this so the sense amplifier has a differential to sense: either the
    even cell or the odd cell is asserted on the bit line pair, and the sense
    amp then has a differential to sense. One line goes up a little or down a
    little while the other bit line stays where it is.

    And the threshold when it triggers has been changing as DRAMs become more
    dense. In 2014 when this was first encountered it took 139K activations.
    By 2020 that was down to 4.8K.

    So figuring out how much a row has been damaged is complicated,
    and the window for detecting it is getting smaller.

  • From EricP@21:1/5 to EricP on Fri Feb 2 18:24:10 2024
    EricP wrote:
    MitchAlsup wrote:

    A long time ago we found a problem with one manufacturer's SRAM: when the
    same row was hit >6,000 times, there was enough charge sharing that the
    adjacent dynamic word decoder also fired, so we had 2 or 3 word lines
    active at the same time. We encountered this when a LD missed the cache
    and was sent down through NorthBridge, SouthBridge, onto another bus,
    finally out to the device and back, while the CPU was continuing to read
    the ICache every cycle.

    I think of this as aging: each activation ages the rows up to some
    distance by amounts depending on the distance due to charge migration.

    Originally it was found by activating rows immediately adjacent to the
    victim but then they looked and found it further out to +-4 rows.
    This effect appears to be called the Rowhammer "blast radius".

    This paper is from 2023 but I'm sure I've seen mention of this effect
    before but not called blast radius.

    BLASTER: Characterizing the Blast Radius of Rowhammer, 2023
    https://www.research-collection.ethz.ch/handle/20.500.11850/617284
    https://dramsec.ethz.ch/papers/blaster.pdf

    "In particular, we show for the first time that BLASTER significantly
    reduces the number of necessary activations to the victim-adjacent
    aggressors using other aggressor rows that are up to four rows away
    from the victim."

    To elaborate a bit, as I understand it this means that if a dram
    has a blast radius of +-3 and we take 7 rows A B C D E F G,
    and assuming the aging factor is linear, then any read or refresh
    of row D resets its age to 0 but ages C&E by 3, B&F by 2, A&G by 1.
    If any row age total hits 15,000 its data dies.

    This is why I thought canary bits might work, because they integrate the
    sum of all adjacent activates while taking blast distance into account.
    As long as the canary _reliably_ dies at age 12,000 and the data at 15,000
    then the dram could transparently refresh the aged-out rows.
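
    The linear aging model described above can be sketched as follows. This
    is only a toy under the stated assumptions (blast radius of +-3, wear
    of 3, 2, 1 by distance, seven rows A..G held as indices 0..6); the names
    activate and activationsUntil are hypothetical, and the code needs
    C++14 or later for the constexpr loops.

```cpp
constexpr int kRows = 7;    // rows A..G as indices 0..6
constexpr int kRadius = 3;  // assumed blast radius of +-3 rows

// Apply one activation (read or refresh) of `row` to the age table.
constexpr void activate(int (&age)[kRows], int row) {
    for (int d = 1; d <= kRadius; ++d) {
        const int wear = kRadius + 1 - d;  // 3, 2, 1 at distance 1, 2, 3
        if (row - d >= 0) age[row - d] += wear;
        if (row + d < kRows) age[row + d] += wear;
    }
    age[row] = 0;  // the activated row itself is restored
}

// Activations of `aggressor` needed until `victim` reaches `limit`.
// Returns -1 for out-of-range rows or a victim outside the blast radius.
constexpr int activationsUntil(int aggressor, int victim, int limit) {
    if (aggressor < 0 || aggressor >= kRows) return -1;
    if (victim < 0 || victim >= kRows) return -1;
    const int dist = aggressor > victim ? aggressor - victim
                                        : victim - aggressor;
    if (dist == 0 || dist > kRadius) return -1;
    int age[kRows] = {};
    int n = 0;
    while (age[victim] < limit) {
        activate(age, aggressor);
        ++n;
    }
    return n;
}
```

    Hammering row D (index 3) with victim C (index 2), the canary limit of
    12,000 is reached after 4,000 activations and the data limit of 15,000
    after 5,000, so under this model the canary leaves a margin of 1,000
    activations in which to refresh the row.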

  • From Anton Ertl@21:1/5 to EricP on Sat Feb 3 09:28:14 2024
    EricP <[email protected]> writes:
    MitchAlsup wrote:
    Michael S wrote:

    Original RH required very high hammering rate that certainly can't be
    achieved by playing with associativity of L3 cache.

    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    The problem here is the fact that DRAMs do not use linear decoders, so
    address X and address X+1 do not necessarily share paired word lines.
    The word lines could be as far as ½ the block away from each other.

    The DRAM decoders are faster and smaller when there is a Gray-like code
    imposed on the logical-address to physical-word-line mapping. This also
    happens in SRAM decoders. Going back and looking at the most used logical
    to physical mapping shows that while X and X+1 can (occasionally) be side
    by side, X, X+1 and X+2 should never be 3 word lines in a row.

    A 16 Gb DRAM with 8 Kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.

    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
    Admittedly, if you just update the counter for a specific row and then
    refresh all rows in the blast radius when a limit is reached, you
    may get many more refreshes than the minimum necessary, but given that
    normal programs usually do not hammer specific row ranges, the
    additional refreshes may still be relatively few in non-attack
    situations (and when being attacked, you prefer lower DRAM performance
    to a successful attack).
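
    The figures above can be checked with a few lines of arithmetic. This is
    a sketch only: the constant and function names are made up for
    illustration, and it counts storage bits alone, not the incrementer logic.

```cpp
#include <cstdint>

constexpr std::uint64_t kChipBits = 16ULL << 30;  // 16 Gb device
constexpr std::uint64_t kRowBits = 8ULL << 10;    // 8 Kb per row
constexpr std::uint64_t kRowCount = kChipBits / kRowBits;

// Storage overhead of one counter per row, as a fraction of the row size.
constexpr double counterOverhead(unsigned counterBits) {
    return static_cast<double>(counterBits) / static_cast<double>(kRowBits);
}

static_assert(kRowCount == (1ULL << 21), "16 Gb / 8 Kb = 2^21 rows");
```

    counterOverhead(16) evaluates to 16/8192, a little under 0.2%.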

    Alternatively, a kind of cache could be used. Keep counts of N most
    recently accessed rows, remove the row on refresh; when accessing a
    row that has not been in the cache, evict the entry for the row with
    the lowest count C, and set the count of the loaded row to C+1. When
    a count (or ensemble of counts) reaches the limit, refresh every row.

    This would take much less memory, but require finding the entry with
    the lowest count. By dividing the cache into sets, this becomes more
    realistic; upon reaching a limit, only the rows in the blast radius of
    the lines in a set need to be refreshed.
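
    A rough software sketch of the count cache described above, assuming a
    single fully associative set. The class and method names (RowCountCache,
    access, onRefresh) are hypothetical; real hardware would search the
    entries in parallel rather than loop over them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

class RowCountCache {
public:
    RowCountCache(std::size_t capacity, std::uint32_t limit)
        : capacity_(capacity), limit_(limit) {
        assert(capacity_ > 0 && limit_ > 0);
    }

    // Record an activation of `row`; returns true once its count reaches the limit.
    bool access(std::uint64_t row) {
        auto it = counts_.find(row);
        if (it != counts_.end()) return ++it->second >= limit_;

        std::uint32_t start = 1;
        if (counts_.size() >= capacity_) {
            // Evict the entry with the lowest count C; the newcomer starts at
            // C + 1 (presumably a conservative bound on its untracked accesses).
            auto victim = std::min_element(
                counts_.begin(), counts_.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            start = victim->second + 1;
            counts_.erase(victim);
        }
        counts_[row] = start;
        return start >= limit_;
    }

    // A refreshed row no longer needs tracking.
    void onRefresh(std::uint64_t row) { counts_.erase(row); }

    // Call after the caller has refreshed every row in response to a limit hit.
    void clear() { counts_.clear(); }

    std::uint32_t count(std::uint64_t row) const {
        auto it = counts_.find(row);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::size_t capacity_;
    std::uint32_t limit_;
    std::unordered_map<std::uint64_t, std::uint32_t> counts_;
};
```

    When access() returns true, the caller is expected to do the refreshing
    and then call clear() (or onRefresh() per row in the set-based variant).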

    I was wondering if each row could have a "canary" bit,
    a specially weakened bit that always flips early.
    This would also intrinsically handle the cases of effects
    falling off over the +-3 adjacent rows.

    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    Yes, doing it in analog has its charms. However, I see the following difficulties:

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    * To flip a bit in one direction, AFAIK the hammering rows have to
    have a specific content. I guess with a blast radius of 4 rows on
    each side, you could have 4 columns. Each row has a canary in one
    of these columns and the three adjacent bits in this column are
    attacker bits that have the value that is useful for effecting a bit
    flip in a canary. Probably a more refined variant of this idea
    would be necessary to deal with diagonal influence and
    the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
    in this thread.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to Michael S on Sat Feb 3 08:42:18 2024
    Michael S <[email protected]> writes:
    On Wed, 31 Jan 2024 17:17:21 GMT
    [email protected] (Anton Ertl) wrote:
    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go.

    IMHO, all these solutions are pure fantasy, because memory controller
    does not even know which rows are physically adjacent. POC authors
    typically run lengthy tests in order to figure it out.

    Given that the attackers can find out, it is just a lack of
    communication between DRAM manufacturers and memory controller
    manufacturers that results in that ignorance. Not a valid excuse.

    There is a standardization committee (JEDEC) that documents how
    various DRAM types are accessed, refreshed etc. They put information
    about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
    the DIMMs, so they can also put information about line adjacency
    there.

    With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.
    ...
    They cared enough to implement the simplest of proposed solutions - TRR.
    Yes, it was quickly found insufficient, but at least there was a
    demonstration of good intentions.

    Yes. However, looking at Table III of <https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
    seems to be significant differences between manufacturers A and D on
    one hand, and B and C on the other, with exploits taking much longer
    for B and C, and failing in some cases.

    One may wonder if the DRAM manufacturers could have put their
    physicists to the task of identifying the conditions under which bit
    flips can occur, and identify the refreshes that are at least
    necessary to prevent these conditions from occurring. If they have not
    done so, or if they have not implemented the resulting recommendations
    (or passed them to the memory controller people), a certain amount of
    blame rests on them.

    Anyway, never mind the blame, looking into the future, I find it
    worrying that I did not find any mention of Rowhammer protection in
    the specs of DIMMs when I last looked.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to MitchAlsup on Sat Feb 3 12:12:03 2024
    MitchAlsup wrote:
    EricP wrote:

    MitchAlsup wrote:
    Anton Ertl wrote:


    Rowhammer happens when you beat on the same cache line multiple times,
    causing a charge sharing problem on the word lines. Every time you cause
    the DRAM to precharge (deActivate) you lose the count on how many times
    you have to bang on the same word line to disrupt the stored cells.

    So, the trick is to detect the RowHammering and insert refresh commands.

    It's not just the immediately physically adjacent rows -
    I think I read that the effect falls off for up to +-3 rows away.

    My understanding is that RowHammer has to access the same row multiple
    times to disrupt bits in an adjacent row. This sounds like a charge
    sharing problem. A long time ago we found a problem with one
    manufacturer's SRAM: when the same row was hit >6,000 times, there was
    enough charge sharing that the adjacent dynamic word decoder also fired,
    so we had 2 or 3 word lines active at the same time. We encountered this
    when a LD missed the cache and was sent down through NorthBridge,
    SouthBridge, onto another bus, finally out to the device and back, while
    the CPU was continuing to read the ICache every cycle.

    My limited understanding of RowPress is that you should not keep the Row
    open for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit).
    My bet is that this is a leakage issue on the bit line made sensitive by
    the word line.

    Ah I see from the RowPress paper that it is different from RowHammer.
    RowHammer is based on activation counts and RowPress on activation time.
    Previous papers had just said that activation time correlated with
    bit flips and I guess everyone just assumed it was the same mechanism.
    But the RowPress paper shows it affects different bits from RowHammer.
    Also RowPress and RowHammer tend to flip in different directions,
    RowHammer flips 0 to 1 and RowPress 1 to 0 (taking the true and anti
    cell logic states into account). Possibly one is doing electron injection
    and the other hole injection.

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Feb 3 17:10:30 2024
    Anton Ertl wrote:

    Michael S <[email protected]> writes:
    On Wed, 31 Jan 2024 17:17:21 GMT
    [email protected] (Anton Ertl) wrote:
    The first paper on Rowhammer already outlined how the memory
    controller could count how often adjacent DRAM rows are accessed and
    thus weaken the row under consideration. This approach needs a little
    adjustment for Double Rowhammer and not immediately neighbouring rows,
    but otherwise seems to me to be the way to go.

    IMHO, all these solutions are pure fantasy, because memory controller
    does not even know which rows are physically adjacent. POC authors
    typically run lengthy tests in order to figure it out.

    Given that the attackers can find out, it is just a lack of
    communication between DRAM manufacturers and memory controller
    manufacturers that results in that ignorance. Not a valid excuse.

    There is a standardization committee (JEDEC) that documents how
    various DRAM types are accessed, refreshed etc. They put information
    about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
    the DIMMs, so they can also put information about line adjacency
    there.

    With autorefresh in
    the DRAM devices these days, the DRAM manufacturers could implement
    this on their own, without needing to coordinate with memory
    controller designers. But apparently they think that the customers
    don't care, so they can save the expense.
    ....
    They cared enough to implement the simplest of proposed solutions - TRR.
    Yes, it was quickly found insufficient, but at least there was a
    demonstration of good intentions.

    Yes. However, looking at Table III of <https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
    seems to be significant differences between manufacturers A and D on
    one hand, and B and C on the other, with exploits taking much longer
    for B and C, and failing in some cases.

    One may wonder if the DRAM manufacturers could have put their
    physicists to the task of identifying the conditions under which bit
    flips can occur, and identify the refreshes that are at least
    necessary to prevent these conditions from occurring. If they have not
    done so, or if they have not implemented the resulting recommendations
    (or passed them to the memory controller people), a certain amount of
    blame rests on them.

    Anyway, never mind the blame, looking into the future, I find it
    worrying that I did not find any mention of Rowhammer protection in
    the specs of DIMMs when I last looked.

    My information is that they (DRAM mfgs) looked and said they could not
    fix a problem that emanated from the DRAM controller.

    - anton

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Feb 3 17:13:23 2024
    Anton Ertl wrote:


    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    Yes, doing it in analog has its charms. However, I see the following difficulties:

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    You know what its value should be and you raise hell when it is not as expected. This may require 2 canary bits.

    * To flip a bit in one direction, AFAIK the hammering rows have to
    have a specific content. I guess with a blast radius of 4 rows on
    each side, you could have 4 columns. Each row has a canary in one
    of these columns and the three adjacent bits in this column are
    attacker bits that have the value that is useful for effecting a bit
    flip in a canary. Probably a more refined variant of this idea
    would be necessary to deal with diagonal influence and
    the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
    in this thread.

    - anton

  • From Anton Ertl@21:1/5 to MitchAlsup on Sat Feb 3 17:45:31 2024
    [email protected] (MitchAlsup) writes:
    Anton Ertl wrote:


    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    Yes, doing it in analog has its charms. However, I see the following
    difficulties:

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    You know what its value should be and you raise hell when it is not as
    expected.

    So that is about detecting Rowhammer after the fact. Yes, you could
    do that when the row is refreshed. The only problem is that by then
    the attacker could have extracted the secret(s) with the
    Rowhammer-based attack. Better than nothing, but still not a very
    attractive approach.

    I prefer a solution that detects that a row might suffer a bit flip
    after several more accesses, and refreshes the row before that happens.
    And I don't think that this can be implemented with an analog canary
    that works like a DRAM cell; but I am not a solid-state physicist,
    maybe there is a way.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup@21:1/5 to Anton Ertl on Sat Feb 3 19:10:56 2024
    Anton Ertl wrote:

    [email protected] (MitchAlsup) writes:
    Anton Ertl wrote:


    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    Yes, doing it in analog has its charms. However, I see the following
    difficulties:

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    You know what its value should be and you raise hell when it is not as
    expected.

    So that is about detecting Rowhammer after the fact. Yes, you could
    do that when the row is refreshed. The only problem is that by then
    the attacker could have extracted the secret(s) with the
    Rowhammer-based attack. Better than nothing, but still not a very
    attractive approach.

    I prefer a solution that detects that a row might suffer a bit flip
    after several more accesses, and refreshes the row before that happens.
    And I don't think that this can be implemented with an analog canary
    that works like a DRAM cell; but I am not a solid-state physicist,
    maybe there is a way.

    Sooner or later, designers will have to come to the realization that
    an external DRAM controller can never guarantee everything every DRAM
    actually needs to retain data under all conditions, and the DRAMs
    are going to have to change the interface such that requests flow
    in and results flow out based on the DRAM internal controller--much
    like that of a SATA disk drive.

    Let us face it, the DDR-6 interface model is based on the 16K-bit
    DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
    double data rated, and each step added address bits to RAS and CAS.

    I suspect when this happens, the DRAMs will partition the inbound
    address into 3 or 4 sections, and use each section independently:
    Bank-Row-Column or block-bank-row-column.

    In addition each building block will be internally self timed, no
    external need to refresh the bank-row, and the only non access
    command in the arsenal is power-down and power-up.

    You can only put so much lipstick on a pig.

    - anton

  • From EricP@21:1/5 to Anton Ertl on Sun Feb 4 00:00:12 2024
    Anton Ertl wrote:
    EricP <[email protected]> writes:
    MitchAlsup wrote:
    Michael S wrote:

    Original RH required very high hammering rate that certainly can't be
    achieved by playing with associativity of L3 cache.
    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    The problem here is the fact that DRAMs do not use linear decoders, so
    address X and address X+1 do not necessarily share paired word lines.
    The word lines could be as far as ½ the block away from each other.

    The DRAM decoders are faster and smaller when there is a Gray-like code
    imposed on the logical-address to physical-word-line mapping. This also
    happens in SRAM decoders. Going back and looking at the most used logical
    to physical mapping shows that while X and X+1 can (occasionally) be side
    by side, X, X+1 and X+2 should never be 3 word lines in a row.
    A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.

    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
    Admittedly, if you just update the counter for a specific row and then
    refresh all rows in the blast radius when a limit is reached, you
    may get many more refreshes than the minimum necessary, but given that
    normal programs usually do not hammer specific row ranges, the
    additional refreshes may still be relatively few in non-attack
    situations (and when being attacked, you prefer lower DRAM performance
    to a successful attack).

    They said that the current threshold for causing flips in an immediate
    neighbor is 4800 activations, but with a blast radius of +-4 that
    can be in any of the 8 neighbors, so your counter threshold will have
    to trigger refresh at 1/8 of that level or every 600 activations.

    And as the dram features get smaller that threshold number will go down
    and probably the blast radius will go up. So this could have scaling
    issues in the future.

    Alternatively, a kind of cache could be used. Keep counts of N most
    recently accessed rows, remove the row on refresh; when accessing a
    row that has not been in the cache, evict the entry for the row with
    the lowest count C, and set the count of the loaded row to C+1. When
    a count (or ensemble of counts) reaches the limit, refresh every row.

    That would be a CAM or assoc sram and would have to hold a large
    number of entries. This would have to be in the memory controller.

    This would take much less memory, but require finding the entry with
    the lowest count. By dividing the cache into sets, this becomes more
    realistic; upon reaching a limit, only the rows in the blast radius of
    the lines in a set need to be refreshed.

    I was wondering if each row could have a "canary" bit,
    a specially weakened bit that always flips early.
    This would also intrinsically handle the cases of effects
    falling off over the +-3 adjacent rows.

    Then a giant 2 million input OR gate would tell us if any row's
    canary had flipped.

    Yes, doing it in analog has its charms. However, I see the following difficulties:

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    The canary would have to be a little more complicated than a standard
    storage cell because it has to compare the cell to the expected value
    and then drive an output transistor to pull down a dynamic bit line
    for a wired-OR of all the canaries in a bank.
    Hopefully that would isolate the canary from its read bit line changes.

    Fitting this into a dram row could be a problem.
    This would all have the same height as a normal row to fit horizontally
    along a dram row so it didn't bugger up the row spacing.

    * To flip a bit in one direction, AFAIK the hammering rows have to
    have a specific content. I guess with a blast radius of 4 rows on
    each side, you could have 4 columns. Each row has a canary in one
    of these columns and the three adjacent bits in this column are
    attacker bits that have the value that is useful for effecting a bit
    flip in a canary. Probably a more refined variant of this idea
    would be necessary to deal with diagonal influence and
    the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
    in this thread.

    - anton

    Each canary might be 3 cells with alternating patterns,
    even row numbers are inited to 010 and odd rows to 101,
    positioned in vertical columns. Presumably this would put
    the maximum and a predictable stress on the center bit.
    Since the expected value for each row is hard wired it is
    easy to test if it changes.
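
    As a software stand-in for the hard-wired comparison and the wired-OR
    across a bank, the check might look like the sketch below. The function
    names are invented for illustration, and each vector entry is assumed to
    hold one row's 3-bit canary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hard-wired expected pattern: 010 for even rows, 101 for odd rows.
constexpr std::uint8_t expectedCanary(std::size_t row) {
    return (row % 2 == 0) ? 0b010 : 0b101;
}

// Stand-in for the wired-OR: true if any row's canary differs from its
// expected pattern.
bool anyCanaryFlipped(const std::vector<std::uint8_t>& canaries) {
    for (std::size_t row = 0; row < canaries.size(); ++row) {
        assert(canaries[row] <= 0b111);  // only three canary cells per row
        if (canaries[row] != expectedCanary(row)) return true;
    }
    return false;
}
```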

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sun Feb 4 21:12:57 2024
    Anton Ertl wrote:

    EricP <[email protected]> writes:
    MitchAlsup wrote:
    Michael S wrote:

    Original RH required very high hammering rate that certainly can't be
    achieved by playing with associativity of L3 cache.

    Newer multiside hammering probably can do it in theory, but it would be
    very difficult in practice.

    The problem here is the fact that DRAMs do not use linear decoders, so
    address X and address X+1 do not necessarily share paired word lines.
    The word lines could be as far as ½ the block away from each other.

    The DRAM decoders are faster and smaller when there is a Gray-like code
    imposed on the logical-address to physical-word-line mapping. This also
    happens in SRAM decoders. Going back and looking at the most used logical
    to physical mapping shows that while X and X+1 can (occasionally) be side
    by side, X, X+1 and X+2 should never be 3 word lines in a row.

    A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.

    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.

    You are comparing a 16-bit incrementor and its associated flip-flop
    with a single transistor divided by the number of them in a word. My
    guess is that you are off by 20× (should be close to 4%)

  • From Anton Ertl@21:1/5 to [email protected] on Mon Feb 5 09:08:34 2024
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    EricP <[email protected]> writes:
    A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.

    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.

    You are comparing a 16-bit incrementor and its associated flip-flop
    with a single transistor divided by the number of them in a word.

    I was thinking about counting each access only when the cache line is
    accessed. Then there needs to be only one incrementor per bank, and
    the counter can be stored in DRAM like the payload data.

    But thinking about it again, I wonder how counters would be reset.
    Maybe, when the counter reaches the limit, all lines in its blast
    radius are refreshed, and the counter of the present line is reset to
    0.

    Another disadvantage would be that we have to make decisions about
    possible rowhammering only based on one counter, and have to trigger
    refreshes of all lines in the blast radius based on worst-case
    scenarios (i.e., assuming that other rows in the blast radius have any
    count up to the limit).

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Alternatively, if you want to invest more, one could follow your idea
    and have counter SRAM (maybe including counting circuitry) for each
    row; each refresh of a line would increment the counters in the blast
    radius by an appropriate amount, and when a counter reaches its limit,
    it would trigger a refresh of that row.

    My guess is that you are off by 20× (should be close to 4%)

    Even 4% is not "impractical".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to EricP on Mon Feb 5 09:35:21 2024
    EricP <[email protected]> writes:
    Anton Ertl wrote:
    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
    Admittedly, if you just update the counter for a specific row and then
    refresh all rows in the blast radius when a limit is reached, you
    may get many more refreshes than the minimum necessary, but given that
    normal programs usually do not hammer specific row ranges, the
    additional refreshes may still be relatively few in non-attack
    situations (and when being attacked, you prefer lower DRAM performance
    to a successful attack).

    They said that the current threshold for causing flips in an immediate
    neighbor is 4800 activations, but with a blast radius of +-4 that
    can be in any of the 8 neighbors, so your counter threshold will have
    to trigger refresh at 1/8 of that level or every 600 activations.

    So only 10 bits of counter are necessary, reducing the overhead to
    0.125%:-).
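
    The arithmetic behind these numbers, as a small sketch with made-up
    function names. For an 8192-bit row it gives roughly 0.12%, in line with
    the figure above.

```cpp
#include <cstdint>

// Per-row trigger count when any of the 2 * radius neighbours may be hammered.
constexpr unsigned triggerCount(unsigned flipThreshold, unsigned radius) {
    return flipThreshold / (2 * radius);
}

// Smallest number of bits that can represent maxValue.
constexpr unsigned bitsFor(unsigned maxValue) {
    unsigned bits = 0;
    while (bits < 32 && (std::uint64_t{1} << bits) <= maxValue) ++bits;
    return bits;
}

// Counter storage as a percentage of the row size (storage bits only).
constexpr double overheadPercent(unsigned counterBits, unsigned rowBits) {
    return 100.0 * counterBits / rowBits;
}
```

    triggerCount(4800, 4) is 600 and bitsFor(600) is 10.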

    And as the dram features get smaller that threshold number will go down
    and probably the blast radius will go up. So this could have scaling
    issues in the future.

    Yes.

    Alternatively, a kind of cache could be used. Keep counts of N most
    recently accessed rows, remove the row on refresh; when accessing a
    row that has not been in the cache, evict the entry for the row with
    the lowest count C, and set the count of the loaded row to C+1. When
    a count (or ensemble of counts) reaches the limit, refresh every row.

    That would be a CAM or assoc sram and would have to hold a large
    number of entries. This would have to be in the memory controller.

    Possibly. Recent DRAMs also support self-refresh (to allow powering
    down the connection to the memory controller); this kind of stuff
    could also be on the DRAM device, avoiding all the problems that
    memory controllers have with knowing the characteristics of the DRAM
    device.

    * How do you measure whether a bit has flipped without refreshing it
    and thus resetting the canary?

    The canary would have to be a little more complicated than a standard
    storage cell because it has to compare the cell to the expected value

    Maybe capacitative coupling (as used for flash AFAIK) could be used to
    measure the contents of the canary without discharging it. There
    still would be tunneling, as in Rowhammer itself, but I guess one
    could account for that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From Anton Ertl@21:1/5 to MitchAlsup on Mon Feb 5 09:48:52 2024
    [email protected] (MitchAlsup) writes:
    Sooner or later, designers will have to come to the realization that
    an external DRAM controller can never guarantee everything every DRAM
    actually needs to retain data under all conditions, and the DRAMs
    are going to have to change the interface such that requests flow
    in and results flow out based on the DRAM internal controller--much
    like that of a SATA disk drive.

    Let us face it, the DDR-6 interface model is based on the 16K-bit
    DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
    double data rated, and each step added address bits to RAS and CAS.

    I don't know about DDR6, but the DDR5 command interface is
    significantly more complex <https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
    asynchronous DRAM.

    I suspect when this happens, the DRAMs will partition the inbound
    address into 3 or 4 sections, and use each section independently:
    Bank-Row-Column or block-bank-row-column.

    Looking at the commands from the link above, Activate already
    transfers the row in two pieces, and the read and write are also
    transferred in two pieces.

    In addition each building block will be internally self timed, no
    external need to refresh the bank-row, and the only non access
    command in the arsenal is power-down and power-up.

    Self-refresh is already there, but AFAIK only used when processing is suspended.

    However, there are many commands, many more than in the 16kx1 DRAMs of
    old. What would make them go in the direction of simplifying the
    interface? The hardest part these days seems to be getting the high
    transfer rates to work, the rest of the interface is probably
    comparatively easy.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From MitchAlsup1@21:1/5 to Anton Ertl on Mon Feb 5 22:30:19 2024
    Anton Ertl wrote:

    [email protected] (MitchAlsup) writes:
    Sooner or later, designers will have to come to the realization that
    an external DRAM controller can never guarantee everything every DRAM
    actually needs to retain data under all conditions, and the DRAMs
    are going to have to change the interface such that requests flow
    in and results flow out based on the DRAM internal controller--much
    like that of a SATA disk drive.

    Let us face it, the DDR-6 interface model is based on the 16K-bit
    DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
    double data rated, and each step added address bits to RAS and CAS.

    I don't know about DDR6, but the DDR5 command interface is
    significantly more complex <https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
    asynchronous DRAM.

    I suspect when this happens, the DRAMs will partition the inbound
    address into 3 or 4 sections, and use each section independently:
    Bank-Row-Column or block-bank-row-column.

    Looking at the commands from the link above, Activate already
    transfers the row in two pieces, and the read and write are also
    transferred in two pieces.

    In addition each building block will be internally self timed, no
    external need to refresh the bank-row, and the only non access
    command in the arsenal is power-down and power-up.

    Self-refresh is already there, but AFAIK only used when processing is suspended.

    My DRAM controller (AMD Opteron rev G) used ACTivate commands instead of refresh commands to refresh rows in DDR2 DRAM. The timings were better.
    It just did not come back and ask for data from the RASed row.

    However, there are many commands, many more than in the 16kx1 DRAMs of
    old. What would make them go in the direction of simplifying the
    interface?

    Pins that are less expensive.

    The hardest part these days seems to be getting the high
    transfer rates to work, the rest of the interface is probably
    comparatively easy.

    This is from DDR4 and onward where one has to control drive strength
    and clock edge offsets (with a DLL) to transfer data that fast.

    - anton

  • From EricP@21:1/5 to Anton Ertl on Tue Feb 6 16:41:00 2024
    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    EricP <[email protected]> writes:
    A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
    So having a counter for each row is impractical.
    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
    You are comparing a 16-bit incrementor and its associated flip-flop
    with a single transistor divided by the number of them in a word.

    I was thinking about counting each access only when the cache line is
    accessed. Then there needs to be only one incrementor per bank, and
    the counter can be stored in DRAM like the payload data.

    Dram row reads are destructive so a single row activate command
    internally has three cycles: read, sense and redrive, restore.

    The counter could be stored in the dram cells and the
    N-bit incrementer integrated into the bit line sense amp latches,
    such that when the activate command does its restore cycle
    it writes back the incremented counter.
    The incremented counter would also be available in the row buffer.

    Since the next precharge can't happen for 40-50 ns we have some
    time to decide what to do next.

    But thinking about it again, I wonder how counters would be reset.
    Maybe, when the counter reaches the limit, all lines in its blast
    radius are refreshed, and the counter of the present line is reset to
    0.

    On a row read if the counter hits its threshold limit the restore
    cycle writes back a count of 0, otherwise the incremented counter.

    The problem is with the +-4 blast radius refreshes. Each of those refreshes
    ages its neighbors, which we need to track, so we can't reset those counters.
    This could cause a write amplification where each refresh repeatedly
    triggers 4 more refreshes.

    It is possible to use the counter as a state machine.
    Something like...
    1) For normal, periodic refreshes set count to some initial value.
    2) For reads increment count and if carry-out then reset to initial value
    and schedule immediate blast refresh of +-4 neighbor rows.
    3) For blast row refresh increment count but don't check for overflow.
    If there is a count overflow it gets detected on its next row read.
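
    As a minimal sketch of the three rules above (not code from the thread):
    the type RowCounter, the 10-bit width and the initial value are
    illustrative assumptions, chosen so that the 600th read after a reset
    triggers the blast refresh. Rule 3 is modelled here by saturating at the
    maximum count, which is one assumed way to let the next read detect the
    overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-row counter; all constants are assumptions for illustration.
struct RowCounter {
    static constexpr std::uint16_t kLimit   = 1024;              // 10-bit counter range
    static constexpr std::uint16_t kTrigger = 600;               // reads per blast refresh
    static constexpr std::uint16_t kMax     = kLimit - 1;        // saturation value
    static constexpr std::uint16_t kInitial = kLimit - kTrigger; // 424

    std::uint16_t count = kInitial;

    // Rule 1: a normal periodic refresh sets the count to the initial value.
    void periodicRefresh() { count = kInitial; }

    // Rule 2: a row read increments the count. If the increment would carry
    // out, reset to the initial value and report that the +-4 neighbour rows
    // need an immediate blast refresh.
    bool read() {
        if (count >= kMax) {
            count = kInitial;
            return true;
        }
        ++count;
        return false;
    }

    // Rule 3: a blast refresh from a neighbour increments the count without
    // acting on overflow. Saturating keeps the overflow visible to read().
    void blastRefresh() {
        if (count < kMax) ++count;
    }
};
```

    With these assumed constants, 599 reads pass quietly and the 600th returns
    true. A row that has only ever been blast-refreshed eventually saturates
    and triggers on its first read.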

    Another disadvantage would be that we have to make decisions about
    possible rowhammering only based on one counter, and have to trigger refreshes of all lines in the blast radius based on worst-case
    scenarios (i.e., assuming that other rows in the blast radius have any
    count up to the limit).

    Yes, unless there is a way to infer the total counts for the neighbors.
    Bloom filter?
    But see below.

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Let's see how bad this is.

    A single-row threshold of 4800 activations divided across a blast radius
    of 8 rows gives a trigger count of 600. Each trigger causes an extra
    8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole DRAM is refreshed every 64 ms, resetting all the counters,
    so the counts are not cumulative.

    That overhead is only going to grow as dram density increases.
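
    The overhead estimate above can be reproduced with a short sketch. The
    function names are made up for illustration; the inputs (4800 activations,
    8 neighbour rows) are the figures quoted in this thread, and the result is
    the worst case where one row is activated continuously.

```cpp
#include <cassert>

// Activations of one row allowed before its neighbours must be refreshed,
// assuming the disturbance budget is shared by all rows in the blast radius.
constexpr int triggerCount(int flipThreshold, int blastRows) {
    return flipThreshold / blastRows;
}

// Extra row refreshes per activation when a single row is hammered nonstop.
constexpr double extraRefreshFraction(int blastRows, int trigger) {
    return static_cast<double>(blastRows) / trigger;
}

static_assert(triggerCount(4800, 8) == 600, "4800 activations over 8 rows");
```

    extraRefreshFraction(8, triggerCount(4800, 8)) evaluates to roughly 0.0133,
    i.e. the 1.3% figure. A lower flip threshold or a wider blast radius raises
    it quickly, which is the scaling concern.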

  • From MitchAlsup1@21:1/5 to EricP on Sat Feb 10 23:20:17 2024
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC repairs ?!?
    with the potential for a few ECC repair fails ?!!?

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600 trigger count. That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the counters
    so the counts are not cumulative.

    I think what RowPress tells us is that waiting 60± ms and then refreshing
    every row is worse for data retention than spreading the refreshes rather
    evenly over the 64 ms maximum interval.

    That overhead is only going to grow as dram density increases.

    So are all the attack vectors.

  • From MitchAlsup1@21:1/5 to Anton Ertl on Sat Feb 10 23:24:18 2024
    Anton Ertl wrote:

    EricP <[email protected]> writes:
    Anton Ertl wrote:
    A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
    Admittedly, if you just update the counter for a specific row and the
    refresh all rows in the blast radius when a limit is reached, you
    may get many more refreshes than the minimum necessary, but given that
    normal programs usually do not hammer specific row ranges, the
    additional refreshes may still be relatively few in non-attack
    situations (and when being attacked, you prefer lower DRAM performance
    to a successful attack).

    They said that the current threshold for causing flips in an immediate
    neighbor is 4800 activations, but with a blast radius of +-4 that
    can be in any of the 8 neighbors, so your counter threshold will have
    to trigger refresh at 1/8 of that level or every 600 activations.

    So only 10 bits of counter are necessary, reducing the overhead to
    0.125%:-).

    And as the dram features get smaller that threshold number will go down
    and probably the blast radius will go up. So this could have scaling
    issues in the future.

    Yes.

    If the DRAM manufacturers placed a Faraday shield over the DRAM arrays
    {a ground plane} the blast radius goes from a linear charge sharing issue
    to a quadratic charge sharing issue. Such a ground plane is a layer of
    metal with a single <never changing> voltage on it. This might change the
    blast radius from 8 to 2.

    {{We did this kind of thing for SRAM so we could run large signal count
    busses over the SRAM arrays.}}

    - anton

  • From Anton Ertl@21:1/5 to [email protected] on Sun Feb 11 13:20:50 2024
    [email protected] (MitchAlsup1) writes:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC repairs ?!?
    with the potential for a few ECC repair fails ?!!?

    That's not the issue at hand here. The issue at hand here is whether
    the relatively cheap mechanism I described has an acceptable number of additional refreshes during normal operation, or whether a more
    expensive (in terms of area) mechanism is needed to fix Rowhammer.

    Concerning ECC, many computers do not have ECC memory, and for those
    that have it, ECC does not reliably fix Rowhammer; if it did, the fix
    would be simple: Use ECC, which is a good idea anyway, even if it
    costs 25% more chips in case of DDR5 DIMMs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <[email protected]>

  • From EricP@21:1/5 to All on Sun Feb 11 10:46:02 2024
    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC repairs ?!?
    with the potential for a few ECC repair fails ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600 trigger
    count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the counters
    so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then refreshing
    every row
    is worse for data retention than spreading the refreshes out over the
    64ms max
    interval rather evenly.

    Would any memory controller actually do that, i.e. refresh the whole
    DRAM in one big burst instead of periodically by row?
    I would expect doing so would introduce big stalls into memory access.

    64 ms / 8192 rows per block = 7.8125 us row interval.
    Let's say 50 ns row refresh time.
    So that's either 50 ns every 7.8 us
    versus 8192*50 ns = 409.6 us memory stall every 64 ms.
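
    The two schedules compared above can be worked through in a small sketch.
    The function names are illustrative, and the 50 ns row refresh time is the
    assumed figure from this post, not a datasheet value.

```cpp
#include <cassert>

// Time between row refreshes when they are spread evenly over the window.
constexpr double refreshIntervalUs(double windowMs, int rows) {
    return windowMs * 1000.0 / rows;
}

// Length of the single stall if all rows are refreshed back to back.
constexpr double burstStallUs(int rows, double rowRefreshNs) {
    return rows * rowRefreshNs / 1000.0;
}

// Fraction of the window spent refreshing; identical for both schedules.
constexpr double busyFraction(int rows, double rowRefreshNs, double windowMs) {
    return rows * rowRefreshNs / (windowMs * 1.0e6);
}
```

    For 8192 rows, 50 ns and a 64 ms window this gives a 7.8125 us interval,
    a 409.6 us burst stall, and about 0.64% of the time spent refreshing in
    either case. The two schedules differ in worst-case latency, not in total
    refresh work.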

  • From MitchAlsup1@21:1/5 to EricP on Sun Feb 11 19:57:34 2024
    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC repairs ?!?
    with the potential for a few ECC repair fails ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600 trigger
    count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the counters
    so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then refreshing
    every row
    is worse for data retention than spreading the refreshes out over the
    64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically by row?
    I would expect doing so would introduce big stalls into memory access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
    back was active it would allow REF to slip. But on a second timer event
    it would interrupt data transfer and induce 2 refreshes to catch up. In general, this worked well as it almost never happened.

    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.

    When one changes page boundaries the HoB address bits are essentially randomized by the TLB:: why not just close the row at that point ?

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From Michael S@21:1/5 to [email protected] on Mon Feb 12 17:14:26 2024
    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600
    trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the
    counters so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out over
    the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically by
    row? I would expect doing so would introduce big stalls into memory
    access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
    back was active it would allow REF to slip. But on a second timer
    event it would interrupt data transfer and induce 2 refreshes to
    catch up. In general, this worked well as it almost never happened.

    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.

    When one changes page boundaries the HoB address bits are essentially randomized by the TLB:: why not just close the row at that point ?

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From Michael S@21:1/5 to [email protected] on Mon Feb 12 17:27:59 2024
    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600
    trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the
    counters so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out over
    the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically by
    row? I would expect doing so would introduce big stalls into memory
    access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
    back was active it would allow REF to slip. But on a second timer
    event it would interrupt data transfer and induce 2 refreshes to
    catch up. In general, this worked well as it almost never happened.

    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.


    A DDR5 channel is 32 bits wide.
    4096B/(4B/T * 6e9 T/s) = 0.171 usec.
    Or 0.205 usec at a more realistic rate of 5e9 T/s.
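
    The same arithmetic as a small sketch, with a hypothetical helper name. It
    counts only the data burst on a 4-byte-wide channel and ignores command
    overhead, bank conflicts and refresh.

```cpp
#include <cassert>

// Time to move `bytes` over a channel that carries `bytesPerTransfer`
// per transfer at `transfersPerSec`, in microseconds.
constexpr double transferTimeUs(double bytes, double bytesPerTransfer,
                                double transfersPerSec) {
    return bytes / (bytesPerTransfer * transfersPerSec) * 1.0e6;
}
```

    transferTimeUs(4096, 4, 6e9) comes to about 0.171 usec and
    transferTimeUs(4096, 4, 5e9) to about 0.205 usec, so a 4096-byte page
    occupies the channel for well under a microsecond.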

    When one changes page boundaries the HoB address bits are essentially randomized by the TLB:: why not just close the row at that point ?


    Because the memory controller is not aware of CPU page boundaries.
    Besides, in the aarch64 world 16KB pages are rather common. And in the x86
    world "transparent huge pages" are rather common.

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From Scott Lurndal@21:1/5 to Michael S on Mon Feb 12 20:27:13 2024
    Michael S <[email protected]> writes:
    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    Because memory controller is not aware of CPU page boundaries.
    Besides, in aarch64 world 16KB pages are rather common. And in x86
    world "transparent huge pages" are rather common.

    AArch64 supports translation granules of 4k, 16k and 64k. 4K
    and 64K are the most common. While the architecture defines
    16k, an implementation is free to not support it and I'm not aware of
    any widespread usage.

  • From Michael S@21:1/5 to Scott Lurndal on Mon Feb 12 23:12:50 2024
    On Mon, 12 Feb 2024 20:27:13 GMT
    [email protected] (Scott Lurndal) wrote:

    Michael S <[email protected]> writes:
    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    Because memory controller is not aware of CPU page boundaries.
    Besides, in aarch64 world 16KB pages are rather common. And in x86
    world "transparent huge pages" are rather common.

    AArch64 supports translation granules of 4k, 16k and 64k. 4K
    and 64K are the most common. While the architecture defines
    16k, an implementation is free to not support it and I'm not aware of
    any widespread usage.

    I think 16KB is the main page size on Apple. Android is trying the
    same, but so far has problems.
    Apple+Android == approximately 101% of AArch64 total.

  • From MitchAlsup1@21:1/5 to Michael S on Mon Feb 12 22:45:08 2024
    Michael S wrote:

    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary to
    prevent Rowhammer, but that approach may still be good enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600
    trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
    And the whole dram is refreshed every 64 ms reseting all the
    counters so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out over
    the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically by
    row? I would expect doing so would introduce big stalls into memory
    access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
    back was active it would allow REF to slip. But on a second timer
    event it would interrupt data transfer and induce 2 refreshes to
    catch up. In general, this worked well as it almost never happened.

    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.


    DDR5 channel is 32-bit.
    4096B/(4B/T * 6e9 T/s) = 0.171 usec.
    Or 0.205 usec at a more realistic rate of 5e9 T/s.

    When one changes page boundaries the HoB address bits are essentially
    randomized by the TLB:: why not just close the row at that point ?


    Because memory controller is not aware of CPU page boundaries.

    Bits<19:12> changed. How hard is that to detect ??

    Besides, in aarch64 world 16KB pages are rather common. And in x86
    world "transparent huge pages" are rather common.

    Neither of which prevent closing the row to avoid memory retention
    issues.

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From Michael S@21:1/5 to [email protected] on Tue Feb 13 01:15:28 2024
    On Mon, 12 Feb 2024 22:45:08 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary
    to prevent Rowhammer, but that approach may still be good
    enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails ?!!?


    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600
    trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3%
    overhead. And the whole dram is refreshed every 64 ms reseting
    all the counters so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out
    over the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically
    by row? I would expect doing so would introduce big stalls into
    memory access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if
    the back was active it would allow REF to slip. But on a second
    timer event it would interrupt data transfer and induce 2
    refreshes to catch up. In general, this worked well as it almost
    never happened.
    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.


    DDR5 channel is 32-bit.
    4096B/(4B/T * 6e9 T/s) = 0.171 usec.
    Or 0.205 usec at a more realistic rate of 5e9 T/s.

    When one changes page boundaries the HoB address bits are
    essentially randomized by the TLB:: why not just close the row at
    that point ?

    Because memory controller is not aware of CPU page boundaries.

    Bits<19:12> changed. How hard is that to detect ??


    Do you always answer one statement before reading the next statement?

    Besides, in aarch64 world 16KB pages are rather common. And in x86
    world "transparent huge pages" are rather common.

    Neither of which prevent closing the row to avoid memory retention
    issues.


    What scenario of attack do you have in mind?
    I would think that neither in "classic" multi-sided Row Hammer nor in Row
    Press does the attacker have to cross CPU page boundaries. If he (the
    attacker) happens to know that the memory controller likes to close DRAM
    rows on any particular address boundary, then he can easily avoid
    accessing the last cache line before that particular boundary.

    BTW, all these attacks (or should I say, all these PoCs, because I don't
    think that anybody has ever caught a real RH/RP attack launched by a real
    bad guy) depend rather heavily on big or huge pages. They are close to
    impossible with small pages, even when "small" means 16 KB rather than
    4 KB.

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From MitchAlsup1@21:1/5 to Michael S on Tue Feb 13 00:19:18 2024
    Michael S wrote:

    On Mon, 12 Feb 2024 22:45:08 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than necessary
    to prevent Rowhammer, but that approach may still be good
    enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails ?!!?


    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 = 600
    trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3%
    overhead. And the whole dram is refreshed every 64 ms reseting
    all the counters so the counts are not cumulative.

    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out
    over the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of periodically
    by row? I would expect doing so would introduce big stalls into
    memory access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and if
    the back was active it would allow REF to slip. But on a second
    timer event it would interrupt data transfer and induce 2
    refreshes to catch up. In general, this worked well as it almost
    never happened.
    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.


    DDR5 channel is 32-bit.
    4096B/(4B/T * 6e9 T/s) = 0.171 usec.
    Or 0.205 usec at a more realistic rate of 5e9 T/s.

    When one changes page boundaries the HoB address bits are
    essentially randomized by the TLB:: why not just close the row at
    that point ?

    Because memory controller is not aware of CPU page boundaries.

    Bits<19:12> changed. How hard is that to detect ??


    Do you always answer one statement before reading the next statement?

    I actually wrote the above after writing the below.

    Besides, in aarch64 world 16KB pages are rather common. And in x86
    world "transparent huge pages" are rather common.

    Neither of which prevent closing the row to avoid memory retention
    issues.


    What scenario of attack do you have in mind?

    RowPress depends on keeping the row open too long--clearly evident in the charts in the document.

    I would think that neither in "classic" multi-side Row Hammer nor in Row
    Press attacker has to cross CPU page boundaries. If he (attacker)
    happens to know that memory controller likes to close DRAM rows on any
    particular address boundary, then he can easily avoid accessing last
    cache line before that particular boundary.

    RowHammer depends on closing the row too often.

    Performance (single CPU) depends on allowing the open row to service
    several pending requests streaming data at CAS access speeds.

    There is a balance to be found by preventing RowHammer from opening
    nearby rows too often and in preventing RowPress from holding them
    open for too long.

    I happen to think (without evidence beyond that of the RowPress document)
    that the balance is distributing refreshes evenly across the refresh
    interval (as evidenced in the charts in the RowPress document). It ends up
    that with modern DDR this enables about 4096 bytes to be read/written
    to a row before closing it (within a factor of 2-4).

    BTW, all these attacks (or should I say, all these PoCs, because I don't
    think that somebody ever caught real RH/RP attack launched by real bad
    guy) rather heavily depend on big or huge pages. They are close to
    impossible with small pages, even when "small" means 16 KB rather than
    4 KB.

    versus 8192*50 ns = 409.6 us memory stall every 64 ms.

  • From Michael S@21:1/5 to [email protected] on Tue Feb 13 17:19:16 2024
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Mon, 12 Feb 2024 22:45:08 +0000
    [email protected] (MitchAlsup1) wrote:

    Michael S wrote:

    On Sun, 11 Feb 2024 19:57:34 +0000
    [email protected] (MitchAlsup1) wrote:

    EricP wrote:

    MitchAlsup1 wrote:
    EricP wrote:

    Anton Ertl wrote:
    [email protected] (MitchAlsup1) writes:
    Anton Ertl wrote:

    Both disadvantages lead to far more refreshes than
    necessary to prevent Rowhammer, but that approach may
    still be good enough.

    Would you rather have a few more refreshes or a few more ECC
    repairs ?!? with the potential for a few ECC repair fails
    ?!!?

    I believe Rowhammer and RowPress can flip many bits at once.
    Too many for SECDED.

    Lets see how bad this is.

    The single line threshold of 4800 and blast radius of 8 =
    600 trigger count.
    That triggers an extra 8 row refreshes, so 8/600 = 1.3%
    overhead. And the whole dram is refreshed every 64 ms
    reseting all the counters so the counts are not cumulative.


    I think what RowPress tells us that waiting 60± ms and then
    refreshing every row
    is worse for data retention than spreading the refreshes out
    over the 64ms max
    interval rather evenly.

    Would any memory controller that would do that,
    refresh the whole dram in one big burst instead of
    periodically by row? I would expect doing so would introduce
    big stalls into memory access.

    64 ms / 8192 rows per block = 7.8125 us row interval.

    My DRAM controller (Opteron RevF) had a timer set about 7µs and
    if the back was active it would allow REF to slip. But on a
    second timer event it would interrupt data transfer and induce 2
    refreshes to catch up. In general, this worked well as it almost
    never happened.
    Lets say 50 ns row refresh time.
    So thats either 50 ns every 7.8 us

    A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.


    DDR5 channel is 32-bit.
    4096B/(4B/T * 6e9 T/s) = 0.171 usec.
    Or 0.205 usec at a more realistic rate of 5e9 T/s.

    When one changes page boundaries the HoB address bits are
    essentially randomized by the TLB:: why not just close the row
    at that point ?

    Because memory controller is not aware of CPU page boundaries.


    Bits<19:12> changed. How hard is that to detect ??


    Do you always answer one statement before reading the next
    statement?

    I actually wrote the above after writing the below.

    Besides, in aarch64 world 16KB pages are rather common. And in
    x86 world "transparent huge pages" are rather common.

    Neither of which prevent closing the row to avoid memory retention
    issues.


    What scenario of attack do you have in mind?

    RowPress depends on keeping the row open too long--clearly evident in
    the charts in the document.


    Clarification for casual observers who didn't bother to read the RowPress
    paper: the RowPress attack does not depend on keeping the row open
    continuously.
    Short interruptions actually greatly improve the effectiveness of the
    attack, significantly increasing the BER for a given attack duration. After
    all, RowPress *is* a variant of RowHammer.
    For a given interruption rate, longer interruptions reduce the
    effectiveness of the attack, but not dramatically so. For example, at the
    most practically important interruption rate of 128 kHz (period = 7.81
    usec), increasing the duration of the off interval from the absolute
    minimum allowed by the protocol (~50 ns) to 2 usec reduces the efficiency
    of the attack only by a factor of 2 or 3.


    I would think that neither in "classic" multi-side Row Hammer nor
    in Row Press attacker has to cross CPU page boundaries. If he
    (attacker) happens to know that memory controller likes to close
    DRAM rows on any particular address boundary, then he can easily
    avoid accessing last cache line before that particular boundary.

    RowHammer depends on closing the row too often.


    Yes, except that it is unknown whether the major RH impact comes from
    closing the row or from opening it. The latter is more likely. But since
    the rate of opening and closing is the same, this finer distinction is
    not important.

    Performance (single CPU) depends on allowing the open row to service
    several pending requests streaming data at CAS access speeds.

    There is a balance to be found by preventing RowHammer from opening
    nearby rows too often and in preventing RowPress from holding them
    open for too long.


    There is no balance. Opening nearby rows too often helps both variants
    of attack.

    I happen to think (without evidence beyond that of the RowPress
    document) that the balance is distributing refreshes evenly across
    the refresh interval (as evidenced in the charts in the RowPress
    document). It ends up that with modern DDR this enables about 4096
    bytes to be read/written to a row before closing it (within a factor
    of 2-4).


    Huh?
    A DDR4-3200 channel transfers data at a rate approaching 25.6 GB/s. DDR5
    will be the same when it reaches its projected maximum speed of 6400.
    25.6 GB/s * 7.81 usec = 200,000 bytes. That's a factor of 49 rather than
    2-4.
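
    A quick sketch of that estimate, with hypothetical helper names. It assumes
    the channel streams at its 25.6 GB/s peak for the whole 7.81 usec refresh
    interval, so it is an upper bound.

```cpp
#include <cassert>

// Bytes a channel can move in one refresh interval at peak bandwidth.
constexpr double bytesPerInterval(double bytesPerSec, double intervalUs) {
    return bytesPerSec * intervalUs * 1.0e-6;
}

// How many times larger that is than a given page size.
constexpr double ratioToPage(double bytes, double pageBytes) {
    return bytes / pageBytes;
}
```

    bytesPerInterval(25.6e9, 7.81) is 199,936 bytes, and dividing by a 4096-byte
    page gives a ratio of about 48.8, in line with the factor of 49 quoted above.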

  • From EricP@21:1/5 to Michael S on Tue Feb 13 11:24:10 2024
    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowPress depends on keeping the row open too long--clearly evident in
    the charts in the document.


    Clarification for casual observers that didn't bother to read Row
    Press paper: RowPress attack does not depend on keeping row open
    continuously.
    Short interruptions actually greatly improve effectiveness of attack,
    significantly increasing BER for a given duration of attack. After
    all, RowPress *is* a variant of RowHammer.

    RowPress documents that keeping the aggressor row open longer lowers
    the number of opens (RowHammers) needed to cause bit flips in the
    adjacent rows.
    Also the paper notes that DRAM manufacturers, e.g. Micron and Samsung,
    already document that keeping a row open longer can cause
    read-disturbance. What's new is that the paper documents the interaction
    between row activation time and the subsequent number of opens
    (RowHammers) needed to flip a bit.

    Also note that different bits are susceptible to RowPress and RowHammer.
    See section 4.3

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

    "RowPress breaks memory isolation by keeping a DRAM row open for a long
    period of time, which disturbs physically nearby rows enough to cause
    bitflips. We show that RowPress amplifies DRAM’s vulnerability to
    read-disturb attacks by significantly reducing the number of row
    activations needed to induce a bitflip by one to two orders of
    magnitude under realistic conditions. In extreme cases, RowPress induces
    bitflips in a DRAM row when an adjacent row is activated only once."

    "We show that keeping a DRAM row (i.e., aggressor row) open for a long
    period of time (i.e., a large aggressor row ON time, tAggON) disturbs
    physically nearby DRAM rows. Doing so induces bitflips in the victim row
    without requiring (tens of) thousands of activations to the aggressor row."

    For a given interruption rate, longer interruptions reduce effectiveness
    of attack, but not dramatically so. For example, for most practically
    important interruption rate of 128 KHz (period=7.81 usec) increasing
    duration of off interval from absolute minimum allowed by protocol
    (~50ns) to 2 usec reduces efficiency of attack only by factor of 2 or 3.

    Reduced by a factor of up to 363. Under figure 1.

    "We observe that as tAggON increases, compared to the most effective
    RowHammer pattern, the most effective Row-Press pattern reduces ACmin
    1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
    refresh interval (7.8 μs),
    2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
    the maximum allowed tAggON, and
    3) down to only one activation for an extreme tAggON of 30 ms
    (highlighted by dashed red boxes)."

    Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
    increases."

  • From Michael S@21:1/5 to EricP on Tue Feb 13 19:00:30 2024
    On Tue, 13 Feb 2024 11:24:10 -0500
    EricP <[email protected]> wrote:

    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowPress depends on keeping the row open too long--clearly evident
    in the charts in the document.


    Clarification for casual observers that didn't bother to read Row
    Press paper: RowPress attack does not depend on keeping row open
    continuously.
    Short interruptions actually greatly improve effectiveness of attack,
    significantly increasing BER for a given duration of attack. After
    all, RowPress *is* a variant of RowHammer.

    RowPress documents that keeping the aggressor row open longer lowers
    the number of opens (RowHammers) needed to cause bit flips in the
    adjacent rows.

    Correct, but irrelevant.

    Also the paper notes that DRAM manufacturers, e.g. Micron and
    Samsung, already document that keeping a row open longer can cause
    read-disturbance. What's new is that the paper documents the interaction
    between row activation time and the subsequent number of opens
    (RowHammers) needed to flip a bit.


    Correct and relevant, but not to the issue at hand which is criticism
    of Mitch's ideas of mitigation.

    Also note that different bits are susceptible to RowPress and
    RowHammer. See section 4.3

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

    "RowPress breaks memory isolation by keeping a DRAM row open for a
    long period of time, which disturbs physically nearby rows enough to
    cause bitflips. We show that RowPress amplifies DRAM’s vulnerability
    to read-disturb attacks by significantly reducing the number of row
    activations needed to induce a bitflip by one to two orders of
    magnitude under realistic conditions. In extreme cases, RowPress
    induces bitflips in a DRAM row when an adjacent row is activated only
    once."

    "We show that keeping a DRAM row (i.e., aggressor row) open for a long
    period of time (i.e., a large aggressor row ON time, tAggON) disturbs
    physically nearby DRAM rows. Doing so induces bitflips in the victim
    row without requiring (tens of) thousands of activations to the
    aggressor row."

    For a given interruption rate, longer interruptions reduce
    effectiveness of attack, but not dramatically so. For example, for
    most practically important interruption rate of 128 KHz
    (period=7.81 usec) increasing duration of off interval from
    absolute minimum allowed by protocol (~50ns) to 2 usec reduces
    efficiency of attack only by factor of 2 or 3.

    Reduced by a factor of up to 363. Under figure 1.

    "We observe that as tAggON increases, compared to the most effective
    RowHammer pattern, the most effective Row-Press pattern reduces ACmin
    1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
    refresh interval (7.8 μs),
    2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
    the maximum allowed tAggON, and
    3) down to only one activation for an extreme tAggON of 30 ms
    (highlighted by dashed red boxes)."

    Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
    increases."


    ACmin by itself is a wrong measure of efficiency of attack.
    The right measure is reciprocal of the total duration of attack.
    At any given duty cycle reciprocal of the total duration of attack
    grows with increased rate of interruptions (a.k.a. hammering rate).
    The general trend is the same as for all other RH variants, the only
    difference being that the dependency on hammering rate is somewhat weaker.

    Relatively weak influence of duty cycle itself is shown in figure 22.

    The practical significance of RowPress is due to two factors.
    (1) is the one you mentioned above - it can flip
    different bits from those flippable by other RH variants.
    (2) is that it is not affected at all by the DDR4 TRR
    mitigation attempt.

    The third, less important factor is that RowPress appears quite robust
    to differences between major manufacturers.

    However, one should not overlook that efficiency of RowPress attacks
    when measured by the most important criterion of BER per duration of
    attack is many times lower than earlier techniques of double-sided and
    multi-sided hammering.

  • From EricP@21:1/5 to Michael S on Tue Feb 13 12:05:18 2024
    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowHammer depends on closing the row too often.

    Yes, except that it is unknown whether major RH impact is done by
    closing the row or by opening it. The latter is more likely. But since
    the rate of opening and closing is the same, this finer difference is
    not important.

    A Deeper Look into RowHammer's Sensitivities: Experimental Analysis
    of Real DRAM Chips and Implications on Future Attacks and Defenses, 2021
    https://arxiv.org/pdf/2110.10291

    That paper pre-dates the RowPress one and notes:

    "6.1 Impact of Aggressor Row's On-Time

    Obsv. 8. As the aggressor row stays active longer (i.e., tAggON
    increases), more DRAM cells experience RowHammer bit flips and they
    experience RowHammer bit flips at lower hammer counts.

    Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
    increases in DRAM chips from all four manufacturers.

    6.2 Impact of Aggressor Row's Off-Time

    Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF increases),
    fewer DRAM cells experience RowHammer bit flips and they
    experience RowHammer bit flips at higher hammer counts.

    Obsv. 11. RowHammer vulnerability consistently reduces as
    tAggOFF increases in DRAM chips from all four manufacturers."

  • From Michael S@21:1/5 to EricP on Wed Feb 14 10:50:28 2024
    On Tue, 13 Feb 2024 12:05:18 -0500
    EricP <[email protected]> wrote:

    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowHammer depends on closing the row too often.

    Yes, except that it is unknown whether major RH impact is done by
    closing the row or by opening it. The latter is more likely. But
    since the rate of opening and closing is the same, this finer
    difference is not important.

    A Deeper Look into RowHammer's Sensitivities: Experimental Analysis
    of Real DRAM Chips and Implications on Future Attacks and Defenses,
    2021 https://arxiv.org/pdf/2110.10291

    That paper pre-dates the RowPress one and notes:

    "6.1 Impact of Aggressor Row's On-Time

    Obsv. 8. As the aggressor row stays active longer (i.e., tAggON
    increases), more DRAM cells experience RowHammer bit flips and they
    experience RowHammer bit flips at lower hammer counts.

    Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
    increases in DRAM chips from all four manufacturers.

    6.2 Impact of Aggressor Row's Off-Time

    Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF
    increases), fewer DRAM cells experience RowHammer bit flips and they
    experience RowHammer bit flips at higher hammer counts.

    Obsv. 11. RowHammer vulnerability consistently reduces as
    tAggOFF increases in DRAM chips from all four manufacturers."

    novaBBS has not been updating since yesterday, so Mitch is not aware of
    our latest posts.

  • From EricP@21:1/5 to Michael S on Wed Feb 14 10:51:47 2024
    Michael S wrote:
    On Tue, 13 Feb 2024 11:24:10 -0500
    EricP <[email protected]> wrote:
    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowPress depends on keeping the row open too long--clearly evident
    in the charts in the document.

    Clarification for casual observers that didn't bother to read Row
    Press paper: RowPress attack does not depend on keeping row open
    continuously.
    Short interruptions actually greatly improve effectiveness of attack,
    significantly increasing BER for a given duration of attack. After
    all, RowPress *is* a variant of RowHammer.
    RowPress documents that keeping the aggressor row open longer lowers
    the number of opens (RowHammers) needed to cause bit flips in the
    adjacent rows.

    Correct, but irrelevant.

    It was kinda the whole point of the RowPress paper.

    Also the paper notes that DRAM manufacturers, e.g. Micron and
    Samsung, already document that keeping a row open longer can cause
    read-disturbance. What's new is the paper documents the interaction
    between row activation time and the subsequent number of opens
    (RowHammers) needed to flip a bit.


    Correct and relevant, but not to the issue at hand which is criticism
    of Mitch's ideas of mitigation.

    Also note that different bits are susceptible to RowPress and
    RowHammer. See section 4.3

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

    I just found out that there are two different versions of the RowPress
    paper and I was looking at the older one. The updated version is:

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://arxiv.org/pdf/2306.17061.pdf

    For a given interruption rate, longer interruptions reduce
    effectiveness of attack, but not dramatically so. For example, for
    most practically important interruption rate of 128 KHz
    (period=7.81 usec) increasing duration of off interval from
    absolute minimum allowed by protocol (~50ns) to 2 usec reduces
    efficiency of attack only by factor of 2 or 3.
    Reduced by a factor of up to 363. Under figure 1.

    "We observe that as tAggON increases, compared to the most effective
    RowHammer pattern, the most effective Row-Press pattern reduces ACmin
    1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
    refresh interval (7.8 μs),
    2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
    the maximum allowed tAggON, and
    3) down to only one activation for an extreme tAggON of 30 ms
    (highlighted by dashed red boxes)."

    Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
    increases."


    ACmin by itself is a wrong measure of efficiency of attack.

    I'm not interested in the efficiency of the attack.
    ACmin, the minimum absolute count of opens above which we lose data
    is the number I'm interested in.

    The right measure is reciprocal of the total duration of attack.
    At any given duty cycle reciprocal of the total duration of attack
    grows with increased rate of interruptions (a.k.a. hammering rate).
    The general trend is the same as for all other RH variants, the only
    difference being that the dependency on hammering rate is somewhat weaker.

    Relatively weak influence of duty cycle itself is shown in figure 22.

    Looking at figure 22 on the arxiv version of the paper,
    this is a completely different test. This test was to explain the
    discrepancy between the RowPress results and the earlier cited papers.

    BER is the fraction of DRAM cells in a DRAM row that experience
    bitflips. It's a different measure because RowPress detects when ANY
    data loss begins, not the fraction of lost data bits (efficiency)
    after it kicks in.

    Obsv 16 explains it: the BER for the bottom two lines,
    which are the ones with a long total tA2A, goes up in all graphs
    by a factor of between 10 and about 500, which is the RowPress effect.

    To my eye what this test shows is the PRE phase may *heal* some of the
    damaging effects that the ACT phase causes, but only to a certain point.
    Possibly the PRE phase scavenges the ACT hot injection carriers.

    The practical significance of RowPress is due to two factors.
    (1) is the one you mentioned above - it can flip
    different bits from those flippable by other RH variants.
    (2) is that it is not affected at all by the DDR4 TRR
    mitigation attempt.

    I take away something completely different: there are multiple
    interacting error mechanisms at work here. RowHammer and RowPress are
    likely completely different physics and fixing one won't fix the other.

    It also suggests there may be other similar mechanisms waiting to be found.

    The third, less important factor is that RowPress appears quite robust
    to differences between major manufacturers.

    However, one should not overlook that efficiency of RowPress attacks
    when measured by the most important criterion of BER per duration of
    attack is many times lower than earlier techniques of double-sided and
    multi-sided hammering.

    For me the BER is irrelevant if it is above 0.0.
    I want to know where the errors start which is ACmin.

  • From Michael S@21:1/5 to EricP on Wed Feb 14 19:46:36 2024
    On Wed, 14 Feb 2024 10:51:47 -0500
    EricP <[email protected]> wrote:

    Michael S wrote:
    On Tue, 13 Feb 2024 11:24:10 -0500
    EricP <[email protected]> wrote:
    Michael S wrote:
    On Tue, 13 Feb 2024 00:19:18 +0000
    [email protected] (MitchAlsup1) wrote:
    RowPress depends on keeping the row open too long--clearly
    evident in the charts in the document.

    Clarification for casual observers that didn't bother to read Row
    Press paper: RowPress attack does not depend on keeping row open
    continuously.
    Short interruptions actually greatly improve effectiveness of
    attack, significantly increasing BER for a given duration of
    attack. After all, RowPress *is* a variant of RowHammer.
    RowPress documents that keeping the aggressor row open longer
    lowers the number of opens (RowHammers) needed to cause bit flips
    in the adjacent rows.

    Correct, but irrelevant.

    It was kinda the whole point of the RowPress paper.

    Also the paper notes that DRAM manufacturers, e.g. Micron and
    Samsung, already document that keeping a row open longer can cause
    read-disturbance. What's new is the paper documents the interaction
    between row activation time and the subsequent number of opens
    (RowHammers) needed to flip a bit.


    Correct and relevant, but not to the issue at hand which is
    criticism of Mitch's ideas of mitigation.

    Also note that different bits are susceptible to RowPress and
    RowHammer. See section 4.3

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf

    I just found out that there are two different versions of the RowPress
    paper and I was looking at the older one. The updated version is:

    RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
    https://arxiv.org/pdf/2306.17061.pdf

    For a given interruption rate, longer interruptions reduce
    effectiveness of attack, but not dramatically so. For example, for
    most practically important interruption rate of 128 KHz
    (period=7.81 usec) increasing duration of off interval from
    absolute minimum allowed by protocol (~50ns) to 2 usec reduces
    efficiency of attack only by factor of 2 or 3.
    Reduced by a factor of up to 363. Under figure 1.

    "We observe that as tAggON increases, compared to the most
    effective RowHammer pattern, the most effective Row-Press pattern
    reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
    as large as the refresh interval (7.8 μs),
    2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
    the maximum allowed tAggON, and
    3) down to only one activation for an extreme tAggON of 30 ms
    (highlighted by dashed red boxes)."

    Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
    increases."


    ACmin by itself is a wrong measure of efficiency of attack.

    I'm not interested in the efficiency of the attack.
    ACmin, the minimum absolute count of opens above which we lose data
    is the number I'm interested in.


    You may be interested, but I don't understand why.
    For me, the important thing is how much time it takes until the
    probability of a flip becomes significant.
    Suppose, attack (A) hammers at 5 MHz and has ACmin=5e4. Attack (B)
    hammers at 0.13 MHz (typical for RP in real-world setup) and has
    ACmin=3e3.
    Then I'd say that attack (A) is 2.3 times more dangerous.
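
    The 2.3 figure can be reproduced by comparing the time each attack
    needs to reach its ACmin. This is only a sketch: it assumes a constant
    hammering rate, and the helper name is made up for illustration.

```python
def time_to_acmin_ms(hammer_rate_hz: float, ac_min: float) -> float:
    """Milliseconds needed to issue ac_min activations at a fixed rate."""
    return ac_min / hammer_rate_hz * 1e3

t_a = time_to_acmin_ms(5e6, 5e4)      # attack (A): 5 MHz, ACmin = 5e4
t_b = time_to_acmin_ms(0.13e6, 3e3)   # attack (B): 0.13 MHz, ACmin = 3e3
print(f"A: {t_a:.1f} ms, B: {t_b:.1f} ms, B/A: {t_b / t_a:.1f}")
```

    Attack (A) gets there in about 10 ms and attack (B) in about 23 ms,
    which is where the ratio of 2.3 comes from.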

    Back in the real world, researchers demonstrated that multi-sided
    hammering can have an ACmin that is significantly lower than our imaginary
    attack (A), so the only remaining question is how fast we can hammer
    without triggering TRR. My 5 MHz number is probably hard to achieve for
    an attacker, but 2-3 MHz sounds doable.

    The right measure is reciprocal of the total duration of attack.
    At any given duty cycle reciprocal of the total duration of attack
    grows with increased rate of interruptions (a.k.a. hammering rate).
    The general trend is the same as for all other RH variants, the only
    difference being that the dependency on hammering rate is somewhat weaker.

    Relatively weak influence of duty cycle itself is shown in figure
    22.

    Looking at figure 22 on the arxiv version of the paper,
    this is a completely different test. This test was to explain the
    discrepancy between the RowPress results and the earlier cited papers.

    BER is the fraction of DRAM cells in a DRAM row that experience
    bitflips. It's a different measure because RowPress detects when ANY
    data loss begins, not the fraction of lost data bits (efficiency)
    after it kicks in.

    Obsv 16 explains it: the BER for the bottom two lines,
    which are the ones with a long total tA2A, goes up in all graphs
    by a factor of between 10 and about 500, which is the RowPress effect.

    To my eye what this test shows is the PRE phase may *heal* some of the
    damaging effects that the ACT phase causes, but only to a certain
    point. Possibly the PRE phase scavenges the ACT hot injection
    carriers.

    The practical significance of RowPress is due to two factors.
    (1) is the one you mentioned above - it can flip
    different bits from those flippable by other RH variants.
    (2) is that it is not affected at all by the DDR4 TRR
    mitigation attempt.

    I take away something completely different: there are multiple
    interacting error mechanisms at work here. RowHammer and RowPress are
    likely completely different physics and fixing one won't fix the
    other.


    Different, like coupling in different frequency bands - yes.
    But both are caused by insufficient isolation.

    It also suggests there may be other similar mechanisms waiting to be
    found.

    The third, less important factor is that RowPress appears quite
    robust to differences between major manufacturers.

    However, one should not overlook that efficiency of RowPress attacks
    when measured by the most important criterion of BER per duration of
    attack is many times lower than earlier techniques of double-sided
    and multi-sided hammering.

    For me the BER is irrelevant if it is above 0.0.
    I want to know where the errors start which is ACmin.


    So, call it time to first flip. The principle is the same.
    Still, MSRH causes harm faster than RP.

  • From EricP@21:1/5 to Michael S on Thu Feb 15 18:27:28 2024
    Michael S wrote:
    On Wed, 14 Feb 2024 10:51:47 -0500
    EricP <[email protected]> wrote:
    Michael S wrote:
    On Tue, 13 Feb 2024 11:24:10 -0500
    EricP <[email protected]> wrote:

    "We observe that as tAggON increases, compared to the most
    effective RowHammer pattern, the most effective Row-Press pattern
    reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
    as large as the refresh interval (7.8 μs),
    2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
    the maximum allowed tAggON, and
    3) down to only one activation for an extreme tAggON of 30 ms
    (highlighted by dashed red boxes)."

    Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
    increases."

    ACmin by itself is a wrong measure of efficiency of attack.
    I'm not interested in the efficiency of the attack.
    ACmin, the minimum absolute count of opens above which we lose data
    is the number I'm interested in.

    You may be interested, but I don't understand why.
    For me, the important thing is how much time it takes until the
    probability of a flip becomes significant.

    Because in terms of designing a memory controller
    *any* bit flip due to RH or RP is unacceptable.
    After RH/RP starts to inject errors, the rate at which it does so
    doesn't matter because the memory bank is corrupt.

    Suppose, attack (A) hammers at 5 MHz and has ACmin=5e4. Attack (B)
    hammers at 0.13 MHz (typical for RP in real-world setup) and has
    ACmin=3e3.
    Then I'd say that attack (A) is 2.3 times more dangerous.

    That can tell you that dram A is more susceptible to a RH attack than B.

    But what matters to a DRAM controller is whether ACmin opens can be
    reached inside the refresh interval of 64 ms. After that minimum is
    reached, how fast it corrupts memory in flips/sec is irrelevant since
    the number of corrupted bits is more than are correctable by SECDED.
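
    That criterion can be sketched as a simple threshold test. This is an
    illustration only: it assumes a constant hammering rate over the whole
    window, reuses the example numbers for attacks (A) and (B), and the
    function name is made up.

```python
REFRESH_WINDOW_S = 64e-3  # refresh interval quoted above

def reachable_in_window(hammer_rate_hz, ac_min, window_s=REFRESH_WINDOW_S):
    """True if ac_min activations fit inside one refresh window."""
    return hammer_rate_hz * window_s >= ac_min

print(reachable_in_window(5e6, 5e4))      # attack (A)
print(reachable_in_window(0.13e6, 3e3))   # attack (B)
```

    Under these assumptions both example attacks cross their ACmin inside
    one 64 ms window, so from the controller's point of view both are
    already failures, whichever one gets there first.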

    Back in the real world, researchers demonstrated that multi-sided
    hammering can have an ACmin that is significantly lower than our imaginary
    attack (A), so the only remaining question is how fast we can hammer
    without triggering TRR. My 5 MHz number is probably hard to achieve for
    an attacker, but 2-3 MHz sounds doable.

    The RowHammer fix, Target Row Refresh (TRR), is triggered when the
    Maximum Activate Count (MAC) is reached within the Maximum Activate
    Window time (tMAW).
    It doesn't matter how opens are distributed in time within tMAW.
    It looks like tMAW is the chip refresh interval of 64 ms.
    When MAC is reached TRR must immediately refresh the two adjacent rows.
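
    The behaviour described above can be modelled with a per-row counter.
    This is a toy sketch, not how any real controller or DRAM implements
    TRR: the class, its names and the tiny MAC value are all made up for
    illustration, and it counts activations only, ignoring how long a row
    stays open.

```python
from collections import defaultdict

class TrrModel:
    """Toy model: count activations per row within one tMAW window."""

    def __init__(self, mac: int):
        self.mac = mac                   # Maximum Activate Count
        self.counts = defaultdict(int)   # activations per row this window
        self.refreshed = []              # neighbour rows refreshed so far

    def activate(self, row: int) -> None:
        self.counts[row] += 1
        if self.counts[row] >= self.mac:
            # Refresh the two adjacent rows, then restart the count.
            self.refreshed += [row - 1, row + 1]
            self.counts[row] = 0

    def new_window(self) -> None:
        """Call when tMAW (assumed to be 64 ms here) elapses."""
        self.counts.clear()

trr = TrrModel(mac=4)   # unrealistically small MAC, for the demo only
for _ in range(4):
    trr.activate(100)
print(trr.refreshed)
```

    The timing of the four activations inside the window plays no part in
    the model, which mirrors the point that only the count within tMAW
    matters.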

    The problem with TRR is that the controller (presumably) reads the
    MAC and tMAW values from the DRAM configuration registers.
    However RowPress shows that holding a row open greatly lowers the MAC
    trigger level, bypassing the TRR fix.

    Also as Blaster shows, TRR refreshing the two adjacent rows is not enough.
    It would need to refresh +-4 rows, and that would also further divide
    the MAC trigger limit by 8.

    The right measure is reciprocal of the total duration of attack.
    At any given duty cycle reciprocal of the total duration of attack
    grows with increased rate of interruptions (a.k.a. hammering rate).
    The general trend is the same as for all other RH variants, the only
    difference being that the dependency on hammering rate is somewhat weaker.

    Relatively weak influence of duty cycle itself is shown in figure
    22.
    Looking at figure 22 on the arxiv version of the paper,
    this is a completely different test. This test was to explain the
    discrepancy between the RowPress results and the earlier cited papers.

    BER is the fraction of DRAM cells in a DRAM row that experience
    bitflips. It's a different measure because RowPress detects when ANY
    data loss begins, not the fraction of lost data bits (efficiency)
    after it kicks in.

    Obsv 16 explains it: the BER for the bottom two lines,
    which are the ones with a long total tA2A, goes up in all graphs
    by a factor of between 10 and about 500, which is the RowPress effect.

    To my eye what this test shows is the PRE phase may *heal* some of the
    damaging effects that the ACT phase causes, but only to a certain
    point. Possibly the PRE phase scavenges the ACT hot injection
    carriers.

    The practical significance of RowPress is due to two factors.
    (1) is the one you mentioned above - it can flip
    different bits from those flippable by other RH variants.
    (2) is that it is not affected at all by the DDR4 TRR
    mitigation attempt.
    I take away something completely different: there are multiple
    interacting error mechanisms at work here. RowHammer and RowPress are
    likely completely different physics and fixing one won't fix the
    other.


    Different, like coupling in different frequency bands - yes.
    But both are caused by insufficient isolation.

    I'm just guessing, based on the reports that different bits are affected
    for RH and RP, and that RH flips 0's to 1's while RP flips 1's to 0's.
    But I don't think they have had time to look at the details for RP yet.

    It also suggests there may be other similar mechanisms waiting to be
    found.

    The third, less important factor is that RowPress appears quite
    robust to differences between major manufacturers.

    However, one should not overlook that efficiency of RowPress attacks
    when measured by the most important criterion of BER per duration of
    attack is many times lower than earlier techniques of double-sided
    and multi-sided hammering.
    For me the BER is irrelevant if it is above 0.0.
    I want to know where the errors start which is ACmin.


    So, call it time to first flip. The principle is the same.
    Still, MSRH causes harm faster than RP.
