I've knocked up a little utility program to try to work out some
performance figures for my CPU.
It's an AMD Ryzen™ 5 3400G. It says on the spec it has:
4MB L3 cache
2MB L2 cache
384kb L1 cache
What I do is to xor a location in memory in an array many times.
The size of the area I xor over is set by a mask on the store index.
The words in the store are 64 bit.
A C++ fragment is this. I can post the whole thing if it would help.
// Calculate a bit mask for the entire store
Word mask = storeWordCount - 1;
Stopwatch s;
s.start();
while (1) // until break when mask runs out
{
for (size_t index = 0; index < storeWordCount; ++index)
{
// read and write a word in store.
Raw[index & mask] ^= index;
}
s.lap(mask); // records the current time
if (mask == 0) break; // Stop if we've run out of mask
mask >>= 1; // shrink the mask
}
As you can see it starts with a large mask (in fact for a whole GB)
and halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large
mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that as
the mask gets smaller. No apparent effect when it gets under the L1
cache size.
But...
When the mask is very small (3) it slows to 18GB/s. With 1 it halves
again, and with zero (so it only operates on the same word over and
over) it's half again. A fifth of the size with a large block.
Something odd is happening here when I hammer the same location (32
bytes and on down) so that it's slower. Yet this ought to be in the
L1 data cache.
A late thought was to replace that ^= index with something that reads
the memory only, or that writes it only, instead of doing a
read-modify-write cycle. That gives me much faster performance with
writes than reads. And neither read only, nor write only, show this
odd slow down with small masks.
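For anyone wanting to reproduce the comparison, here is a minimal sketch of the three loop bodies described above. It is not the original program: Word is assumed to be a 64-bit unsigned integer, the store is a std::vector, and the function names are made up for illustration. Timing is left out. One thing the sketch makes visible is that with mask 0 the xor loop reads the word that the previous iteration has just written, so each iteration depends on the one before it through memory, which neither of the other two loops does.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

using Word = std::uint64_t;  // assumption: the post only says the words are 64 bit

// Read-modify-write pass, same body as the fragment in the post.
void xorPass(std::vector<Word>& raw, Word mask, std::size_t count) {
    for (std::size_t index = 0; index < count; ++index)
        raw[index & mask] ^= index;
}

// Write-only pass: each store is independent of the old contents.
void writePass(std::vector<Word>& raw, Word mask, std::size_t count) {
    for (std::size_t index = 0; index < count; ++index)
        raw[index & mask] = index;
}

// Read-only pass: the loads are folded into a result so the compiler
// cannot simply drop them.
Word readPass(const std::vector<Word>& raw, Word mask, std::size_t count) {
    Word acc = 0;
    for (std::size_t index = 0; index < count; ++index)
        acc ^= raw[index & mask];
    return acc;
}
```

Each function would be called once per mask value inside the timing loop, with count set to storeWordCount.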
What am I missing?
Thanks
Andy
Vir Campestris wrote:
As you can see it starts with a large mask (in fact for a whole GB)
and halves it as it goes around.
All looks fine at first. I get about 8GB per second with a large
mask, at 4MB it goes up to 15GB/s, at 8MB up to 23. It holds that
as the mask gets smaller. No apparent effect when it gets under the
L1 cache size.
The execution window is apparently able to absorb the latency of an L3
miss and to stream L3->L1 accesses.
Anton answered the question regarding small masks.
Mitch, Anton and Michael have already answered, I just want to add
that we have one additional potential factor:
Rowhammer protection:
It is possible that the pattern of re-XORing the same or a small
number of locations over and over could trigger a pattern detector
which was designed to mitigate against Rowhammer.
OTOH, this would much more easily be handled with memory range based coalescing of write operations in the last level cache, right?
I.e. for normal (write combining) memory, it would (afaik) be legal
to delay the actual writes to RAM for a significant time, long enough
to merge multiple memory writes.
Terje
On Wed, 31 Jan 2024 07:59:41 +0100
Terje Mathisen <[email protected]> wrote:
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. This is what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction non-privileged
and pretty much unrestricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH has existed since the very early 2000s, it is
understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Michael S <[email protected]> writes:
I have very little to add to the very good response by Anton.
That little addition is: most if not all Rowhammer POC examples rely
on CLFLUSH. This is what the manual says about it:
"Executions of the CLFLUSH instruction are ordered with respect to each
other and with respect to writes, locked read-modify-write
instructions, fence instructions, and executions of CLFLUSHOPT to the
same cache line. They are not ordered with respect to executions of
CLFLUSHOPT to different cache lines."
By now, it seems obvious that making the CLFLUSH instruction non-privileged
and pretty much unrestricted by memory range/page attributes was a
mistake, but that mistake can't be fixed without breaking things.
Considering that CLFLUSH has existed since the very early 2000s, it is
understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol. This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
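To make the associativity argument concrete, here is a small sketch that builds ways + 1 distinct line addresses falling into the same set. The geometry struct, the function names and the plain modulo indexing are assumptions for illustration only. Real last-level caches often hash the index bits, so finding such a set in practice takes more work than this.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical cache geometry; the numbers are supplied by the caller.
struct CacheGeometry {
    std::uint64_t lineBytes;  // bytes per cache line, e.g. 64
    std::uint64_t sets;       // number of sets
    std::uint64_t ways;       // associativity, e.g. 16
};

// Set index under a plain modulo mapping (an assumption, see above).
std::uint64_t setIndex(const CacheGeometry& g, std::uint64_t addr) {
    return (addr / g.lineBytes) % g.sets;
}

// Returns ways + 1 distinct line addresses that all map to the set of `base`.
// A set with `ways` entries cannot hold all of them at once, so cycling
// through them must keep missing in that cache level.
std::vector<std::uint64_t> conflictSet(const CacheGeometry& g, std::uint64_t base) {
    const std::uint64_t stride = g.lineBytes * g.sets;  // distance between same-set lines
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t i = 0; i <= g.ways; ++i)
        addrs.push_back(base + i * stride);
    return addrs;
}
```

For a 16-way cache this yields the 17 lines mentioned above.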
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go. With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
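A toy model of the counting idea, to show roughly what the bookkeeping amounts to. This is only a sketch and not what any real controller or DRAM does. The class name, the threshold, and above all the assumption that row r sits physically between rows r-1 and r+1 are made up for illustration. It counts activations per aggressor row and reports the neighbours for an early refresh, which is one simple way of approximating "how often adjacent rows are accessed".

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Toy model: count activations per row and, once a threshold is reached,
// report the neighbouring rows that should get an extra refresh.
class ActivationTracker {
public:
    ActivationTracker(std::uint32_t threshold, std::uint32_t rowCount)
        : threshold_(threshold), rowCount_(rowCount) {}

    // Records one activation of `row`; returns the rows to refresh (usually none).
    std::vector<std::uint32_t> activate(std::uint32_t row) {
        std::vector<std::uint32_t> victims;
        if (++counts_[row] >= threshold_) {
            counts_[row] = 0;
            // Assumption: logical neighbours are physical neighbours.
            if (row > 0) victims.push_back(row - 1);
            if (row + 1 < rowCount_) victims.push_back(row + 1);
        }
        return victims;
    }

    // To be called at the regular refresh interval, when every row is
    // refreshed anyway and the counts can start again from zero.
    void reset() { counts_.clear(); }

private:
    std::uint32_t threshold_;
    std::uint32_t rowCount_;
    std::unordered_map<std::uint32_t, std::uint32_t> counts_;
};
```

Covering rows further away would mean returning a wider range of neighbours.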
- anton
On Wed, 31 Jan 2024 17:17:21 GMT
[email protected] (Anton Ertl) wrote:
Ideally caches are fully transparent microarchitecture, then you don't
need stuff like CLFLUSH. My guess is that CLFLUSH is there for
getting DRAM up-to-date for DMA from I/O devices. An alternative
would be to let the memory controller remember which lines are
modified, and if the I/O device asks for that line, get the up-to-date
data from the cache line using the cache-consistency protocol.
Considering that CLFLUSH was introduced by Intel in 2000 or 2001,
and that by then all of Intel's PCI/AGP root hubs had already been fully
I/O-coherent for several years, I find your theory unlikely.
Myself, I don't know the original reason, but I do know a use case
where CLFLUSH, while not strictly necessary, simplifies things greatly:
entering a deep sleep state in which the CPU caches are powered down and
DRAM is put in self-refresh mode.
Of course, this particular use case does not require *non-privileged*
CLFLUSH, so obviously Intel had a different reason.
This
would turn CLFLUSH into a noop (at least as far as writing to DRAM is
concerned, the ordering constraints may still be relevant), so there
is a way to fix this mistake (if it is one).
However, AFAIK this is insufficient for fixing Rowhammer. Caches have
relatively limited associativity, up to something like 16-way
set-associativity, so if you write to the same set 17 times, you are
guaranteed to miss the cache. With 3 levels of cache you may need 49
accesses (probably less), but I expect that the resulting DRAM
accesses to a cache line are still not rare enough that Rowhammer
cannot happen.
The original RH required a very high hammering rate that certainly can't be
achieved by playing with the associativity of the L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
Today we have yet another variant called RowPress that bypasses TRR
mitigation more reliably than multi-rate RH. I think this one would be
practically impossible without CLFLUSH, esp. when the system under attack
carries out other DRAM accesses in parallel with the attacker's code.
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory controller
does not even know which rows are physically adjacent. POC authors
typically run lengthy tests in order to figure it out.
With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
- anton
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a demonstration of good intentions.
Michael S <[email protected]> wrote:
By now, it seems obvious that making the CLFLUSH instruction
non-privileged and pretty much unrestricted by memory range/page
attributes was a mistake, but that mistake can't be fixed without
breaking things. Considering that CLFLUSH has existed since the very
early 2000s, it is understandable.
IIRC, ARMv8 made the same mistake a decade later. That is less
understandable.
For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence,
when using non-volatile memory you need an efficient way to flush
dcache to the point of persistence. You need that in order to make
sure that a transaction has been written to a log.
With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent.
Which is a good idea, but not everyone does it.
Andrew.
Anton Ertl wrote:
Ideally caches are fully transparent microarchitecture, then you
don't need stuff like CLFLUSH. My guess is that CLFLUSH is there
for getting DRAM up-to-date for DMA from I/O devices. An
alternative would be to let the memory controller remember which
lines are modified, and if the I/O device asks for that line, get
the up-to-date data from the cache line using the cache-consistency
protocol. This would turn CLFLUSH into a noop (at least as far as
writing to DRAM is concerned, the ordering constraints may still be
relevant), so there is a way to fix this mistake (if it is one).
The text in the Intel Vol 1 Architecture manual indicates they viewed all
these cache control instructions (PREFETCH, CLFLUSH, and CLFLUSHOPT)
as part of SSE, for use by graphics applications that want to take
manual control of their caching and minimize cache pollution.
Note that the non-temporal move instructions MOVNTxx were also part of
that SSE bunch and could also be used to force a write to DRAM.
On Thu, 01 Feb 2024 09:39:13 +0000
[email protected]d wrote:
For Arm, with its non-coherent data and instruction caches, you need
some way to flush dcache to the point of unification in order to make
instruction changes visible. Also, regardless of icache coherence,
when using non-volatile memory you need an efficient way to flush
dcache to the point of persistence. You need that in order to make
sure that a transaction has been written to a log.
With the latter, you could restrict dcache flushes to pages with a
particular non-volatile attribute. I don't think there's anything you
can do about the former, short of simply making i- and d-cache
coherent.
For the latter, a privileged flush instruction sounds sufficient.
For the former, ARMv8 appears to have a special instruction (or you can
call it a special variant of the DC instruction): Clean by virtual address
to point of unification (DC CVAU). This instruction alone would not
make an RH attack much easier. The problem is that whether this
instruction is privileged is controlled by the same bit as for two much
more dangerous variations of DC (DC CVAC and DC CIVAC).
Anton Ertl wrote:
Rowhammer happens when you beat on the same cache line multiple times,
causing a charge-sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
Michael S wrote:
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily share paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used logical to
physical mapping shows that while X and X+1 can (occasionally) be side
by side, X, X+1 and X+2 should never be 3 word lines in a row.
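As an illustration of how far apart logically adjacent rows can land, here is a sketch that uses the plain binary-reflected Gray code as a stand-in for the decoder mapping. That choice is an assumption: the post only says "Gray-like", real parts use their own (usually undocumented) mappings, and the function names are made up.

```cpp
#include <cstdint>

// Binary-reflected Gray code, used here as a hypothetical stand-in for the
// logical-row to physical-word-line mapping.
std::uint32_t grayRow(std::uint32_t logicalRow) {
    return logicalRow ^ (logicalRow >> 1);
}

// Largest physical distance between logical rows X and X+1 within a block
// of 2^bits rows (bits must be below 32).
std::uint32_t maxNeighbourDistance(unsigned bits) {
    const std::uint32_t rows = 1u << bits;
    std::uint32_t worst = 0;
    for (std::uint32_t x = 0; x + 1 < rows; ++x) {
        const std::uint32_t a = grayRow(x);
        const std::uint32_t b = grayRow(x + 1);
        const std::uint32_t d = a > b ? a - b : b - a;
        if (d > worst) worst = d;
    }
    return worst;
}
```

With 3 address bits, logical rows 3 and 4 land on physical rows 2 and 6, four rows apart, which is half of the 8-row block.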
MitchAlsup wrote:
Anton Ertl wrote:
Rowhammer happens when you beat on the same cache line multiple times,
causing a charge-sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
Also it may be data dependent - 0's bleed into adjacent 1's and 1's into 0's.
And the threshold when it triggers has been changing as DRAMs become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.
So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.
EricP <[email protected]> wrote:
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
That would look... interesting.
How are large OR gates actually constructed? I would assume that an
eight-input OR gate could look something like
nand(nor(a,b),nor(c,d),nor(e,f),nor(g,h))
which would reduce the number of inputs by a factor of 2^3, so
seven layers of these OR gates would be needed.
Wiring would be interesting as well...
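A quick software check of both claims, as a sketch only (the function names are made up, and this says nothing about how such a gate would really be laid out). It confirms that the NAND of four NORs is an eight-input OR, and counts the layers needed to get from 2^21 inputs (the 2 million rows) down to one signal.

```cpp
#include <cstdint>

bool nor2(bool a, bool b) { return !(a || b); }
bool nand4(bool a, bool b, bool c, bool d) { return !(a && b && c && d); }

// Eight-input OR built as a NAND of four NORs; bit i of `in` is input i.
bool or8(std::uint8_t in) {
    auto bit = [in](int i) { return ((in >> i) & 1) != 0; };
    return nand4(nor2(bit(0), bit(1)), nor2(bit(2), bit(3)),
                 nor2(bit(4), bit(5)), nor2(bit(6), bit(7)));
}

// Number of layers of fanIn-input gates needed to reduce `inputs` signals
// to a single output (fanIn must be at least 2).
unsigned orTreeLayers(std::uint64_t inputs, std::uint64_t fanIn) {
    unsigned layers = 0;
    while (inputs > 1) {
        inputs = (inputs + fanIn - 1) / fanIn;  // ceiling division
        ++layers;
    }
    return layers;
}
```

Calling orTreeLayers with 2^21 inputs and a fan-in of 8 gives 7, since each layer removes three address bits' worth of inputs and 21 / 3 = 7.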
EricP wrote:
MitchAlsup wrote:
Anton Ertl wrote:
Rowhammer happens when you beat on the same cache line multiple times,
causing a charge-sharing problem on the word lines. Every time you cause
the DRAM to precharge (deActivate) you lose the count on how many times
you have to bang on the same word line to disrupt the stored cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
My understanding is that RowHammer has to access the same row multiple
times to disrupt bits in an adjacent row. This sounds like a charge-sharing
problem.
A long time ago we found a problem with one manufacturer's SRAM: when the
same row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired, so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down through NorthBridge, SouthBridge, onto another bus,
finally out to the device and back, while the CPU was continuing to read
the ICache every cycle.
My limited understanding of RowPress is that you should not keep the row
open for more than a page of data transfer (about ¼ of the 7.8µs DDR4
limit). My bet is that this is a leakage issue on the bit line made
sensitive by the word line.
Also it may be data dependent - 0's bleed into adjacent 1's and 1's
into 0's.
DRAMs are funny like this. Adjacent bit lines store data differently. Even
cells store 0 as 0 and 1 as 1, while odd cells store 0 as 1 and 1 as 0. They
do this so the sense amplifier has a differential to sense: either the even
cell or the odd cell is asserted on the bit line pair, and the sense amp
then has a differential to sense. One line goes up a little or down a
little while the other bit line stays where it is.
And the threshold when it triggers has been changing as drams become more
dense. In 2014 when this was first encountered it took 139K activations.
By 2020 that was down to 4.8K.
So figuring out how much a row has been damaged is complicated,
and the window for detecting it is getting smaller.
MitchAlsup wrote:
A long time ago we found a problem with one manufacturer's SRAM: when the
same row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired, so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down through NorthBridge, SouthBridge, onto another bus,
finally out to the device and back, while the CPU was continuing to read
the ICache every cycle.
I think of this as aging: each activation ages the rows up to some distance by amounts depending on the distance due to charge migration.
Originally it was found by activating rows immediately adjacent to the
victim but then they looked and found it further out to +-4 rows.
This effect appears to be called the Rowhammer "blast radius".
This paper is from 2023 but I'm sure I've seen mention of this effect
before but not called blast radius.
BLASTER: Characterizing the Blast Radius of Rowhammer, 2023 https://www.research-collection.ethz.ch/handle/20.500.11850/617284 https://dramsec.ethz.ch/papers/blaster.pdf
"In particular, we show for the first time that BLASTER significantly
reduces the number of necessary activations to the victim-adjacent
aggressors using other aggressor rows that are up to four rows away
from the victim."
MitchAlsup wrote:
Michael S wrote:
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily share paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used
logical to physical mapping shows that while X and X+1 can
(occasionally) be side by side, X, X+1 and X+2 should never be 3 word
lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
I was wondering if each row could have "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
On Wed, 31 Jan 2024 17:17:21 GMT
[email protected] (Anton Ertl) wrote:
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory
controller does not even know which rows are physically adjacent. POC
authors typically run lengthy tests in order to figure it out.
...With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
EricP wrote:
MitchAlsup wrote:
Anton Ertl wrote:
Rowhammer happens when you beat on the same cache line multiple times,
causing a charge sharing problem on the word lines. Every time you
cause the DRAM to precharge (deActivate) you lose the count on how many
times you have to bang on the same word line to disrupt the stored
cells.
So, the trick is to detect the RowHammering and insert refresh commands.
It's not just the immediately physically adjacent rows -
I think I read that the effect falls off for up to +-3 rows away.
My understanding is that RowHammer has to access the same row multiple
times
to disrupt bits in an adjacent row. This sounds like a charge sharing problem.
A long time ago we found a problem with one manufacturer's SRAM: when
the same row was hit >6,000 times, there was enough charge sharing that the
adjacent dynamic word decoder also fired so we had 2 or 3 word lines
active at the same time. We encountered this when a LD missed the cache
and was sent down
through NorthBridge, SouthBridge, onto another bus, finally out to the
device
and back, while the CPU was continuing to read the ICache every cycle.
My limited understanding of RowPress is that you should not keep the Row
open
for more than a page of data transfer (about ¼ of 7.8µs DDR4 limit). My
bet is
that this is a leakage issue on the bit line made sensitive by the word
line.
Michael S <[email protected]> writes:
On Wed, 31 Jan 2024 17:17:21 GMT
[email protected] (Anton Ertl) wrote:
The first paper on Rowhammer already outlined how the memory
controller could count how often adjacent DRAM rows are accessed and
thus weaken the row under consideration. This approach needs a little
adjustment for Double Rowhammer and not immediately neighbouring rows,
but otherwise seems to me to be the way to go.
IMHO, all these solutions are pure fantasy, because the memory
controller does not even know which rows are physically adjacent. POC
authors typically run lengthy tests in order to figure it out.
Given that the attackers can find out, it is just a lack of
communication between DRAM manufacturers and memory controller
manufacturers that result in that ignorance. Not a valid excuse.
There is a standardization committee (JEDEC) that documents how
various DRAM types are accessed, refreshed etc. They put information
about that (and about RAM overclocking (XMP, Expo)) in the SPD ROMs of
the DIMMs, so they can also put information about line adjacency
there.
....With autorefresh in
the DRAM devices these days, the DRAM manufacturers could implement
this on their own, without needing to coordinate with memory
controller designers. But apparently they think that the customers
don't care, so they can save the expense.
They cared enough to implement the simplest of proposed solutions - TRR.
Yes, it was quickly found insufficient, but at least there was a
demonstration of good intentions.
Yes. However, looking at Table III of <https://comsec.ethz.ch/wp-content/files/blacksmith_sp22.pdf>, there
seems to be significant differences between manufacturers A and D on
one hand, and B and C on the other, with exploits taking much longer
for B and C, and failing in some cases.
One may wonder if the DRAM manufacturers could have put their
physicists to the task of identifying the conditions under which bit
flips can occur, and identify the refreshes that are at least
necessary to prevent these conditions from occurring. If they have not
done so, or if they have not implemented the resulting recommendations
(or passed them to the memory controller people), a certain amount of
blame rests on them.
Anyway, never mind the blame, looking into the future, I find it
worrying that I did not find any mention of Rowhammer protection in
the specs of DIMMs when I last looked.
- anton
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following difficulties:
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.
- anton
Anton Ertl wrote:
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
difficulties:
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
You know what its value should be and you raise hell when it is not as
expected.
[email protected] (MitchAlsup) writes:
Anton Ertl wrote:
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following
difficulties:
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
You know what its value should be and you raise hell when it is not as
expected.
So that is about detecting Rowhammer after the fact. Yes, you could
do that when the row is refreshed. The only problem is that by then
the attacker could have extracted the secret(s) with the
Rowhammer-based attack. Better than nothing, but still not a very
attractive approach.
I prefer a solution that detects that a row might suffer a bit flip
after several more accesses, and refreshes the row before that happens.
And I don't think that this can be implemented with an analog canary
that works like a DRAM cell; but I am not a solid-state physicist,
maybe there is a way.
- anton
EricP <[email protected]> writes:
MitchAlsup wrote:
Michael S wrote:
Original RH required very high hammering rate that certainly can't be
achieved by playing with associativity of L3 cache.
Newer multiside hammering probably can do it in theory, but it would be
very difficult in practice.
The problem here is the fact that DRAMs do not use linear decoders, so
address X and address X+1 do not necessarily share paired word lines.
The word lines could be as far as ½ the block away from each other.
The DRAM decoders are faster and smaller when there is a Gray-like code
imposed on the logical-address to physical-word-line mapping. This also
happens in SRAM decoders. Going back and looking at the most used
logical to physical mapping shows that while X and X+1 can
(occasionally) be side by side, X, X+1 and X+2 should never be 3 word
lines in a row.
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and then
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
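The row count and the 0.2% figure can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it counts counter storage bits only, not any increment or compare logic.

```cpp
#include <cassert>
#include <cstdint>

// Number of rows in a DRAM of `capacity_bits` organised in rows of `row_bits`.
std::uint64_t row_count(std::uint64_t capacity_bits, std::uint64_t row_bits) {
    assert(row_bits > 0);
    return capacity_bits / row_bits;
}

// Storage overhead (as a fraction) of keeping `counter_bits` per row.
// Assumption: only the counter's storage bits are counted.
double counter_overhead(unsigned counter_bits, std::uint64_t row_bits) {
    assert(row_bits > 0);
    return static_cast<double>(counter_bits) / static_cast<double>(row_bits);
}

// Figures from the post: a 16 Gb device with 8 Kb rows.
const std::uint64_t kCapacityBits = 16ULL << 30;  // 16 Gb = 2^34 bits
const std::uint64_t kRowBits = 8ULL << 10;        // 8 Kb  = 2^13 bits
```

With these values `row_count(kCapacityBits, kRowBits)` is 2^21 and `counter_overhead(16, kRowBits)` is 16/8192, a little under 0.2%.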
Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.
This would take much less memory, but require finding the entry with
the lowest count. By dividing the cache into sets, this becomes more realistic; upon reaching a limit, only the rows in the blast radius of
the lines in a set need to be refreshed.
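A minimal software sketch of the counter cache described above, assuming a fully associative table searched linearly for the lowest count. The class and method names are hypothetical; real hardware would need the set-partitioned structure mentioned above rather than a hash map.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Tracks activation counts for at most `capacity` recently accessed rows.
class RowCountCache {
public:
    RowCountCache(std::size_t capacity, std::uint32_t limit)
        : capacity_(capacity), limit_(limit) {
        assert(capacity_ > 0);
    }

    // Record one activation of `row`.
    // Returns true when its count has reached the limit.
    bool activate(std::uint32_t row) {
        auto it = counts_.find(row);
        if (it == counts_.end()) {
            std::uint32_t inherited = 0;  // empty slot: start from zero
            if (counts_.size() >= capacity_) {
                // Evict the entry with the lowest count C ...
                auto victim = counts_.begin();
                for (auto c = counts_.begin(); c != counts_.end(); ++c) {
                    if (c->second < victim->second) victim = c;
                }
                inherited = victim->second;
                counts_.erase(victim);
            }
            // ... so the new row ends up with C + 1 after the increment.
            it = counts_.emplace(row, inherited).first;
        }
        ++it->second;
        return it->second >= limit_;
    }

    // A refreshed row no longer needs tracking.
    void on_refresh(std::uint32_t row) { counts_.erase(row); }
    void on_refresh_all() { counts_.clear(); }

    // Current count for `row`, or 0 if it is not tracked.
    std::uint32_t count(std::uint32_t row) const {
        auto it = counts_.find(row);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::size_t capacity_;
    std::uint32_t limit_;
    std::unordered_map<std::uint32_t, std::uint32_t> counts_;
};
```

Starting a new entry at C+1 means an untracked row is never credited with fewer activations than it may have had, at the cost of overestimating.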
I was wondering if each row could have "canary" bit,
a specially weakened bit that always flips early.
This would also intrinsically handle the cases of effects
falling off over the +-3 adjacent rows.
Then a giant 2 million input OR gate would tell us if any row's
canary had flipped.
Yes, doing it in analog has its charms. However, I see the following difficulties:
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
* To flip a bit in one direction, AFAIK the hammering rows have to
have a specific content. I guess with a blast radius of 4 rows on
each side, you could have 4 columns. Each row has a canary in one
of these columns and the three adjacent bits in this column are
attacker bits that have the value that is useful for effecting a bit
flip in a canary. Probably a more refined variant of this idea
would be necessary to deal with diagonal influence and
the non-uniform encoding of 0 and 1 in the DRAMs discussed somewhere
in this thread.
- anton
Anton Ertl wrote:
EricP <[email protected]> writes:
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word.
My guess is that you are off by 20× (should be close to 4%)
Anton Ertl wrote:
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and then
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.
And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.
Alternatively, a kind of cache could be used. Keep counts of N most
recently accessed rows, remove the row on refresh; when accessing a
row that has not been in the cache, evict the entry for the row with
the lowest count C, and set the count of the loaded row to C+1. When
a count (or ensemble of counts) reaches the limit, refresh every row.
That would be a CAM or assoc sram and would have to hold a large
number of entries. This would have to be in the memory controller.
* How do you measure whether a bit has flipped without refreshing it
and thus resetting the canary?
The canary would have to be a little more complicated than a standard
storage cell because it has to compare the cell to the expected value.
Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.
Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
double data rated, and each step added address bits to RAS and CAS.
I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently:
Bank-Row-Column or block-bank-row-column.
In addition each building block will be internally self timed, no
external need to refresh the bank-row, and the only non access
command in the arsenal is power-down and power-up.
[email protected] (MitchAlsup) writes:
Sooner or later, designers will have to come to the realization that
an external DRAM controller can never guarantee everything every DRAM
actually needs to retain data under all conditions, and the DRAMs
are going to have to change the interface such that requests flow
in and results flow out based on the DRAM internal controller--much
like that of a SATA disk drive.
Let us face it, the DDR-6 interface model is based on the 16K-bit
DRAM chips from about 1979: RAS and CAS. It got sped up, pipelined,
double data rated, and each step added address bits to RAS and CAS.
I don't know about DDR6, but the DDR5 command interface is
significantly more complex <https://en.wikipedia.org/wiki/DDR5#Command_encoding> than early
asynchronous DRAM.
I suspect when this happens, the DRAMs will partition the inbound
address into 3 or 4 sections, and use each section independently:
Bank-Row-Column or block-bank-row-column.
Looking at the commands from the link above, Activate already
transfers the row in two pieces, and the read and write are also
transferred in two pieces.
In addition each building block will be internally self timed, no
external need to refresh the bank-row, and the only non access
command in the arsenal is power-down and power-up.
Self-refresh is already there, but AFAIK only used when processing is suspended.
However, there are many commands, many more than in the 16kx1 DRAMs of
old. What would make them go in the direction of simplifying the
interface?
The hardest part these days seems to be getting the high
transfer rates to work, the rest of the interface is probably
comparatively easy.
- anton
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
EricP <[email protected]> writes:
A 16 Gb dram with 8kb rows has 2^21 = 2 million rows.
So having a counter for each row is impractical.
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
You are comparing a 16-bit incrementor and its associated flip-flop
with a single transistor divided by the number of them in a word.
I was thinking about counting each access only when the cache line is accessed. Then there needs to be only one incrementor per bank, and
the counter can be stored in DRAM like the payload data.
But thinking about it again, I wonder how counters would be reset.
Maybe, when the counter reaches the limit, all lines in its blast
radius are refreshed, and the counter of the present line is reset to
0.
Another disadvantage would be that we have to make decisions about
possible rowhammering only based on one counter, and have to trigger refreshes of all lines in the blast radius based on worst-case
scenarios (i.e., assuming that other rows in the blast radius have any
count up to the limit).
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
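The per-row counter scheme can be modelled in a few lines. This is only a sketch under stated assumptions: the struct and field names are invented, the counter lives in an ordinary array rather than in the DRAM row itself, and refreshing a neighbour does not touch that neighbour's own counter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model of one bank with an activation counter per row.
struct CountedBank {
    std::vector<std::uint16_t> count;   // per-row activation counter
    std::uint16_t limit;                // activations before acting
    std::size_t radius;                 // blast radius in rows on each side
    std::uint64_t extra_refreshes = 0;  // refreshes caused by the counters

    CountedBank(std::size_t rows, std::uint16_t limit_, std::size_t radius_)
        : count(rows, 0), limit(limit_), radius(radius_) {
        assert(rows > 0 && limit > 0);
    }

    // Activate `row`; on reaching the limit, refresh every row within the
    // blast radius and reset this row's counter to 0.
    void activate(std::size_t row) {
        assert(row < count.size());
        if (++count[row] < limit) return;
        const std::size_t lo = row >= radius ? row - radius : 0;
        const std::size_t hi = std::min(row + radius, count.size() - 1);
        extra_refreshes += hi - lo;  // rows lo..hi, excluding `row` itself
        count[row] = 0;
    }
};
```

Because each decision uses only one counter, the limit has to be set for the worst case, which is where the surplus refreshes come from.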
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Let's see how bad this is.
A single-line threshold of 4800 activations spread over the 8 rows of the
blast radius (+-4) gives a trigger count of 4800/8 = 600. Each trigger
causes an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms, resetting all the counters,
so the counts are not cumulative.
That overhead is only going to grow as dram density increases.
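The same estimate as code, in case anyone wants to plug in other thresholds. The function names are illustrative only, and the model is the worst case from the post: all 8 neighbours refreshed on every trigger.

```cpp
#include <cassert>

// Activations of one row allowed before its neighbours must be refreshed,
// assuming the flip threshold is shared across `neighbours` rows.
unsigned trigger_count(unsigned flip_threshold, unsigned neighbours) {
    assert(neighbours > 0);
    return flip_threshold / neighbours;
}

// Extra row refreshes per activation, as a fraction, when every trigger
// refreshes `neighbours` rows.
double refresh_overhead(unsigned neighbours, unsigned trigger) {
    assert(trigger > 0);
    return static_cast<double>(neighbours) / static_cast<double>(trigger);
}
```

`trigger_count(4800, 8)` gives 600 and `refresh_overhead(8, 600)` gives about 0.0133, i.e. the 1.3% above. Halving the threshold or doubling the radius each doubles the overhead.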
EricP <[email protected]> writes:
Anton Ertl wrote:
A (say) 16-bit counter for each 8Kb row would be a 0.2% overhead.
Admittedly, if you just update the counter for a specific row and then
refresh all rows in the blast radius when a limit is reached, you
may get many more refreshes than the minimum necessary, but given that
normal programs usually do not hammer specific row ranges, the
additional refreshes may still be relatively few in non-attack
situations (and when being attacked, you prefer lower DRAM performance
to a successful attack).
They said that the current threshold for causing flips in an immediate
neighbor is 4800 activations, but with a blast radius of +-4 that
can be in any of the 8 neighbors, so your counter threshold will have
to trigger refresh at 1/8 of that level or every 600 activations.
So only 10 bits of counter are necessary, reducing the overhead to
0.125%:-).
And as the dram features get smaller that threshold number will go down
and probably the blast radius will go up. So this could have scaling
issues in the future.
Yes.
- anton
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger
count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the counters
so the counts are not cumulative.
I think what RowPress tells us is that waiting 60± ms and then
refreshing every row is worse for data retention than spreading the
refreshes out rather evenly over the 64ms max interval.
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC repairs ?!?
with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600 trigger
count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the counters
so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then refreshing
every row
is worse for data retention than spreading the refreshes out over the
64ms max
interval rather evenly.
Would any memory controller do that, refresh the whole dram in one big
burst instead of periodically by row?
I would expect doing so would introduce big stalls into memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
Let's say 50 ns row refresh time.
So that's either 50 ns every 7.8 us
versus 8192*50 ns = 409.6 us memory stall every 64 ms.
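The two refresh schedules can be compared numerically. A small sketch: the 50 ns row refresh time is the guess from the post, not a datasheet value, and the function names are made up.

```cpp
#include <cassert>

// Interval between row refreshes (microseconds) when `rows` refreshes are
// spread evenly over a retention window of `retention_ms`.
double refresh_interval_us(double retention_ms, double rows) {
    assert(rows > 0.0);
    return retention_ms * 1000.0 / rows;
}

// Length of a single stall (microseconds) if all rows are refreshed back to
// back, each taking `row_refresh_ns`.
double burst_stall_us(double rows, double row_refresh_ns) {
    return rows * row_refresh_ns / 1000.0;
}

// Fraction of time spent refreshing; the same for both schedules.
double refresh_busy_fraction(double retention_ms, double rows,
                             double row_refresh_ns) {
    assert(retention_ms > 0.0);
    return burst_stall_us(rows, row_refresh_ns) / (retention_ms * 1000.0);
}
```

For 64 ms, 8192 rows and 50 ns this gives a 7.8125 us interval, a 409.6 us burst stall, and 0.64% of time spent refreshing either way. The total cost is identical; only the worst-case latency differs.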
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600
trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the
counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out over
the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically by
row? I would expect doing so would introduce big stalls into memory
access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
bank was active it would allow REF to slip. But on a second timer
event it would interrupt data transfer and induce 2 refreshes to
catch up. In general, this worked well as it almost never happened.
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
When one changes page boundaries the HoB address bits are essentially
randomized by the TLB: why not just close the row at that point ?
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Michael S <[email protected]> writes:
On Sun, 11 Feb 2024 19:57:34 +0000
[email protected] (MitchAlsup1) wrote:
Because memory controller is not aware of CPU page boundaries.
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
AArch64 supports translation granules of 4k, 16k and 64k. 4K
and 64K are the most common. While the architecture defines
16k, an implementation is free to not support it and I'm not aware of
any widespread usage.
On Sun, 11 Feb 2024 19:57:34 +0000
[email protected] (MitchAlsup1) wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary to
prevent Rowhammer, but that approach may still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600
trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3% overhead.
And the whole dram is refreshed every 64 ms reseting all the
counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out over
the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically by
row? I would expect doing so would introduce big stalls into memory
access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if the
back was active it would allow REF to slip. But on a second timer
event it would interrupt data transfer and induce 2 refreshes to
catch up. In general, this worked well as it almost never happened.
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or 0.204 usec for a more realistic rate of 5e9 T/s.
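The transfer-time arithmetic as a small sketch, assuming one 32-bit (4-byte) DDR5 channel as stated above and ignoring command overhead. The function name is made up for illustration.

```cpp
#include <cassert>

// Time (microseconds) to move `bytes` over a channel that carries
// `bytes_per_transfer` per transfer at `transfers_per_s` transfers/second.
double transfer_time_us(double bytes, double bytes_per_transfer,
                        double transfers_per_s) {
    assert(bytes_per_transfer > 0.0 && transfers_per_s > 0.0);
    return bytes / (bytes_per_transfer * transfers_per_s) * 1e6;
}
```

`transfer_time_us(4096, 4, 6e9)` comes out at roughly 0.171 us and `transfer_time_us(4096, 4, 5e9)` at roughly 0.205 us, so a 4 KB page takes well under a microsecond on one channel.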
When one changes page boundaries the HoB address bits are essentially
randomized by the TLB:: why not just close the row at that point ?
Because memory controller is not aware of CPU page boundaries.
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
Michael S wrote:
On Sun, 11 Feb 2024 19:57:34 +0000
[email protected] (MitchAlsup1) wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary
to prevent Rowhammer, but that approach may still be good
enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Lets see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600
trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole dram is refreshed every 64 ms reseting
all the counters so the counts are not cumulative.
I think what RowPress tells us that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out
over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically
by row? I would expect doing so would introduce big stalls into
memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if
the back was active it would allow REF to slip. But on a second
timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Lets say 50 ns row refresh time.
So thats either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row at
that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
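One way to read "Bits<19:12> changed" is a comparison of successive physical addresses in the controller. A minimal sketch under that assumption; the names are hypothetical and nothing here says how a real controller would implement it.

```cpp
#include <cassert>
#include <cstdint>

// Mask selecting physical address bits 19 down to 12.
const std::uint64_t kBits19to12 = 0xFF000;

// True when bits <19:12> differ between two successive addresses.
bool bits_19_12_changed(std::uint64_t prev, std::uint64_t cur) {
    return ((prev ^ cur) & kBits19to12) != 0;
}

// A few sanity checks on the mask.
void check_examples() {
    assert(bits_19_12_changed(0x0FFF, 0x1000));   // crosses a 4 KB boundary
    assert(!bits_19_12_changed(0x1000, 0x1FFF));  // same 4 KB page
}
```

In hardware this is an 8-bit XOR and an OR reduction, so detection itself is cheap.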
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
verses 8192*50 ns = 409.6 us memory stall every 64 ms.
On Mon, 12 Feb 2024 22:45:08 +0000
[email protected] (MitchAlsup1) wrote:
Michael S wrote:
On Sun, 11 Feb 2024 19:57:34 +0000
[email protected] (MitchAlsup1) wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than necessary
to prevent Rowhammer, but that approach may still be good
enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails ?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Let's see how bad this is.
The single line threshold of 4800 and blast radius of 8 = 600
trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole DRAM is refreshed every 64 ms, resetting
all the counters so the counts are not cumulative.
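The overhead estimate just quoted can be checked with integer arithmetic. This is only a sketch using the thread's round numbers (threshold 4800, blast radius 8, 8 extra row refreshes per trigger); the function names `trigger_count` and `overhead_bp` are made up for illustration and do not come from any real controller.

```cpp
#include <cstdint>

// Hypothetical helper: number of activations after which the controller
// reacts, taken as the per-row threshold divided by the blast radius.
constexpr std::int64_t trigger_count(std::int64_t row_threshold,
                                     std::int64_t blast_radius) {
    return row_threshold / blast_radius;
}

// Hypothetical helper: extra refresh work per trigger, expressed in
// hundredths of a percent (basis points) so it stays in integers.
constexpr std::int64_t overhead_bp(std::int64_t extra_refreshes,
                                   std::int64_t trigger) {
    return extra_refreshes * 10000 / trigger;
}

static_assert(trigger_count(4800, 8) == 600, "4800 / 8 = 600 activations");
static_assert(overhead_bp(8, 600) == 133, "8 / 600 is about 1.33%");
```

The result is a worst-case figure: it assumes every run of 600 activations triggers the extra refreshes.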
I think what RowPress tells us is that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out
over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of periodically
by row? I would expect doing so would introduce big stalls into
memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and if
the back was active it would allow REF to slip. But on a second
timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Let's say 50 ns row refresh time.
So that's either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row at
that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Do you always answer one statement before reading the next statement?
Besides, in aarch64 world 16KB pages are rather common. And in x86
world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
What scenario of attack do you have in mind?
I would think that neither in "classic" multi-side Row Hammer nor in
Row Press does the attacker have to cross CPU page boundaries. If he
(attacker) happens to know that the memory controller likes to close
DRAM rows on any particular address boundary, then he can easily avoid
accessing the last cache line before that particular boundary.
BTW, all these attacks (or should I say, all these PoCs, because I don't
think that anybody ever caught a real RH/RP attack launched by a real bad
guy) rather heavily depend on big or huge pages. They are close to
impossible with small pages, even when "small" means 16 KB rather than
4 KB.
versus 8192*50 ns = 409.6 us memory stall every 64 ms.
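The two refresh figures being compared (one row every 7.8125 us versus one 409.6 us burst every 64 ms) can be reproduced as follows. This is a minimal sketch, assuming the thread's round numbers of 8192 rows, a 64 ms window, and 50 ns per row refresh; the helper names are hypothetical.

```cpp
#include <cstdint>

// All times are in picoseconds so the arithmetic stays exact in integers.
constexpr std::int64_t kPsPerNs = 1000;
constexpr std::int64_t kPsPerMs = 1000000000;

// Interval between row refreshes when they are spread evenly over the window.
constexpr std::int64_t refresh_interval_ps(std::int64_t window_ms,
                                           std::int64_t rows) {
    return window_ms * kPsPerMs / rows;
}

// Length of the stall if every row is refreshed back to back in one burst.
constexpr std::int64_t burst_stall_ps(std::int64_t rows,
                                      std::int64_t row_refresh_ns) {
    return rows * row_refresh_ns * kPsPerNs;
}

static_assert(refresh_interval_ps(64, 8192) == 7812500, "7.8125 us per row");
static_assert(burst_stall_ps(8192, 50) == 409600000, "409.6 us burst stall");
```

Both schedules spend the same total time refreshing; the difference is whether it arrives as 8192 stalls of 50 ns or as one stall of 409.6 us.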
Michael S wrote:
On Mon, 12 Feb 2024 22:45:08 +0000
[email protected] (MitchAlsup1) wrote:
Michael S wrote:
On Sun, 11 Feb 2024 19:57:34 +0000
[email protected] (MitchAlsup1) wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
Anton Ertl wrote:
[email protected] (MitchAlsup1) writes:
Anton Ertl wrote:
Both disadvantages lead to far more refreshes than
necessary to prevent Rowhammer, but that approach may
still be good enough.
Would you rather have a few more refreshes or a few more ECC
repairs ?!? with the potential for a few ECC repair fails
?!!?
I believe Rowhammer and RowPress can flip many bits at once.
Too many for SECDED.
Let's see how bad this is.
The single line threshold of 4800 and blast radius of 8 =
600 trigger count.
That triggers an extra 8 row refreshes, so 8/600 = 1.3%
overhead. And the whole DRAM is refreshed every 64 ms,
resetting all the counters so the counts are not cumulative.
I think what RowPress tells us is that waiting 60± ms and then
refreshing every row
is worse for data retention than spreading the refreshes out
over the 64ms max
interval rather evenly.
Would any memory controller that would do that,
refresh the whole dram in one big burst instead of
periodically by row? I would expect doing so would introduce
big stalls into memory access.
64 ms / 8192 rows per block = 7.8125 us row interval.
My DRAM controller (Opteron RevF) had a timer set about 7µs and
if the back was active it would allow REF to slip. But on a
second timer event it would interrupt data transfer and induce 2
refreshes to catch up. In general, this worked well as it almost
never happened.
Let's say 50 ns row refresh time.
So that's either 50 ns every 7.8 us
A DDR5 at 6GBits/s transmits a 4096 byte page in 5µs.
DDR5 channel is 32-bit.
4096B/(4B/T * 6e9 T/s) = 0.171 usec.
Or for more 0.204 usec for more realistic rate of 5e9 T/s
When one changes page boundaries the HoB address bits are
essentially randomized by the TLB:: why not just close the row
at that point ?
Because memory controller is not aware of CPU page boundaries.
Bits<19:12> changed. How hard is that to detect ??
Do you always answer one statement before reading the next
statement?
I actually wrote the above after writing the below.
Besides, in aarch64 world 16KB pages are rather common. And in
x86 world "transparent huge pages" are rather common.
Neither of which prevent closing the row to avoid memory retention
issues.
What scenario of attack do you have in mind?
RowPress depends on keeping the row open too long--clearly evident in
the charts in the document.
I would think that neither in "classic" multi-side Row Hammer nor
in Row Press attacker has to cross CPU page boundaries. If he
(attacker) happens to know that memory controller likes to close
DRAM rows on any particular address boundary, then he can easily
avoid accessing last cache line before that particular boundary.
RowHammer depends on closing the row too often.
Performance (single CPU) depends on allowing the open row to service
several pending requests streaming data at CAS access speeds.
There is a balance to be found by preventing RowHammer from opening
nearby rows too often and in preventing RowPress from holding them
open for too long.
I happen to think (without evidence beyond that of the RowPress
document) that the balance is distributing refreshes evenly across
the refresh interval (as evidenced in the charts in the RowPress
document). It ends up that with modern DDR this enables about 4096
bytes to be read/written to a row before closing it (within a factor
of 2-4).
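For scale, the time to stream 4096 bytes out of an open row can be worked out from the figures given earlier in the thread (32-bit DDR5 channel, so 4 bytes per transfer, at 6e9 or 5e9 transfers per second). This is a rough sketch that ignores activate, precharge, and command overhead; `transfer_time_ps` is a hypothetical name.

```cpp
#include <cstdint>

// Time in picoseconds to stream `bytes` over one channel that moves
// `bytes_per_transfer` bytes on each of `transfers_per_second` transfers.
constexpr std::int64_t transfer_time_ps(std::int64_t bytes,
                                        std::int64_t bytes_per_transfer,
                                        std::int64_t transfers_per_second) {
    return bytes * 1000000000000LL /
           (bytes_per_transfer * transfers_per_second);
}

static_assert(transfer_time_ps(4096, 4, 6000000000LL) == 170666,
              "about 0.171 us at 6e9 T/s");
static_assert(transfer_time_ps(4096, 4, 5000000000LL) == 204800,
              "about 0.205 us at 5e9 T/s");
```

Either figure is well below the 7.8125 us per-row refresh interval discussed above.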
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowPress depends on keeping the row open too long--clearly evident in
the charts in the document.
Clarification for casual observers that didn't bother to read the
RowPress paper: a RowPress attack does not depend on keeping the row
open continuously.
Short interruptions actually greatly improve the effectiveness of the
attack, significantly increasing BER for a given duration of attack.
After all, RowPress *is* a variant of RowHammer.
For a given interruption rate, longer interruptions reduce effectiveness
of attack, but not dramatically so. For example, for the most practically
important interruption rate of 128 kHz (period=7.81 usec), increasing the
duration of the off interval from the absolute minimum allowed by protocol
(~50ns) to 2 usec reduces efficiency of attack only by a factor of 2 or 3.
Michael S wrote:
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowPress depends on keeping the row open too long--clearly evident
in the charts in the document.
Clarification for casual observers that didn't bother to read Row
Press paper: RowPress attack does not depend on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of attack
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer lowers
the limit on the adjacent rows before opens (RowHammers) causes bit
flips.
Also the paper notes that DRAM manufacturers, eg Micron and
Samsung, already document that keeping a row open longer can cause read-disturbance. What's new is the paper documents the interaction
between row activation time and the subsequent number of opens
(RowHammers) needed to flip a bit.
Also note that different bits are susceptible to RowPress and
RowHammer. See section 4.3
RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf
"RowPress breaks memory isolation by keeping a DRAM row open for a
long period of time, which disturbs physically nearby rows enough to
cause bitflips. We show that RowPress amplifies DRAM’s vulnerability
to read-disturb attacks by significantly reducing the number of row activations needed to induce a bitflip by one to two orders of
magnitude under realistic conditions. In extreme cases, RowPress
induces bitflips in a DRAM row when an adjacent row is activated only
once."
"We show that keeping a DRAM row (i.e., aggressor row) open for a long
period of time (i.e., a large aggressor row ON time, tAggON) disturbs physically nearby DRAM rows. Doing so induces bitflips in the victim
row without requiring (tens of) thousands of activations to the
aggressor row."
For a given interruption rate, longer interruptions reduce
effectiveness of attack, but not dramatically so. For example, for
most practically important interruption rate of 128 KHz
(period=7.81 usec) increasing duration of off interval from
absolute minimum allowed by protocol (~50ns) to 2 usec reduces
efficiency of attack only by factor of 2 or 3.
Reduced by a factor of up to 363. Under figure 1.
"We observe that as tAggON increases, compared to the most effective RowHammer pattern, the most effective Row-Press pattern reduces ACmin
1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
increases."
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowHammer depends on closing the row too often.
Yes, except that it is unknown whether major RH impact is done by
closing the row or by opening it. The latter is more likely. But since
the rate of opening and closing is the same, this finer difference is
not important.
Michael S wrote:
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowHammer depends on closing the row too often.
Yes, except that it is unknown whether major RH impact is done by
closing the row or by opening it. The latter is more likely. But
since the rate of opening and closing is the same, this finer
difference is not important.
A Deeper Look into RowHammer's Sensitivities: Experimental Analysis
of Real DRAM Chips and Implications on Future Attacks and Defenses,
2021 https://arxiv.org/pdf/2110.10291
That paper pre-dates the RowPress one and notes:
"6.1 Impact of Aggressor Row's On-Time
Obsv. 8. As the aggressor row stays active longer (i.e., tAggON
increases), more DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at lower hammer counts.
Obsv. 9. RowHammer vulnerability consistently worsens as tAggON
increases in DRAM chips from all four manufacturers.
6.2 Impact of Aggressor Row's Off-Time
Obsv. 10. As the bank stays precharged longer (i.e., tAggOFF
increases), fewer DRAM cells experience RowHammer bit flips and they
experience RowHammer bit flips at higher hammer counts.
Obsv. 11. RowHammer vulnerability consistently reduces as
tAggOFF increases in DRAM chips from all four manufacturers."
On Tue, 13 Feb 2024 11:24:10 -0500
EricP <[email protected]> wrote:
Michael S wrote:
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowPress depends on keeping the row open too long--clearly evident
in the charts in the document.
Clarification for casual observers that didn't bother to read Row
Press paper: RowPress attack does not depend on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of attack
significantly increasing BER for a given duration of attack. After
all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer lowers
the limit on the adjacent rows before opens (RowHammers) causes bit
flips.
Correct, but irrelevant.
Also the paper notes that DRAM manufacturers, eg Micron and
Samsung, already document that keeping a row open longer can cause
read-disturbance. What's new is the paper documents the interaction
between row activation time and the subsequent number of opens
(RowHammers) needed to flip a bit.
Correct and relevant, but not to the issue at hand which is criticism
of Mitch's ideas of mitigation.
Also note that different bits are susceptible to RowPress and
RowHammer. See section 4.3
RowPress Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf
For a given interruption rate, longer interruptions reduce
effectiveness of attack, but not dramatically so. For example, for
most practically important interruption rate of 128 KHz
(period=7.81 usec) increasing duration of off interval from
absolute minimum allowed by protocol (~50ns) to 2 usec reduces
efficiency of attack only by factor of 2 or 3.
Reduced by a factor of up to 363. Under figure 1.
"We observe that as tAggON increases, compared to the most effective
RowHammer pattern, the most effective Row-Press pattern reduces ACmin
1) by 17.6× on average (up to 40.7×) when tAggON is as large as the
refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
increases."
ACmin by itself is a wrong measure of efficiency of attack.
The right measure is reciprocal of the total duration of attack.
At any given duty cycle reciprocal of the total duration of attack
grows with increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants, the only difference that dependency on hammering rate is somewhat weaker.
Relatively weak influence of duty cycle itself is shown in figure 22.
The practical significance of RowPress is due to two factors.
(1) is the one you mentioned above - it can flip
different bits from those flippable by other RH variants.
(2) is that it is not affected at all by the DDR4 TRR
attempt at mitigation.
The third, less important factor is that RowPress appears quite robust
to differences between major manufacturers.
However, one should not overlook that efficiency of RowPress attacks
when measured by the most important criterion of BER per duration of
attack is many times lower than earlier techniques of double-sided and multi-sided hammering.
Michael S wrote:
On Tue, 13 Feb 2024 11:24:10 -0500
EricP <[email protected]> wrote:
Michael S wrote:
On Tue, 13 Feb 2024 00:19:18 +0000
[email protected] (MitchAlsup1) wrote:
RowPress depends on keeping the row open too long--clearly
evident in the charts in the document.
Clarification for casual observers that didn't bother to read Row
Press paper: RowPress attack does not depend on keeping row open
continuously.
Short interruptions actually greatly improve effectiveness of
attack significantly increasing BER for a given duration of
attack. After all, RowPress *is* a variant of RowHammer.
RowPress documents that keeping the aggressor row open longer
lowers the limit on the adjacent rows before opens (RowHammers)
causes bit flips.
Correct, but irrelevant.
It was kinda the whole point of the RowPress paper.
Also the paper notes that DRAM manufacturers, eg Micron and
Samsung, already document that keeping a row open longer can cause
read-disturbance. What's new is the paper documents the interaction
between row activation time and the subsequent number of opens
(RowHammers) needed to flip a bit.
Correct and relevant, but not to the issue at hand which is
criticism of Mitch's ideas of mitigation.
Also note that different bits are susceptible to RowPress and
RowHammer. See section 4.3
RowPress Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://people.inf.ethz.ch/omutlu/pub/RowPress_isca23.pdf
I just found out that there are two different versions of the RowPress
paper and I was looking at the older one. The updated version is:
RowPress: Amplifying Read Disturbance in Modern DRAM Chips, 2023
https://arxiv.org/pdf/2306.17061.pdf
For a given interruption rate, longer interruptions reduce
effectiveness of attack, but not dramatically so. For example, for
most practically important interruption rate of 128 KHz
(period=7.81 usec) increasing duration of off interval from
absolute minimum allowed by protocol (~50ns) to 2 usec reduces
efficiency of attack only by factor of 2 or 3.
Reduced by a factor of up to 363. Under figure 1.
"We observe that as tAggON increases, compared to the most
effective RowHammer pattern, the most effective Row-Press pattern
reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
as large as the refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
increases."
ACmin by itself is a wrong measure of efficiency of attack.
I'm not interested in the efficiency of the attack.
ACmin, the minimum absolute count of opens above which we lose data
is the number I'm interested in.
The right measure is reciprocal of the total duration of attack.
At any given duty cycle reciprocal of the total duration of attack
grows with increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants, the only difference that dependency on hammering rate is somewhat weaker.
Relatively weak influence of duty cycle itself is shown in figure
22.
Looking at figure 22 on the arxiv version of the paper,
this is a completely different test. This test was to explain the
discrepancy between the RowPress results and the earlier cited papers.
BER is the fraction of DRAM cells in a DRAM row that experience
bitflips. It's a different measure because RowPress detects when ANY
data loss begins, not the fraction of lost data bits (efficiency)
after it kicks in.
Obsv 16 explains it: the BER for the bottom two lines,
which are the ones with a long total tA2A, goes up in all graphs
by a factor of between 10 and about 500, which is the RowPress effect.
To my eye what this test shows is the PRE phase may *heal* some of the damaging effects that the ACT phase causes, but only to a certain
point. Possibly the PRE phase scavenges the ACT hot injection
carriers.
The practical significance of RowPress is due to two factors.
(1) is the one you mentioned above - it can flip
different bits from those flippable by other RH variants.
(2) is that it is not affected at all by the DDR4 TRR
attempt at mitigation.
I take away something completely different: there are multiple
interacting error mechanisms at work here. RowHammer and RowPress are
likely completely different physics and fixing one won't fix the
other.
It also suggests there may be other similar mechanisms waiting to be
found.
The third, less important factor is that RowPress appears quite
robust to differences between major manufacturers.
However, one should not overlook that efficiency of RowPress attacks
when measured by the most important criterion of BER per duration of
attack is many times lower than earlier techniques of double-sided
and multi-sided hammering.
For me the BER is irrelevant if it is above 0.0.
I want to know where the errors start which is ACmin.
On Wed, 14 Feb 2024 10:51:47 -0500
EricP <[email protected]> wrote:
Michael S wrote:
On Tue, 13 Feb 2024 11:24:10 -0500
EricP <[email protected]> wrote:
"We observe that as tAggON increases, compared to the most
effective RowHammer pattern, the most effective Row-Press pattern
reduces ACmin 1) by 17.6× on average (up to 40.7×) when tAggON is
as large as the refresh interval (7.8 μs),
2) by 159.4× on average (up to 363.8×) when tAggON is 70.2 μs,
the maximum allowed tAggON, and
3) down to only one activation for an extreme tAggON of 30 ms
(highlighted by dashed red boxes).
Also see "Obsv. 1. RowPress significantly reduces ACmin as tAggON
increases."
ACmin by itself is a wrong measure of efficiency of attack.
I'm not interested in the efficiency of the attack.
ACmin, the minimum absolute count of opens above which we lose data
is the number I'm interested in.
You may be interested, but I don't understand why.
For me, the important thing is how much time it takes until the
probability of a flip becomes significant.
Suppose, attack (A) hammers at 5 MHz and has ACmin=5e4. Attack (B)
hammers at 0.13 MHz (typical for RP in real-world setup) and has
ACmin=3e3.
Then I'd say that attack (A) is 2.3 times more dangerous.
Back to real world, researchers demonstrated that multi-side
hammering can have ACmin that is significantly lower than our imaginary
attack (A), so the only remaining question is how fast can we hammer
without triggering TRR. My 5MHz number is probably hard to achieve for
an attacker, but 2-3 MHz sounds doable.
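The "2.3 times" figure for the two imaginary attacks follows from dividing each ACmin by its hammering rate. A minimal sketch of that arithmetic, using only the numbers given above (the helper name `time_to_acmin_ns` is made up for illustration):

```cpp
#include <cstdint>

// Nanoseconds needed to perform `ac_min` activations at `rate_hz`
// activations per second.
constexpr std::int64_t time_to_acmin_ns(std::int64_t ac_min,
                                        std::int64_t rate_hz) {
    return ac_min * 1000000000LL / rate_hz;
}

// Attack (A): 5 MHz, ACmin = 5e4. Attack (B): 0.13 MHz, ACmin = 3e3.
constexpr std::int64_t kAttackA_ns = time_to_acmin_ns(50000, 5000000);
constexpr std::int64_t kAttackB_ns = time_to_acmin_ns(3000, 130000);

static_assert(kAttackA_ns == 10000000, "attack A reaches ACmin in 10 ms");
static_assert(kAttackB_ns == 23076923, "attack B needs about 23.1 ms");
static_assert(kAttackB_ns * 10 / kAttackA_ns == 23,
              "B takes about 2.3x as long as A");
```

So attack (B) has the lower ACmin but still takes about 2.3 times as long to reach it.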
The right measure is reciprocal of the total duration of attack.
At any given duty cycle reciprocal of the total duration of attack
grows with increased rate of interruptions (a.k.a. hammering rate).
The general trend is the same as for all other RH variants, the only
difference that dependency on hammering rate is somewhat weaker.
Relatively weak influence of duty cycle itself is shown in figure
22.
Looking at figure 22 on the arxiv version of the paper,
this is a completely different test. This test was to explain the
discrepancy between the RowPress results and the earlier cited papers.
BER is the fraction of DRAM cells in a DRAM row that experience
bitflips. It's a different measure because RowPress detects when ANY
data loss begins, not the fraction of lost data bits (efficiency)
after it kicks in.
Obsv 16 explains it: the BER for the bottom two lines,
which are the ones with a long total tA2A, goes up in all graphs
by a factor of between 10 and about 500, which is the RowPress effect.
To my eye what this test shows is the PRE phase may *heal* some of the
damaging effects that the ACT phase causes, but only to a certain
point. Possibly the PRE phase scavenges the ACT hot injection
carriers.
The practical significance of RowPress is due to two factors.
(1) is the one you mentioned above - it can flip
different bits from those flippable by other RH variants.
(2) is that it is not affected at all by the DDR4 TRR
attempt at mitigation.
I take away something completely different: there are multiple
interacting error mechanisms at work here. RowHammer and RowPress are
likely completely different physics and fixing one won't fix the
other.
Different like coupling in different frequency bands - yes.
But both caused by insufficient isolation.
It also suggests there may be other similar mechanisms waiting to be
found.
The third, less important factor is that RowPress appears quite
robust to differences between major manufacturers.
However, one should not overlook that efficiency of RowPress attacks
when measured by the most important criterion of BER per duration of
attack is many times lower than earlier techniques of double-sided
and multi-sided hammering.
For me the BER is irrelevant if it is above 0.0.
I want to know where the errors start which is ACmin.
So, call it time to first flip. The principle is the same.
Still, MSRH causes harm faster than RP.