Vir Campestris <[email protected]d> writes:
[...]
I was hoping for some feedback, even if very brief,
before I continue.
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
I got the code. [...]
I checked. The modulus operations (%) are easy to check for.
There are two for each prime number. Not for each multiple of it, but
for the primes themselves.
And there's one in the "get" function that checks to see if a number
is prime. [...]
And one divide for each modulus.
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
N/30 , N%30
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
(i*30+a) * (j*30+b)
which is
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
Does this all make sense?
On 23/07/2024 15:34, Tim Rentsch wrote:
Vir Campestris <[email protected]d> writes:
[...]
I was hoping for some feedback, even if very brief,
before I continue.
Sorry, I was slow. It's really odd, I'm retired, and I ought to
have lots of time for things like this... and yet I seem to be
busy all the time.
On 20/07/2024 15:41, Tim Rentsch wrote:[...]
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
N/30 , N%30
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.
(i*30+a) * (j*30+b)
which is
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
Does this all make sense?
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
What I had considered is an extrapolation from my original code.
On 20/07/2024 15:41, Tim Rentsch wrote:
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
I had also considered looking for a mod-div value greater than 30.
Bonita's original code stored a bit for every prime. We can call
that a compression ratio of 1. I store the odd only, so the
compression ratio is 2. By storing mod30 you increase that to
3.75. But there's a cost on modern computers because it implies
lots of byte level access to store.
If on the other hand we store mod(2*3*5*7*11), which is 2310, we
get 480 candidates. 480 isn't far off 512 which is a nice size
for us to handle, and gives us an even better compression ratio of
just over 4.5. But of course all those tables with 8 or 30
entries will be getting a bit big and painful.
On 20/07/2024 15:41, Tim Rentsch wrote:[...]
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
What I had considered is an extrapolation from my original code.
For each prime my original code starts with p^2, and sets the bit.
It than adds p*2, and sets the bit for that. Since I wasn't
storing even numbers every prime is odd, which means p^2 is odd.
p^2+p would be even, so I skip that and go on to p^2+p*2, and so
on.
Given the mod30 storage I should be able to make a table with 8
entries, each containing a step and a mask. The step is the
number of bytes in the big table to step to for the next bit to
set; the mask is the bit to set.
For 37, for example, the first bits we need to set are these:
Mul'ple value /30 Step %30
37 1369 45 19
41 1517 50 5 17
47 1739 57 7 29
49 1813 60 3 13
53 1961 65 5 11
59 2183 72 7 23
61 2257 75 3 7
67 2479 82 7 19
71 2627 87 5 17
77 2849 94 7 29
79 2923 97 3 13
83 3071 102 5 11
89 3293 109 7 23
91 3367 112 3 7
You'll see the pattern; there is a repeat 8 lines long of both
the Step (which is the delta in the word we need to access) and
the %30 value.
So what the code should do is start at p^2, and build this table.
It'll be different for every prime, and will be quite expensive to
build. But once the table is built it can add the step onto the
current table index, then set the bit corresponding to the %30
value. The table should of course contain only the step and the
bitmask corresponding to the %30 value.
(I used 37 in this example BTW, because I wanted a value more than
30, and with 31 the /30 column increments by 1 each time I added a
prime)
I suspect the cost of extracting the divided value from the 7Two tables, absolutely. What does BICBW stand for?
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.
Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.
Vir Campestris <[email protected]d> writes:
On 23/07/2024 15:34, Tim Rentsch wrote:
Vir Campestris <[email protected]d> writes:
[...]
I was hoping for some feedback, even if very brief,
before I continue.
Sorry, I was slow. It's really odd, I'm retired, and I ought to
have lots of time for things like this... and yet I seem to be
busy all the time.
No worries, I understand.
I'm sorry to be so long in responding. Among other things
there was some sort of snafu with my news server, resulting
in hundreds of lost and/or duplicated messages, and that
took a while to sort out.
On 20/07/2024 15:41, Tim Rentsch wrote:[...]
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
N/30 , N%30
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.
Two tables, absolutely. What does BICBW stand for?
(i*30+a) * (j*30+b)
which is
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
Does this all make sense?
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
What I had considered is an extrapolation from my original code.
I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.
typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;
static void remove_multiples( Index, Count, Index );
static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };
Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;
bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit );
}
}
return r;
}
static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
};
static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};
static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};
void
remove_multiples( Index i, Count shift_a, Index i_limit ){
Index j, k;
NN a = mods[shift_a], b, c;
Count shift;
for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}
In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)
Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}
We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).
After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?
The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.
I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.
Hope to send response to send part sometime soon.
(replying to two messages at once)
On 10/08/2024 15:07, Tim Rentsch wrote:
I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.
Two tables, absolutely. What does BICBW stand for?
But I Could Be Wrong. I thought that one was fairly well known.
On 11/08/2024 08:00, Tim Rentsch wrote:
Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.
I started coding that up. Then it occurred to me that there wasn't
going to be much saving by storing a pointer (8 bytes on my machine)
to the correct 8-byte long table. I suppose I could store a byte array
index instead, but I've gone off in other directions.
You've picked up my cache management tweak.
I'll read through your code. But it might not be today. We've got the hottest day of the year here, and I don't have aircon in my garden
office. In winter the computer and me are enough to keep it warm, but
right now it's getting warmer by the minute.
I'm going to come back to this slightly earlier version, and add
comments inline.
On 10/08/2024 15:07, Tim Rentsch wrote:
Vir Campestris <[email protected]d> writes:
On 20/07/2024 15:41, Tim Rentsch wrote:
Vir Campestris <[email protected]d> writes:
On 02/07/2024 07:20, Tim Rentsch wrote:
[...]
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
N/30 , N%30
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
(i*30+a) * (j*30+b)
which is
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
Does this all make sense?
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
What I had considered is an extrapolation from my original code.
I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.
typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;
I've been using uint64_t, not unsigned long. When I started C
unsigned long was 32 bits; I'm never quite sure what you'll get.
static void remove_multiples( Index, Count, Index );
static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };
Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;
bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit ); >> }
}
return r;
}
static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7, >> };
static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 }, >> { 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 }, >> { 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 }, >> { 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 }, >> { 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 }, >> { 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 }, >> { 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 }, >> { 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 }, >> };
static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 }, >> { 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 }, >> { 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 }, >> { 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 }, >> { 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 }, >> { 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 }, >> { 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 }, >> { 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 }, >> };
constexpr on those tables might give the optimiser something else to
play with.
void
remove_multiples( Index i, Count shift_a, Index i_limit ){
So i is (prime/30) and shift_a is the bit in the mask byte -
corresponding to (prime%30)?
Index j, k;
NN a = mods[shift_a], b, c;
Count shift;
for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}
In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)
Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}
Before unrolling always benchmark. (Modern compilers are pretty smart,
and sometimes will unroll for you. Modern processors with branch
prediction are pretty good about working out when to jump and when not
to. And with speculative execution they sometimes run both paths from
a branch... I could go on!)
We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).
After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?
The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.
And now a few minutes later as I read that I understand the reason
for the unrolling better. But my point still stands - benchmark
it.
Of course the decision on whether to unroll or not might depend on
the target CPU!
I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.
Hope to send response to send part sometime soon.
You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was too expensive. But it's also (I think, I haven't coded it) to work out
whether you need to step 2, 4, 8 times the prime. And that would save
more memory.
Another thought BTW - whether it's worth storing with a larger modulo
- say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
whether the steps between primes should take that into consideration.
Vir Campestris <[email protected]d> writes:
You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was
too expensive. But it's also (I think, I haven't coded it) to work
out whether you need to step 2, 4, 8 times the prime. And that
would save more memory.
The possible steps for mod30 are 2 or 4 or 6 times the prime.
Another thought BTW - whether it's worth storing with a larger
modulo - say 2*3*5*7 - not just 2*3*5 is a different decision from
deciding whether the steps between primes should take that into
consideration.
My guess is that the possible steps are still just 2 or 4 or 6
(times the prime), but I haven't checked that.
But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
Am 20.08.2024 um 17:21 schrieb Bonita Montero:
Am 19.08.2024 um 22:23 schrieb Vir Campestris:
On 16/08/2024 18:35, Bonita Montero wrote:
<snip>
But basically I don't think it is a good idea to skip numbers exeptIt's not just storage you save, it's also computation.
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
That program I published up-thread is almost as fast as yours -
within 10% of the elapsed time - while only using a single core. The
amount of CPU used is much lower.
Andy
On my AMD 7990X all primes up to 2 ^ 32 are calculated with my code
with a 65W-setting of the CPU in 0.168s. The total CPU-time is 3.688s.
With your code all primes up to 2 ^ 32 take nearly exactly four seconds
with a single thread. So the overall computation time with my algorithm
is about seconds less.
And if I run my code with a single core:
    C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 0x100000000 "" 1
    real  1.795s
    user  1.766s
    sys   0.000s
    cylces 8.011.804.500
Thats less than half the computation time.
So I'm a little late, but here's my effort to use the modulo 30 trick.
Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x, 64GB RAM
g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s
Am 19.08.2024 um 22:34 schrieb Vir Campestris:
I wrote a small program
Bonita Montero <[email protected]> writes:
Am 19.08.2024 um 22:34 schrieb Vir Campestris:
I wrote a small program
Most people would use the standard command line
tools to do this (e.g. sort, uniq and diff).
I did "findstr /x "66049" primes.txt" on my output and it didn't
find that number with 1E10 as its maximum.
But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
On 16/08/2024 18:35, Bonita Montero wrote:
But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
I've been running some experiments.
Skipping evens only is nice and simple on computation; that's good.
Skipping something else requires table lookups. I've knocked up some
code that uses the correct table for skipping primes (but not for mask
bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.
So I'm a little late, but here's my effort to use the modulo 30
trick.
Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x,
64GB RAM
g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s
[..program..]
On 20/08/2024 05:08, red floyd wrote:
So I'm a little late, but here's my effort to use the modulo
30 trick.
Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5
5600x, 64GB RAM
g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s
It's slow.
Looking quickly at the (remarkably small) code I see line 30:
return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);
That mod and div are being called for every single time you
mark a prime in the bitmap.
On 8/22/2024 1:56 PM, Vir Campestris wrote:
On 16/08/2024 18:35, Bonita Montero wrote:
But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.
I've been running some experiments.
Skipping evens only is nice and simple on computation; that's good.
Skipping something else requires table lookups. I've knocked up some
code that uses the correct table for skipping primes (but not for
mask bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.
You can skip multiples of three, by starting with 7, and then
alternately adding 4 then 2.
e.g.
incr = 2;
for (val = 7 ; val < MAX_PRIME_TO_CHECK ; val += incr)
{
check_for_prime(val);
incr = 6 - incr;
}
On 20/08/2024 16:24, Bonita Montero wrote:
Am 20.08.2024 um 17:21 schrieb Bonita Montero:
Am 19.08.2024 um 22:23 schrieb Vir Campestris:
On 16/08/2024 18:35, Bonita Montero wrote:
<snip>
But basically I don't think it is a good idea to skip numbers
exept multiples of two. [...] I think this extra computation
is higher than the time for the saved memory loads.
I'm leaning towards thinking you may have a point on the mod30
code. There are more operations on the innermost loop with mod30,
although the loop goes around fewer times. It also means that the
store is forced to be byte wide - I find that my original odd-only
code is significantly faster - about 30% - when the store is 64
bit wide rather than byte. [...]
On 16/08/2024 18:35, Bonita Montero wrote:
But basically I don't think it is a good idea to skip numbers
exept multiples of two. [...]
I've been running some experiments.
Skipping evens only is nice and simple on computation; that's
good.
Skipping something else requires table lookups. [...]
I'm happy with my algorithm:
- Work out a bitmask for some small primes, only as far as is
needed to get repeats (the repeat length is the product of the
primes)
- Work out bitmasks for the next half a dozen. Each of those will
have a repeat the length of the prime.
- Combine those together into the output bitmap. That requires
fewer operations than doing each one into the output bitmap
- Process the larger primes. Do the output bitmap in blocks for
cache friendliness
Am 26.08.2024 um 21:08 schrieb Tim Rentsch:
One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.
You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.
On 27/08/2024 05:09, Bonita Montero wrote:
Am 26.08.2024 um 21:08 schrieb Tim Rentsch:
One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.
You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.
So you compute all the primes up to the square root of a larger number.
I was cleaning up my odds-only code with a view to posting it, when I
had an idea.
Take 101 (because I can write things more easily).
I want to mark
10201,10302,10403,10504,10605,10706...
With odds only that becomes
10201,10403,10605,10807,11009,11211...
and as the loop increment is a constant it is nice and fast. With
mod30 it isn't a constant, and it hurts to work out the increment.
Except...
If I mark 10201,13231,16261,19291,22321
with an increment of 30*p,
then go back to 10201, add 2p, then increment that by 30
10201 13231 16261
and do that for the other 6 of the 8, my inner loop has a constant
increment, and my code is compatible with the mod30 store.
That might be quicker. I also have some thoughts over storing the
prime as two parts. It's 30m+r, where m is modulus and r remainder. r
maps one-for-one into the byte mask, and m is the index.
I'm going to waste lots more time on this I can see!
Vir Campestris <[email protected]d> writes:
On 27/08/2024 05:09, Bonita Montero wrote:
Am 26.08.2024 um 21:08 schrieb Tim Rentsch:
One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.
You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.
This is a brainless statement (and incidentally illustrates why I
was motivated not to read postings from BM). In fact no primes need
be pre-calculated to determine whether any given number is prime.
The point of using a sieve is the sieve method is much faster than
not using one. For example, suppose we want to determine all primes
less than a trillion. Taking the approach of pre-computing only
those primes less than a million (the square root) and then testing
numbers individually takes more than 100 times as many operations as
using a sieve. Furthermore the operations used are more expensive
for the non-sieve approach - simply setting a bit in the case of a
sieve, versus computing a remainder in the non-sieve case.
So you compute all the primes up to the square root of a larger number.
I was cleaning up my odds-only code with a view to posting it, when I
had an idea.
Take 101 (because I can write things more easily).
I want to mark
10201,10302,10403,10504,10605,10706...
With odds only that becomes
10201,10403,10605,10807,11009,11211...
Right. Marking of multiples needs to start only at p*p, not
at p+2.
and as the loop increment is a constant it is nice and fast. With
mod30 it isn't a constant, and it hurts to work out the increment.
Except...
If I mark 10201,13231,16261,19291,22321
with an increment of 30*p,
then go back to 10201, add 2p, then increment that by 30
10201 13231 16261
and do that for the other 6 of the 8, my inner loop has a constant
increment, and my code is compatible with the mod30 store.
Yes, reordering the loops this way might very well make things
work better. It also has the property, I believe, that the
bit to set is the same in each of the eight outer loops, and
so can be computed only once for each of the eight outer loops.
Very good!
That might be quicker. I also have some thoughts over storing the
prime as two parts. It's 30m+r, where m is modulus and r remainder. r
maps one-for-one into the byte mask, and m is the index.
This is what I've been saying all along!
I'm going to waste lots more time on this I can see!
I'm interested to see what you come up with.
You can skip multiples of three, by starting with 7, and then
alternately adding 4 then 2.
Tim Rentsch <[email protected]> writes:
Vir Campestris <[email protected]d> writes:
[...]
I tried to post a short followup, but I must have done something
wrong because nothing went out. Long story short, I've been
pulled away from looking at this further due to other tasks
demanding my attention, and also because the Sieve of Aiken
looks more promising (having been reminded recently of
primegen by Daniel Bernstein).
Typo: Sieve of Atkin.
https://en.wikipedia.org/wiki/Sieve_of_Atkin
In any case, thank you for the back and forth, it's been fun.
On 28/09/2024 21:49, Keith Thompson wrote:
Tim Rentsch <[email protected]> writes:
Vir Campestris <[email protected]d> writes:
[...]
I tried to post a short followup, but I must have done something
wrong because nothing went out. Long story short, I've been
pulled away from looking at this further due to other tasks
demanding my attention, and also because the Sieve of Aiken
looks more promising (having been reminded recently of
primegen by Daniel Bernstein).
Typo: Sieve of Atkin.
https://en.wikipedia.org/wiki/Sieve_of_Atkin
In any case, thank you for the back and forth, it's been fun.
[...]
I now have a working version storing all primes as (prime/30
mask(prime%30) and tuned up.
The disappointing thing is that it's not until you get to seriously
big numbers - something in the 1e9 range - that this is faster than
the code I posted before. And even then it's not _much_ faster. About
20% at 1e11.
Over in comp.lang.c this came up too, and there's a link there to the
Sieve of Atkin
<https://en.wikipedia.org/wiki/Sieve_of_Atkin>
and to a program
<http://cr.yp.to/primegen.html>
which implements it.
I'm pleased to be able to report that that program, written for a 1998
CPU, is slower than mine on my modern CPU.
This reaction, more broadly, has been rather surprising. The whole
point of using a sieve is that it is asymptotically faster. Who
cares about how fast primes under a billion, or ten billion, or
fifty billion, can be computed? It's much more interesting to ask
how high a limit can be searched in a day or a week of computing.
(I don't mean just your comment, but comments from other folks
bragging about how fast they can find all 32-bit primes.) I ran
my earlier programs up to limits of 3 trillion or so.
On 07/10/2024 16:41, Tim Rentsch wrote:
This reaction, more broadly, has been rather surprising. The whole
point of using a sieve is that it is asymptotically faster. Who
cares about how fast primes under a billion, or ten billion, or
fifty billion, can be computed? It's much more interesting to ask
how high a limit can be searched in a day or a week of computing.
(I don't mean just your comment, but comments from other folks
bragging about how fast they can find all 32-bit primes.) I ran
my earlier programs up to limits of 3 trillion or so.
My programs all require the sieved prime map to fit in RAM. Which in
my system limits me to about 10^11 primes.
It has occurred to me to extend it for much larger numbers, by paging
out to disc files. But I've wasted far too much time on this already!
| Sysop: | Keyop |
|---|---|
| Location: | Huddersfield, West Yorkshire, UK |
| Users: | 715 |
| Nodes: | 16 (2 / 14) |
| Uptime: | 159:20:21 |
| Calls: | 12,094 |
| Calls today: | 2 |
| Files: | 15,000 |
| Messages: | 6,517,759 |