Forum: >>> Magnum BBS <<<

OT: Re: Sieve of Erastosthenes optimized to the max

From Vir Campestris@21:1/5 to Tim Rentsch on Thu Jul 25 12:46:02 2024

On 23/07/2024 15:34, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

[...]

I was hoping for some feedback, even if very brief,
before I continue.

Sorry, I was slow. It's really odd, I'm retired, and I ought to have
lots of time for things like this... and yet I seem to be busy all the time.

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

I got the code. [...]

I checked. The modulus operations (%) are easy to check for.

There are two for each prime number. Not for each multiple of it, but
for the primes themselves.

And there's one in the "get" function that checks to see if a number
is prime. [...]

And one divide for each modulus.

I found searching my code for the '%' character was quite easy.
'/', however, is in all my comments, so I get lots of false positives.

Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair

N/30 , N%30

stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find

I suspect the cost of extracting the divided value from the 7 bytes will
be prohibitive. You may find it's best to have one table for the N/30
parts (big entries, word aligned) and another for the N%30 part (small
entries, byte aligned). BICBW.

(i*30+a) * (j*30+b)

which is

(i*30*j*30) + (i*30*b) + (j*30*a) + a*b

Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by

30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.

30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.

One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.

Does this all make sense?

Yes, it does. It's not a direction I was thinking of at all, although I
too had table lookups in mind.

What I had considered is an extrapolation from my original code.

For each prime my original code starts with p^2, and sets the bit. It
than adds p*2, and sets the bit for that. Since I wasn't storing even
numbers every prime is odd, which means p^2 is odd. p^2+p would be even,
so I skip that and go on to p^2+p*2, and so on.

Given the mod30 storage I should be able to make a table with 8 entries,
each containing a step and a mask. The step is the number of bytes in
the big table to step to for the next bit to set; the mask is the bit to
set.

For 37, for example, the first bits we need to set are these:

Mul'ple value /30 Step %30
37 1369 45 19
41 1517 50 5 17
47 1739 57 7 29
49 1813 60 3 13
53 1961 65 5 11
59 2183 72 7 23
61 2257 75 3 7

67 2479 82 7 19
71 2627 87 5 17
77 2849 94 7 29
79 2923 97 3 13
83 3071 102 5 11
89 3293 109 7 23
91 3367 112 3 7

You'll see the pattern; there is a repeat 8 lines long of both the Step
(which is the delta in the word we need to access) and the %30 value.

So what the code should do is start at p^2, and build this table. It'll
be different for every prime, and will be quite expensive to build. But
once the table is built it can add the step onto the current table
index, then set the bit corresponding to the %30 value. The table should
of course contain only the step and the bitmask corresponding to the %30
value.

(I used 37 in this example BTW, because I wanted a value more than 30,
and with 31 the /30 column increments by 1 each time I added a prime)

For small values, as we discussed, it's going to be faster to copy the
small table for that prime than to apply the mask to each word of the
store individually.

I had also considered looking for a mod-div value greater than 30.

Bonita's original code stored a bit for every prime. We can call that a compression ratio of 1. I store the odd only, so the compression ratio
is 2. By storing mod30 you increase that to 3.75. But there's a cost on
modern computers because it implies lots of byte level access to store.

If on the other hand we store mod(2*3*5*7*11), which is 2310, we get 480 candidates. 480 isn't far off 512 which is a nice size for us to handle,
and gives us an even better compression ratio of just over 4.5. But of
course all those tables with 8 or 30 entries will be getting a bit big
and painful.

I might even get around to writing some of it. Be warned, this is a time
trap!

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Sat Aug 10 07:07:07 2024

Vir Campestris <[email protected]d> writes:

On 23/07/2024 15:34, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

[...]

I was hoping for some feedback, even if very brief,
before I continue.

Sorry, I was slow. It's really odd, I'm retired, and I ought to
have lots of time for things like this... and yet I seem to be
busy all the time.

No worries, I understand.

I'm sorry to be so long in responding. Among other things
there was some sort of snafu with my news server, resulting
in hundreds of lost and/or duplicated messages, and that
took a while to sort out.

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

[...]

Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair

N/30 , N%30

stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find

I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.

Two tables, absolutely. What does BICBW stand for?

(i*30+a) * (j*30+b)

which is

(i*30*j*30) + (i*30*b) + (j*30*a) + a*b

Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by

30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.

30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.

One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.

Does this all make sense?

Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.

What I had considered is an extrapolation from my original code.

I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.

typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;

static void remove_multiples( Index, Count, Index );

static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;

bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit );
}
}

return r;
}

static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
};

static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};

static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};

void
remove_multiples( Index i, Count shift_a, Index i_limit ){
Index j, k;
NN a = mods[shift_a], b, c;
Count shift;

for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}

for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}

In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)

Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):

for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}

We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).

After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?

The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.

I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.

Hope to send response to send part sometime soon.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to All on Sat Aug 10 17:24:53 2024

Vir Campestris <[email protected]d> writes:

This post is my second response to your comments, more
specifically to one part of your comments.

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

[...]

I had also considered looking for a mod-div value greater than 30.

Bonita's original code stored a bit for every prime. We can call
that a compression ratio of 1. I store the odd only, so the
compression ratio is 2. By storing mod30 you increase that to
3.75. But there's a cost on modern computers because it implies
lots of byte level access to store.

If on the other hand we store mod(2*3*5*7*11), which is 2310, we
get 480 candidates. 480 isn't far off 512 which is a nice size
for us to handle, and gives us an even better compression ratio of
just over 4.5. But of course all those tables with 8 or 30
entries will be getting a bit big and painful.

I tried using a mod 240 encoding, so the units are 64-bit elements,
to see if the more natural sized access would help. The result ran
slower than the mod 30 code.

The idea of using more primes seems like a natural extension, and
maybe it can work. Certainly it can get greater compression, which
is a plus. I had looked before at using a mod 210 encoding (which
is 30 * 7). My sense is that, unless the extra compression is
essential, the added code complexity will hurt performance more
than it helps. But I never got to the point of actually coding
it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to All on Sun Aug 11 00:00:50 2024

Vir Campestris <[email protected]d> writes:

This posting is my third (and I think last) response to your
last set of comments.

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

[...]

What I had considered is an extrapolation from my original code.

About the original code.. One part of it took me a while to
figure out. I would summarize it as follows. Rather than
striking all multiples of a given prime before going on to the
next one, you put a limit on the memory of where the strikes
can occur to keep those bytes in the L1 cache. So several or
many primes are processed at once, as long as the products are
inside the target area (which I think is on the order of half a
megabyte). I haven't done any careful measurements, but it
does appear as though this approach provides a significant
performance gain. (I don't have any estimates for how much.)

A cost of using this method is it takes some memory to keep
track of where any non-finished primes are in their multiple
striking. That cost can be relevant if amount of memory needed
starts to be a significant portion of the L1 cache size.

For each prime my original code starts with p^2, and sets the bit.
It than adds p*2, and sets the bit for that. Since I wasn't
storing even numbers every prime is odd, which means p^2 is odd.
p^2+p would be even, so I skip that and go on to p^2+p*2, and so
on.

Given the mod30 storage I should be able to make a table with 8
entries, each containing a step and a mask. The step is the
number of bytes in the big table to step to for the next bit to
set; the mask is the bit to set.

For 37, for example, the first bits we need to set are these:

Mul'ple value /30 Step %30
37 1369 45 19
41 1517 50 5 17
47 1739 57 7 29
49 1813 60 3 13
53 1961 65 5 11
59 2183 72 7 23
61 2257 75 3 7

67 2479 82 7 19
71 2627 87 5 17
77 2849 94 7 29
79 2923 97 3 13
83 3071 102 5 11
89 3293 109 7 23
91 3367 112 3 7

You'll see the pattern; there is a repeat 8 lines long of both
the Step (which is the delta in the word we need to access) and
the %30 value.

Right. There is a regular pattern of deltas, corresponding to
a kind of strength reduction of doing the multiplies.

So what the code should do is start at p^2, and build this table.
It'll be different for every prime, and will be quite expensive to
build. But once the table is built it can add the step onto the
current table index, then set the bit corresponding to the %30
value. The table should of course contain only the step and the
bitmask corresponding to the %30 value.

Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.

(I used 37 in this example BTW, because I wanted a value more than
30, and with 31 the /30 column increments by 1 each time I added a
prime)

I get a different result for starting with 31 but that's not
important, I expect one of us made a simple calculation mistake.

If we are looking at a prime p = 30*i + a, then the first
delta for the /30 column is

(30*i+a) * (30*i+b) / 30 - (30*i+a) * (30*i+a) / 30

which is

(30*i*i + i*b + i*a + a*b/30) -
(30*i*i + i*a + i*a + a*a/30)

which is

i*(b-a) + (a*b/30 - a*a/30)

when we get to 8 elements later, the /30 column is

(30*i+a) * (30*(i+1)+b) / 30 - (30*i+a) * (30*(i+1)+a) / 30

which is

(30*i*i + i + i*b + i*a + a + a*b/30) -
(30*i*i + i + i*a + i*a + a + a*a/30)

which is again

i*(b-a) + (a*b/30 - a*a/30)

that is, the same delta as the one eight steps before. This
formula shows us how to calculate the deltas, with the help of one
table for the (a*b'/30 - a*b/30) values, and one table for the
(b'-b) values, which need to be multiplied by the appropriate i
value (namely, p/30).

So I think the delta calculation can be written

i*delta_b[x] + delta_bb[x]

where x goes in a circular loop over 8 values [ x0 ... x7 ].

All the above assuming I haven't made an algebraic mistakes, which
certainly is not guaranteed.

Besides showing how to calculate the deltas, the last formula shows
how to reduce the amount of memory needed to save the state of
processing a given prime. Namely, all that is needed is the
running value of what byte holds the product bit, the index i of
the original prime, and the value for x (three bits) along with
something that shows what range it cycles through (another three
bits). It should be easy to hold all that state in two 64-bit
words.

If instead we tried to store a table with 8 entries for each prime
that is still being processed, that would use a lot more memory and
eat into our L1 cache memory budget. At some point we have to
wonder if it would be cheaper to do more recalculation but need
less memory to keep track of where we are. Not a simple problem.

Forgive me if the above seems a bit scattered. I started with a
rough roadmap and just dived in and began writing. Feel free to
ask about anything that isn't clear or doesn't make sense.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Tim Rentsch on Mon Aug 12 15:32:33 2024

(replying to two messages at once)

On 10/08/2024 15:07, Tim Rentsch wrote:

I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.

Two tables, absolutely. What does BICBW stand for?

But I Could Be Wrong. I thought that one was fairly well known.

On 11/08/2024 08:00, Tim Rentsch wrote:

Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.

I started coding that up. Then it occurred to me that there wasn't going
to be much saving by storing a pointer (8 bytes on my machine) to the
correct 8-byte long table. I suppose I could store a byte array index
instead, but I've gone off in other directions.

You've picked up my cache management tweak.

I'll read through your code. But it might not be today. We've got the
hottest day of the year here, and I don't have aircon in my garden
office. In winter the computer and me are enough to keep it warm, but
right now it's getting warmer by the minute.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Tim Rentsch on Thu Aug 15 17:52:10 2024

I'm going to come back to this slightly earlier version, and add
comments inline.

On 10/08/2024 15:07, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 23/07/2024 15:34, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

[...]

I was hoping for some feedback, even if very brief,
before I continue.

Sorry, I was slow. It's really odd, I'm retired, and I ought to
have lots of time for things like this... and yet I seem to be
busy all the time.

No worries, I understand.

I'm sorry to be so long in responding. Among other things
there was some sort of snafu with my news server, resulting
in hundreds of lost and/or duplicated messages, and that
took a while to sort out.

I think that was Eternal September - he had an outage.

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

[...]

Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair

N/30 , N%30

stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find

I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.

Two tables, absolutely. What does BICBW stand for?

(i*30+a) * (j*30+b)

which is

(i*30*j*30) + (i*30*b) + (j*30*a) + a*b

Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by

30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.

30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.

One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.

Does this all make sense?

Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.

What I had considered is an extrapolation from my original code.

I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.

typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;

I've been using uint64_t, not unsigned long. When I started C unsigned
long was 32 bits; I'm never quite sure what you'll get.

static void remove_multiples( Index, Count, Index );

static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;

bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit );
}
}

return r;
}

static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
};

static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};

static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};

constexpr on those tables might give the optimiser something else to
play with.

void
remove_multiples( Index i, Count shift_a, Index i_limit ){

So i is (prime/30) and shift_a is the bit in the mask byte -
corresponding to (prime%30)?

Index j, k;
NN a = mods[shift_a], b, c;
Count shift;

for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}

for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}

In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)

Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):

for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}

Before unrolling always benchmark. (Modern compilers are pretty smart,
and sometimes will unroll for you. Modern processors with branch
prediction are pretty good about working out when to jump and when not
to. And with speculative execution they sometimes run both paths from a branch... I could go on!)

We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).

After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?

The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.

And now a few minutes later as I read that I understand the reason for
the unrolling better. But my point still stands - benchmark it.

Of course the decision on whether to unroll or not might depend on the
target CPU!

I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.

Hope to send response to send part sometime soon.

You seem to have spotted a pattern in the steps that I didn't - I looked
at the mask values, and decided it wasn't worth working out which of the
8 possible sequences to use - remembering which was too expensive. But
it's also (I think, I haven't coded it) to work out whether you need to
step 2, 4, 8 times the prime. And that would save more memory.

Another thought BTW - whether it's worth storing with a larger modulo -
say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
whether the steps between primes should take that into consideration.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Fri Aug 16 07:48:07 2024

Vir Campestris <[email protected]d> writes:

(replying to two messages at once)

On 10/08/2024 15:07, Tim Rentsch wrote:

I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.

Two tables, absolutely. What does BICBW stand for?

But I Could Be Wrong. I thought that one was fairly well known.

Ah. No wonder I didn't know it. ;)

On 11/08/2024 08:00, Tim Rentsch wrote:

Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.

I started coding that up. Then it occurred to me that there wasn't
going to be much saving by storing a pointer (8 bytes on my machine)
to the correct 8-byte long table. I suppose I could store a byte array
index instead, but I've gone off in other directions.

Right, I was meaning keeping track of the index, not a pointer.

You've picked up my cache management tweak.

It's an interesting idea, although I'm not sure how well it
integrates with what I've done.

I'll read through your code. But it might not be today. We've got the hottest day of the year here, and I don't have aircon in my garden
office. In winter the computer and me are enough to keep it warm, but
right now it's getting warmer by the minute.

My sympathies. I too find it hard to work when it's too hot.

I see you have started looking at my code... I'll take a look.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Fri Aug 16 08:40:37 2024

Vir Campestris <[email protected]d> writes:

I'm going to come back to this slightly earlier version, and add
comments inline.

Okay good deal. I'm taking out some parts that look like they
are no longer relevant.

On 10/08/2024 15:07, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 20/07/2024 15:41, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 02/07/2024 07:20, Tim Rentsch wrote:

[...]

Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair

N/30 , N%30

stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find

(i*30+a) * (j*30+b)

which is

(i*30*j*30) + (i*30*b) + (j*30*a) + a*b

Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by

30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.

30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.

One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.

Does this all make sense?

Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.

What I had considered is an extrapolation from my original code.

I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.

typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;

I've been using uint64_t, not unsigned long. When I started C
unsigned long was 32 bits; I'm never quite sure what you'll get.

I'm coming around to the point of view that size_t is the best
choice for these types. In any case though the point of using a
typedef is so the more meaningful names can be used but still
changed easily where they are defined.

static void remove_multiples( Index, Count, Index );

static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;

bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit ); >> }
}

return r;
}

static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7, >> };

static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 }, >> { 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 }, >> { 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 }, >> { 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 }, >> { 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 }, >> { 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 }, >> { 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 }, >> { 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 }, >> };

static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 }, >> { 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 }, >> { 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 }, >> { 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 }, >> { 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 }, >> { 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 }, >> { 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 }, >> { 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 }, >> };

constexpr on those tables might give the optimiser something else to
play with.

All of these tables are going to disappear in the next revision.

void
remove_multiples( Index i, Count shift_a, Index i_limit ){

So i is (prime/30) and shift_a is the bit in the mask byte -
corresponding to (prime%30)?

Right. The value 'a' is 'mods[shift_a]', and the value 'shift_a' is 'shift_for[a]'. Again though we will be dispensing with shift_for[]
and put in code that can be compiled to give constant values.

Index j, k;
NN a = mods[shift_a], b, c;
Count shift;

for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}

for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}

In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)

Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):

for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}

Before unrolling always benchmark. (Modern compilers are pretty smart,
and sometimes will unroll for you. Modern processors with branch
prediction are pretty good about working out when to jump and when not
to. And with speculative execution they sometimes run both paths from
a branch... I could go on!)

I understand the advice. Keep in mind though that the code I'm
showing (and also true in the next version) is presented as a way of
explaining how I got to where I got. I didn't start with the loop
version and unroll it; everything was unrolled right from the get
go. And there are more revisions to come, even after the next one
(to be shown below).

We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).

After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?

The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.

And now a few minutes later as I read that I understand the reason
for the unrolling better. But my point still stands - benchmark
it.

Of course the decision on whether to unroll or not might depend on
the target CPU!

For this approach to work it's crucial to unroll completely to make
sure all the divides and remainders, and also the mask choices, get
optimized out. I suppose it's possible that some architecture out
there could be faster without all constant folding, but it seems
unlikely, and in any case I made an early decision to write the code
using this approach, which has worked out well.

I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.

Hope to send response to send part sometime soon.

You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was too expensive. But it's also (I think, I haven't coded it) to work out
whether you need to step 2, 4, 8 times the prime. And that would save
more memory.

The possible steps for mod30 are 2 or 4 or 6 times the prime.

Another thought BTW - whether it's worth storing with a larger modulo
- say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
whether the steps between primes should take that into consideration.

My guess is that the possible steps are still just 2 or 4 or 6
(times the prime), but I haven't checked that.

Now here is revised code where the unrolling has been done, and also
uses optimizable mask production:

void
mod30_sieve_x( Index N, NN limit, Index i_limit ){
Index i;

bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Bits cbits = bytes[i];
Bits pbits = cbits ^ -1;
if( pbits & 0001 ) remove_multiples_x( N, i, 1, pbits );
if( pbits & 0002 ) remove_multiples_x( N, i, 7, pbits );
if( pbits & 0004 ) remove_multiples_x( N, i, 11, pbits );
if( pbits & 0010 ) remove_multiples_x( N, i, 13, pbits );
if( pbits & 0020 ) remove_multiples_x( N, i, 17, pbits );
if( pbits & 0040 ) remove_multiples_x( N, i, 19, pbits );
if( pbits & 0100 ) remove_multiples_x( N, i, 23, pbits );
if( pbits & 0200 ) remove_multiples_x( N, i, 29, pbits );
}
}

#define set_product_x( i, a, j, b ) ( \
bytes[ i*j*30 + i*b + j*a + a*b/30 ] |= BIT_FOR( (a*b%30) ) \
)

#define BIT_FOR( m ) ( \
m == 1 ? 0001 : m == 7 ? 0002 : m == 11 ? 0004 : m == 13 ? 0010 : \
m == 17 ? 0020 : m == 19 ? 0040 : m == 23 ? 0100 : m == 29 ? 0200 : \
0 \ )

void
remove_multiples_x( Index N, Index i, NN a, Bits ibits ){
NN p = 30*i + a;
Index j;
Index jn = (N*30/p + 29)/30;

switch( a ){
case 1: set_product_x( i, a, i, 1 );
case 7: set_product_x( i, a, i, 7 );
case 11: set_product_x( i, a, i, 11 );
case 13: set_product_x( i, a, i, 13 );
case 17: set_product_x( i, a, i, 17 );
case 19: set_product_x( i, a, i, 19 );
case 23: set_product_x( i, a, i, 23 );
case 29: set_product_x( i, a, i, 29 );
}

for( j = i+1; j < jn; j++ ){
set_product_x( i, a, j, 1 );
set_product_x( i, a, j, 7 );
set_product_x( i, a, j, 11 );
set_product_x( i, a, j, 13 );
set_product_x( i, a, j, 17 );
set_product_x( i, a, j, 19 );
set_product_x( i, a, j, 23 );
set_product_x( i, a, j, 29 );
}
}

Some comments about the upper bounds of the loops:

First, assume that 'i_limit' has been computed in conjunction with
the determination of the parameter N so that it is exactly right.

Second, the initializing expression for 'jn' is safe but may miss
part of one more value of j. There are different ways to take care
of that, but it's a minor detail so I'm skipping over it for now.
Of course feel free to think about how that could be done if you
want to, and if not then no worries, I expect to say more about that
in a later followup.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Tim Rentsch on Sun Aug 18 19:52:00 2024

Tim Rentsch <[email protected]> writes:

Vir Campestris <[email protected]d> writes:

[...]

You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was
too expensive. But it's also (I think, I haven't coded it) to work
out whether you need to step 2, 4, 8 times the prime. And that
would save more memory.

The possible steps for mod30 are 2 or 4 or 6 times the prime.

Another thought BTW - whether it's worth storing with a larger
modulo - say 2*3*5*7 - not just 2*3*5 is a different decision from
deciding whether the steps between primes should take that into
consideration.

My guess is that the possible steps are still just 2 or 4 or 6
(times the prime), but I haven't checked that.

I guessed wrong. Considered mod 210, there are

15 steps of 2
15 steps of 4
14 steps of 6
2 steps of 8
2 steps of 10

(for a total of 48 residues) due to knocking out multiples of 7.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Mon Aug 19 21:23:22 2024

On 16/08/2024 18:35, Bonita Montero wrote:
<snip>

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

It's not just storage you save, it's also computation.

That program I published up-thread is almost as fast as yours - within
10% of the elapsed time - while only using a single core. The amount of
CPU used is much lower.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Mon Aug 19 21:34:19 2024

On 16/08/2024 18:35, Bonita Montero wrote:

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

BTW I was just checking outputs.

66049
67591
69133
69647
71189
72217
72731
75301
78899
79927
80441
81469
85067
86609
89179
89693
90721
92263
94319
95861
97403
98431
99973

all show in the output for your program, but not mine. I think you'll
find they are products of 257 and the next few primes.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From red floyd@21:1/5 to All on Mon Aug 19 21:08:11 2024

So I'm a little late, but here's my effort to use the modulo 30 trick.

Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x, 64GB RAM

g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s

-- CUT HERE ---

#include <iostream>
#include <chrono>
#include <stdint.h>
#include <vector>
#include <tuple>
#include <string>
#include <iomanip>

using inttype_t = uint64_t;
using settype_t = unsigned char;

class Sieve {
static constexpr inttype_t MOD_VALUE { 30ull };
// 1, 7, 11, 13, 17, 19, 23, 29
static constexpr settype_t masks[] = {
0x00, 0x01, 0x00, 0x00, 0x00, 0x00, // 1
0x00, 0x02, 0x00, 0x00, 0x00, 0x04, // 7, 11
0x00, 0x08, 0x00, 0x00, 0x00, 0x10, // 13, 17
0x00, 0x20, 0x00, 0x00, 0x00, 0x40, // 19, 23
0x00, 0x00, 0x00, 0x00, 0x00, 0x80 // 29
};
inttype_t max_val;
std::vector<settype_t> sieve;
inttype_t nprimes;

using lookup_t = std::tuple<inttype_t, settype_t>;

inline lookup_t lookup(inttype_t val)
{
return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);
}

public:
inline bool is_prime(inttype_t val)
{
const auto [index, mask] = lookup(val);
return static_cast<bool>(sieve[index] & mask);
}

inline void mark_prime(inttype_t val)
{
++nprimes;
const inttype_t incr = val + val;
for (inttype_t j = val + incr; j <= max_val; j += incr)
{
const auto [index, mask] = lookup(j);
sieve[index] &= ~mask;
}
}

inttype_t get_prime_count() { return nprimes; }

Sieve(inttype_t max_)
: max_val(max_)
, sieve((max_ / MOD_VALUE) + 1, ~settype_t())
, nprimes(3)
{
}
};

int main(int argc, char *argv[])
{
std::vector<std::string> args(argv, argv+argc);
constexpr inttype_t default_max = 100ull;
inttype_t max_val = default_max;
if (argc > 1)
{
max_val = std::stoull(args[1]);
if (max_val == 0)
max_val = default_max;
}

Sieve sieve(max_val);

const auto start = std::chrono::steady_clock::now();
for (inttype_t j = 7 ; j <= max_val ; j += 2)
{
if (sieve.is_prime(j))
{
sieve.mark_prime(j);
}
}
const auto finish = std::chrono::steady_clock::now();
const auto diff =
std::chrono::duration_cast<std::chrono::microseconds>(finish -
start);
constexpr uint64_t us_per_sec = 1'000'000;
auto us = diff.count();

std::cout << sieve.get_prime_count()
<< " primes found less than "
<< max_val
<< " in "
<< (us / us_per_sec)
<< "."
<< std::setw(6)
<< std::setfill('0')
<< (us % us_per_sec)
<< "s\n";
return 0;
}
-- CUT HERE ---

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Tue Aug 20 17:55:33 2024

On 20/08/2024 16:24, Bonita Montero wrote:

Am 20.08.2024 um 17:21 schrieb Bonita Montero:

Am 19.08.2024 um 22:23 schrieb Vir Campestris:

On 16/08/2024 18:35, Bonita Montero wrote:
<snip>

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

It's not just storage you save, it's also computation.

That program I published up-thread is almost as fast as yours -
within 10% of the elapsed time - while only using a single core. The
amount of CPU used is much lower.

Andy

On my AMD 7990X all primes up to 2 ^ 32 are calculated with my code
with a 65W-setting of the CPU in 0.168s. The total CPU-time is 3.688s.
With your code all primes up to 2 ^ 32 take nearly exactly four seconds
with a single thread. So the overall computation time with my algorithm
is about seconds less.

And if I run my code with a single core:

    C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 0x100000000 "" 1
    real   1.795s
    user   1.766s
    sys    0.000s
    cylces 8.011.804.500

Thats less than half the computation time.

That's really odd. I too have an AMD - although my AMD Ryzen 5 3400G is
chosen for low power, not speed. Surely it can't be because I use Linux,
not Windows?

I'm leaning towards thinking you may have a point on the mod30 code.
There are more operations on the innermost loop with mod30, although the
loop goes around fewer times. It also means that the store is forced to
be byte wide - I find that my original odd-only code is significantly
faster - about 30% - when the store is 64 bit wide rather than byte.

Speed of writing the primes out? I don't care. The only reason I put it
in there was to allow me to check the output. If I did care I'd be
putting a more efficient enumeration function too.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to red floyd on Tue Aug 20 21:14:30 2024

On 20/08/2024 05:08, red floyd wrote:

So I'm a little late, but here's my effort to use the modulo 30 trick.

Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x, 64GB RAM

g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s

It's slow.

Looking quickly at the (remarkably small) code I see line 30:

return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);

That mod and div are being called for every single time you mark a prime
in the bitmap.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Bonita Montero on Tue Aug 20 20:50:53 2024

Bonita Montero <[email protected]> writes:

Am 19.08.2024 um 22:34 schrieb Vir Campestris:

I wrote a small program

Most people would use the standard command line
tools to do this (e.g. sort, uniq and diff).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Scott Lurndal on Thu Aug 22 17:30:40 2024

On 20/08/2024 21:50, Scott Lurndal wrote:

Bonita Montero <[email protected]> writes:

Am 19.08.2024 um 22:34 schrieb Vir Campestris:

I wrote a small program

Most people would use the standard command line
tools to do this (e.g. sort, uniq and diff).

She's on Windows :P

I just tried again to check my sanity. And decided neither of us are
crazy. It's sensitive to the number of primes you ask for.

$ ./Bonita 100 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 1000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 10000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 100000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 1000000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 10000000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 100000000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 1000000000 Bonita.txt && grep -w 66049 Bonita.txt
$ ./Bonita 10000000000 Bonita.txt && grep -w 66049 Bonita.txt
66049
$ ./Bonita 100000000000 Bonita.txt && grep -w 66049 Bonita.txt
66049
$

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Thu Aug 22 21:47:14 2024

On 22/08/2024 17:38, Bonita Montero wrote:

I did "findstr /x "66049" primes.txt" on my output and it didn't
find that number with 1E10 as its maximum.

Eppur si muove.

I have no idea what is going on here. I even checked that I get the same
result using clang instead of g++.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Thu Aug 22 21:56:32 2024

On 16/08/2024 18:35, Bonita Montero wrote:

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

I've been running some experiments.

Skipping evens only is nice and simple on computation; that's good.

Skipping something else requires table lookups. I've knocked up some
code that uses the correct table for skipping primes (but not for mask
bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

Only the last one is faster than just skipping evens. And it's not much
faster at the price of much more complexity, and a lot more store use.

My hack will only work with a prime which is 1 modulo(2*3*5*7*11) and it
would take a lot more code to make it work for all the other modulo
values. And even more if I was to make it work out the bitmaps.

I think we've done this to death now (and still with no C++ content!).

I'm happy with my algorithm:
- Work out a bitmask for some small primes, only as far as is needed to
get repeats (the repeat length is the product of the primes)
- Work out bitmasks for the next half a dozen. Each of those will have a
repeat the length of the prime.
- Combine those together into the output bitmap. That requires fewer
operations than doing each one into the output bitmap
- Process the larger primes. Do the output bitmap in blocks for cache friendliness
Note that the larger the prime the less time it takes.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From red floyd@21:1/5 to Vir Campestris on Thu Aug 22 17:00:42 2024

On 8/22/2024 1:56 PM, Vir Campestris wrote:

On 16/08/2024 18:35, Bonita Montero wrote:

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

I've been running some experiments.

Skipping evens only is nice and simple on computation; that's good.

Skipping something else requires table lookups. I've knocked up some
code that uses the correct table for skipping primes (but not for mask
bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

You can skip multiples of three, by starting with 7, and then
alternately adding 4 then 2.

e.g.

incr = 2;
for (val = 7 ; val < MAX_PRIME_TO_CHECK ; val += incr)
{
check_for_prime(val);
incr = 6 - incr;
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to red floyd on Mon Aug 26 08:31:31 2024

red floyd <[email protected]d> writes:

So I'm a little late, but here's my effort to use the modulo 30
trick.

Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x,
64GB RAM

g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s

I appreciate both the effort and your posting of results.

[..program..]

There are several ways that the program is doing noticeably more
work than it needs to. I haven't done any measurements but I
believe this extra work accounts for the biggest part of its
relative slowness. Here is a short simple program to illustrate
some ways that your program could be sped up.

(The short simple program is the rest of this posting.)

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long Index, Count, Bits, NN, N235;

static const unsigned char mods[8] = { 1, 7, 11, 13, 17, 19, 23, 29, }; static unsigned char bytes[ 1ULL * 50 * 1000 * 1000 * 1000 ];

static void mod30_sieve( NN );
static NN product_NN( N235, N235 );
static _Bool is_marked_composite( N235 );
static void mark_composite( NN );
static Count unmarked_less_than( NN );

int
main( int argc, char *argv[] ){
NN ceiling = argc > 1 ? strtoull( argv[1], 0, 10 ) : 1000000000;

setbuf( stdout, 0 );
if( ceiling >= 30 * sizeof bytes ){
printf( " urk.. limit must be less than %zu\n", 30 * sizeof bytes );
exit( 0 );
}

printf( " determining primes less than %llu ...\n", ceiling );
mod30_sieve( ceiling );
printf( " ... found %llu primes.\n", unmarked_less_than( ceiling ) );

return 0;
}

void
mod30_sieve( NN ceiling ){
N235 i, j;
NN ijNN;

for( bytes[0] = 1, i = 0; product_NN( i, i ) < ceiling; i++ ){
if( is_marked_composite( i ) ) continue;

for( j = i; ijNN = product_NN( i, j ), ijNN < ceiling; j++ ){
mark_composite( ijNN );
}
}
}

NN
product_NN( N235 x, N235 y ){
NN xnn = (x>>3)*30 + mods[x&7];
NN ynn = (y>>3)*30 + mods[y&7];

return xnn * ynn;
}

_Bool
is_marked_composite( N235 t ){
return bytes[t>>3] & 1<<(t&7);
}

void
mark_composite( NN nn ){
static const unsigned char masks[] = {
[1]= 1, [7]= 2, [11]= 4, [13]= 8, [17]= 16, [19]= 32, [23]=64, [29]=128,
};
bytes[ nn/30 ] |= masks[ nn%30 ];
}

Count
unmarked_less_than( NN ceiling ){
Index i = ceiling/30;
NN extra = ceiling - i*30;
Count r = (ceiling > 2) + (ceiling > 3) + (ceiling > 5);

for( Index j = 0; j < 8 && mods[j] < extra; j++ ){
r += bytes[i]>>j &1 ^1;
}

while( i && i-- ){
for( Bits b = bytes[i] ^0xFF; b; b >>= 1 ) r += b&1;
}

return r;
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 09:35:55 2024

Vir Campestris <[email protected]d> writes:

On 20/08/2024 05:08, red floyd wrote:

So I'm a little late, but here's my effort to use the modulo
30 trick.

Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5
5600x, 64GB RAM

g++ -O3 -std=c++17
5761455 primess less than 100 million in 0.182269s
50847534 primes less than 1 billion in 2.841167s
455052511 primes less than 10billion in 53.009133s

It's slow.

Looking quickly at the (remarkably small) code I see line 30:

return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);

That mod and div are being called for every single time you
mark a prime in the bitmap.

That is a cost but I'm pretty sure it's not the most important
aspect.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to red floyd on Mon Aug 26 10:59:33 2024

red floyd <[email protected]d> writes:

On 8/22/2024 1:56 PM, Vir Campestris wrote:

On 16/08/2024 18:35, Bonita Montero wrote:

But basically I don't think it is a good idea to skip numbers exept
multiples of two. With the three you save a sixth of memory, with
the five you save a 15-th and at the end you get about 20% less
storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
for a lot of computation. That's the point where I dropped this
idea and I think this extra computation is higher than the time
for the saved memory loads.

I've been running some experiments.

Skipping evens only is nice and simple on computation; that's good.

Skipping something else requires table lookups. I've knocked up some
code that uses the correct table for skipping primes (but not for
mask bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

You can skip multiples of three, by starting with 7, and then
alternately adding 4 then 2.

e.g.

incr = 2;
for (val = 7 ; val < MAX_PRIME_TO_CHECK ; val += incr)
{
check_for_prime(val);
incr = 6 - incr;
}

As a matter of style I would normally fold the initialization and
update of 'incr' into the first and third expressions of the for()
loop control.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 12:08:11 2024

Vir Campestris <[email protected]d> writes:

On 20/08/2024 16:24, Bonita Montero wrote:

Am 20.08.2024 um 17:21 schrieb Bonita Montero:

Am 19.08.2024 um 22:23 schrieb Vir Campestris:

On 16/08/2024 18:35, Bonita Montero wrote:
<snip>

But basically I don't think it is a good idea to skip numbers
exept multiples of two. [...] I think this extra computation
is higher than the time for the saved memory loads.

[...]

I'm leaning towards thinking you may have a point on the mod30
code. There are more operations on the innermost loop with mod30,
although the loop goes around fewer times. It also means that the
store is forced to be byte wide - I find that my original odd-only
code is significantly faster - about 30% - when the store is 64
bit wide rather than byte. [...]

One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.
Even if an odds-only representation runs faster for more limited
sizes (and I'm not yet convinced that it does), it doesn't matter
if it's faster, because it doesn't do the job. I was computing all
32-bit primes 20 years ago, back when 32-bit machines were a thing.
The challenge for today's world is correspondingly higher.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 12:47:56 2024

Vir Campestris <[email protected]d> writes:

On 16/08/2024 18:35, Bonita Montero wrote:

But basically I don't think it is a good idea to skip numbers
exept multiples of two. [...]

I've been running some experiments.

Skipping evens only is nice and simple on computation; that's
good.

Skipping something else requires table lookups. [...]

It doesn't have to. One point of the code I've been posting is
that table lookups, including mask productions, are eliminated;
everything gets optimized away so what's left is all compile-time
constants.

I'm happy with my algorithm:
- Work out a bitmask for some small primes, only as far as is
needed to get repeats (the repeat length is the product of the
primes)
- Work out bitmasks for the next half a dozen. Each of those will
have a repeat the length of the prime.
- Combine those together into the output bitmap. That requires
fewer operations than doing each one into the output bitmap
- Process the larger primes. Do the output bitmap in blocks for
cache friendliness

My experience:

* Special casing primes between 7 and 29 helps (and because it's
the first step the results can initialize the map);

* Special purpose code where primes are taken in pairs, for the
set of primes more than 29 and less than 360, with each pair
being folded into the initialized map, helps (incidentally,
that is 31 pairs, or 62 primes total);

* The combination of those two steps gives a saving of about 35%;

* Doing striping (the term I use for working in cache-sized
blocks) seems not to help much, so I don't use striping except
incidentally in the prime-pairs step;

* Forcing the function for the main second-level loop to be
inlined saves another 15% or so;

* I haven't tried to measure how the speed-up tricks scale with
overall number of primes to calculate. My guess is that they
are close to linear but I don't have any actual measurements.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Bonita Montero on Sun Sep 1 21:23:04 2024

On 27/08/2024 05:09, Bonita Montero wrote:

Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.

You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.

So you compute all the primes up to the square root of a larger number.

I was cleaning up my odds-only code with a view to posting it, when I
had an idea.

Take 101 (because I can write things more easily).

I want to mark

10201,10302,10403,10504,10605,10706...

With odds only that becomes
10201,10403,10605,10807,11009,11211...

and as the loop increment is a constant it is nice and fast. With mod30
it isn't a constant, and it hurts to work out the increment.

Except...

If I mark 10201,13231,16261,19291,22321
with an increment of 30*p,

then go back to 10201, add 2p, then increment that by 30
10201 13231 16261
and do that for the other 6 of the 8, my inner loop has a constant
increment, and my code is compatible with the mod30 store.

That might be quicker. I also have some thoughts over storing the prime
as two parts. It's 30m+r, where m is modulus and r remainder. r maps one-for-one into the byte mask, and m is the index.

I'm going to waste lots more time on this I can see!

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Sun Sep 1 20:40:59 2024

Vir Campestris <[email protected]d> writes:

On 27/08/2024 05:09, Bonita Montero wrote:

Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.

You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.

This is a brainless statement (and incidentally illustrates why I
was motivated not to read postings from BM). In fact no primes need
be pre-calculated to determine whether any given number is prime.
The point of using a sieve is the sieve method is much faster than
not using one. For example, suppose we want to determine all primes
less than a trillion. Taking the approach of pre-computing only
those primes less than a million (the square root) and then testing
numbers individually takes more than 100 times as many operations as
using a sieve. Furthermore the operations used are more expensive
for the non-sieve approach - simply setting a bit in the case of a
sieve, versus computing a remainder in the non-sieve case.

So you compute all the primes up to the square root of a larger number.

I was cleaning up my odds-only code with a view to posting it, when I
had an idea.

Take 101 (because I can write things more easily).

I want to mark

10201,10302,10403,10504,10605,10706...

With odds only that becomes
10201,10403,10605,10807,11009,11211...

Right. Marking of multiples needs to start only at p*p, not
at p+2.

and as the loop increment is a constant it is nice and fast. With
mod30 it isn't a constant, and it hurts to work out the increment.

Except...

If I mark 10201,13231,16261,19291,22321
with an increment of 30*p,

then go back to 10201, add 2p, then increment that by 30
10201 13231 16261
and do that for the other 6 of the 8, my inner loop has a constant
increment, and my code is compatible with the mod30 store.

Yes, reordering the loops this way might very well make things
work better. It also has the property, I believe, that the
bit to set is the same in each of the eight outer loops, and
so can be computed only once for each of the eight outer loops.
Very good!

That might be quicker. I also have some thoughts over storing the
prime as two parts. It's 30m+r, where m is modulus and r remainder. r
maps one-for-one into the byte mask, and m is the index.

This is what I've been saying all along!

I'm going to waste lots more time on this I can see!

I'm interested to see what you come up with.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Tim Rentsch on Tue Sep 3 17:45:32 2024

On 02/09/2024 04:40, Tim Rentsch wrote:

Vir Campestris <[email protected]d> writes:

On 27/08/2024 05:09, Bonita Montero wrote:

Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

One motivation for choosing a mod30 representation is being able to
compute more primes -- almost twice as many. Any machine with 64GB
of memory should be able to compute primes up to 1.5 trillion.

You don't need to hold all primes in memory. Just calculate the primes
up to the square root and then you can calculate every segment up to
the maximum from that.

This is a brainless statement (and incidentally illustrates why I
was motivated not to read postings from BM). In fact no primes need
be pre-calculated to determine whether any given number is prime.
The point of using a sieve is the sieve method is much faster than
not using one. For example, suppose we want to determine all primes
less than a trillion. Taking the approach of pre-computing only
those primes less than a million (the square root) and then testing
numbers individually takes more than 100 times as many operations as
using a sieve. Furthermore the operations used are more expensive
for the non-sieve approach - simply setting a bit in the case of a
sieve, versus computing a remainder in the non-sieve case.

Umm. I assumed you'd take the primes you had, then sieve them into
several blocks to exceed your memory size.

So you compute all the primes up to the square root of a larger number.

I was cleaning up my odds-only code with a view to posting it, when I
had an idea.

Take 101 (because I can write things more easily).

I want to mark

10201,10302,10403,10504,10605,10706...

With odds only that becomes
10201,10403,10605,10807,11009,11211...

Right. Marking of multiples needs to start only at p*p, not
at p+2.

and as the loop increment is a constant it is nice and fast. With
mod30 it isn't a constant, and it hurts to work out the increment.

Except...

If I mark 10201,13231,16261,19291,22321
with an increment of 30*p,

then go back to 10201, add 2p, then increment that by 30
10201 13231 16261
and do that for the other 6 of the 8, my inner loop has a constant
increment, and my code is compatible with the mod30 store.

Yes, reordering the loops this way might very well make things
work better. It also has the property, I believe, that the
bit to set is the same in each of the eight outer loops, and
so can be computed only once for each of the eight outer loops.
Very good!

It ought to be fast. Perhaps I'll have something one day. But not before
I go on holiday, so it will be a few weeks.

That might be quicker. I also have some thoughts over storing the
prime as two parts. It's 30m+r, where m is modulus and r remainder. r
maps one-for-one into the byte mask, and m is the index.

This is what I've been saying all along!

Ah, sorry.

I'm going to waste lots more time on this I can see!

I'm interested to see what you come up with.

The code that follows is *NOT* doing anything with mod30, it's just a
cleaned up, not using include files, version of my odds only program.

The code that writes the primes out is not optimised. Not even slightly.
And it's way slower than Bonita's version.

But the code that does the calculation is a lot faster:

$ time ./Bonita 100000000000 && time ./usenet 100000000000

real 1m46.452s
user 2m53.666s
sys 0m32.260s

781250001,40.4744
594,0.0709076
598,2.60332
620,2.60697
628,20.4567
620,20.4572
628,40.4744
4294967295,40.4744

real 0m40.723s
user 0m39.104s
sys 0m1.604s

Andy
--
/*
Programme to calculate primes using the Sieve or Eratosthenes
Copyright (C) 2024 the person known as Vir Campestris

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
*/

#include <algorithm>
#include <cassert>
#include <chrono>
#include <climits>
#include <cmath>
#include <csignal>
#include <cstring>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <list>
#include <memory>
#include <new>
#include <thread>
#include <vector>

// If these trigger you have a really odd architecture. static_assert(sizeof(uint8_t) == 1);
static_assert(CHAR_BIT == 8);

// The type of a prime number.
// size_t isn't big enough on a 32-bit machine with 2GB memory allocated
and 16 primes per byte.
typedef uint64_t PRIME;

// This is the type of the word I operate in. The code will run OK with
uin8_t or uint32_t,
// but *on my computer* it's slower.
typedef uint64_t StoreType;

// tuneable constants
constexpr size_t tinyCount = 5;
constexpr size_t smallCount = 10;
constexpr size_t bigCache = 16384;

// this is a convenience class for timing.
class Stopwatch {
typedef std::chrono::steady_clock CLOCK;
typedef std::pair<unsigned, std::chrono::time_point<CLOCK> > LAP;
typedef std::vector<LAP> LAPS;
LAPS mLaps;
public:
typedef std::vector<std::pair<unsigned, double>> LAPTIMES;
Stopwatch() {}

// record the start of an interval
void start()
{
mLaps.clear();
mLaps.emplace_back(std::make_pair(0, CLOCK::now()));
}

void lap(unsigned label)
{
mLaps.emplace_back(std::make_pair(label, CLOCK::now()));
}

// record the end of an interval
void stop()
{
mLaps.emplace_back(std::make_pair(-1, CLOCK::now()));
}

// report the difference between the last start and stop
double delta() const
{
assert(mLaps.size() >= 2);
assert(mLaps.front().first == 0);
assert(mLaps.back().first == -1);
return std::chrono::duration_cast<std::chrono::nanoseconds>(mLaps.back().second
- mLaps.front().second).count() / (double)1e9;
}

LAPTIMES const laps() const
{
LAPTIMES ret;
assert(mLaps.size() >= 2);
ret.reserve(mLaps.size() - 1);
auto lap = mLaps.begin();
auto start = lap->second;

for(++lap; lap != mLaps.end(); ++lap)
{
ret.emplace_back(
std::make_pair(
lap->first,

std::chrono::duration_cast<std::chrono::nanoseconds>(lap->second - start).count() / (double)1e9));

}
return ret;
}
};

// Internal class used to store the mask data for one or more
// primes rather than all primes. This has an initial block,
// followed by a block which contains data which is repeated
// for all higher values.
// Example: StoreType== uint8_t, prime == 3, the mask should be
// 90 meaning 1,3,5,7,11,13 are prime but not 9, 15
// 24,49,92 covers odds 17-63
// 24,49,92 covers 65-111, with an identical mask
// As the mask is identical we only need to store the
// non-repeat, plus one copy of the repeated block and data to
// show where it starts and ends. Note that for big primes the
// initial block may be more than one StoreType.
// As there are no common factors between any prime and the
// number of bits in StoreType the repeat is the size of the
// prime.
// Note also that a masker for two primes will have an
// initial block which is as big as the larger of the sources,
// and a repeat which is the product of the sources.
class Masker final
{
// The length of the initial block.
size_t initSize;

// The size of the repeated part. The entire store size is the
sum of these two.
size_t repSize;

// Buffer management local class. I started using std::vector
for the store,
// but I found it was slow for large sizes as it insists on
calling the constructor
// for each item in turn. And there may be billions of them.
class Buffer final
{
StoreType* mpBuffer; // base of the data referred to
size_t mSize; // size of the buffer in
StoreType-s.

public:
// Normal constructor
Buffer(size_t size = 0) : mpBuffer(nullptr), mSize(size)
{
if (size > 0)
{
mpBuffer = static_cast<StoreType*>(std::malloc(size * sizeof(StoreType)));

if (!mpBuffer)
throw std::bad_alloc();
}
}

Buffer(Buffer&) = delete;

// Move constructor
Buffer(Buffer&& source)
: mpBuffer(source.mpBuffer), mSize(source.mSize)
{
source.mSize = 0;
source.mpBuffer = nullptr;
};

Buffer& operator=(Buffer&) = delete;

// Move assignment
Buffer& operator=(Buffer&& source)
{
if (mpBuffer)
std::free(mpBuffer);
mpBuffer = source.mpBuffer;
mSize = source.mSize;
source.mSize = 0;
source.mpBuffer = nullptr;
return *this;
}

void resize(size_t newSize)
{
mpBuffer = static_cast<StoreType*>(std::realloc(mpBuffer, newSize *
sizeof(StoreType)));
if (!mpBuffer)
throw std::bad_alloc();
mSize = newSize;
}

// Get the data start
inline StoreType* operator*() const
{
return mpBuffer;
}

// Get the owned size
inline size_t const & size() const
{
return mSize;
}

// clean up.
~Buffer()
{
if (mpBuffer)
std::free(mpBuffer);
}
} mBuffer;

// Raw pointer to the store for speed.
StoreType * mStorePtr;

// shift from the prime value to the index in mStore
static constexpr size_t wordshift = sizeof(StoreType) == 1 ? 4
: sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
static_assert(wordshift);

// Mask for the bits in the store that select a prime.
static constexpr size_t wordmask = (StoreType)-1 >>
(sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

// convert a prime to the index of a word in the store
static inline size_t index(PRIME const & prime)
{
return prime >> wordshift;
}

// get a mask representing the bit for a prime in a word
static inline StoreType mask(PRIME const & prime)
{
return (StoreType)1 << ((prime >> 1) & wordmask);
}

public:
// This class allows us to iterate through a Masker
indefinitely, retrieving words,
// automatically wrapping back to the beginning of the repat
part when the end is met.
class MaskIterator
{
StoreType const * index, *repStart, *repEnd;
public:
MaskIterator(Masker const * origin)
: index(origin->mStorePtr)
, repStart(index + origin->initSize)
, repEnd(repStart + origin->repSize)
{}

inline StoreType const & operator*() const
{
return *index;
}

inline StoreType next() // returns value with iterator post-increment
{
auto & retval = *index;
if (++index >= repEnd)
index = repStart;
return retval;
}
};

// Default constructor. Makes a dummy mask for overwriting.
Masker()
: initSize(0)
, repSize(1)
, mBuffer(1)
, mStorePtr(*mBuffer)
{
*mStorePtr = 0; // nothing marked unprime
}

// Move constructor (destroys source)
Masker(Masker&& source)
: initSize(source.initSize)
, repSize(source.repSize)
, mBuffer(std::move(source.mBuffer))
, mStorePtr(*mBuffer)
{
}

// Copy constructor (preserves source)
Masker(Masker& source)
: initSize(source.initSize)
, repSize(source.repSize)
, mBuffer(initSize + repSize)
, mStorePtr(*mBuffer)
{
memcpy(*mBuffer, *source.mBuffer, mBuffer.size() * sizeof(StoreType));
}

// move assignment (destroys source)
Masker& operator=(Masker&& source)
{
initSize = source.initSize;
repSize = source.repSize;
mStorePtr = source.mStorePtr;
mBuffer = std::move(source.mBuffer);
return *this;
}

// Construct for a single prime
Masker(PRIME prime)
: initSize(this->index(prime)+1)
, repSize(prime)
, mBuffer(initSize + prime)
, mStorePtr(*mBuffer)
{
// Pre-zero the buffer, because for bigger primes we
don't write every word.
// This is fast enough not to care about excess writes.
memset(mStorePtr, 0, mBuffer.size() * sizeof(StoreType));

// The max prime we'll actually store.
// *2 because we don't store evens.
PRIME const maxPrime = (initSize + repSize) * CHAR_BIT
* sizeof(StoreType) * 2;

// Filter all the bits. Note that although we
// don't strictly need to filter from prime*3,
// but only from prime**2, doing from *3 makes
// the init block smaller.
PRIME const prime2 = prime * 2;

for (PRIME value = prime * 3; value < maxPrime; value
+= prime2)
{
mStorePtr[this->index(value)] |= this->mask(value);
}
}

// Construct from two others, making a mask that excludes all
numbers
// marked not prime in either of them. The output init block
// will be the largest from either of the inputs; the repeat
block size will be the
// product of the input repeat sizes.
// The output could get quite large - 3.7E12 for all primes up
to 37
Masker(const Masker& left, const Masker& right, size_t
storeLimit = 0)
: initSize(std::max(left.initSize, right.initSize))
, repSize (left.repSize * right.repSize)
, mBuffer()
{
if (storeLimit)
repSize = std::min(initSize + repSize,
storeLimit) - initSize;

auto storeSize = initSize + repSize;

// Actually construct the store with the desired size
mBuffer = storeSize;
mStorePtr = *mBuffer;

// get iterators to the two inputs. These automatically
wrap
// when their repsize is reached.
auto li = left.begin();
auto ri = right.begin();

for (size_t word = 0; word < storeSize; ++word)
{
mStorePtr[word] = li.next() | ri.next();
}
}

// Construct from several others, making a mask that excludes
all numbers
// marked not prime in any of them. The output init block
// will be the largest from any of the inputs; the repeat size
will be the
// product of all the input repeat sizes.
// The output could get quite large - 3.7E12 for all primes up
to 37
// the type iterator should be an iterator into a collection of Masker-s.
template <typename iterator> Masker(const iterator& begin,
const iterator& end, size_t storeLimit = 0)
: initSize(0)
, repSize(1)
, mBuffer() // empty for now
{
// Iterate over the inputs. We will
// * Determine the maximum init size
// * Determine the product of all the repeat sizes.
// * Record the number of primes we represent.
size_t nInputs = std::distance(begin, end);
std::vector<MaskIterator> iterators;
iterators.reserve(nInputs);

for (auto i = begin; i != end; ++i)
{
initSize = std::max(initSize, i->initSize);
repSize *= i->repSize;
iterators.push_back(i->begin());
}
if (storeLimit)
repSize = std::min(initSize + repSize,
storeLimit) - initSize;
auto storeSize = initSize + repSize;
// Actually construct the store with the desired size
mBuffer = storeSize;
mStorePtr = *mBuffer;

// take the last one off (most efficient to remove)
// and use it as the initial mask value
auto last = iterators.back();
iterators.pop_back();
for (auto word = 0; word < storeSize; ++word)
{
StoreType mask = last.next();
for(auto &i: iterators)
{
mask |= i.next();
}
mStorePtr[word] = mask;
}
}

template <typename iterator> Masker& andequals(size_t
cachesize, iterator begin, iterator end)
{
static constexpr size_t wordshift = sizeof(StoreType)
== 1 ? 4 : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

struct value
{
PRIME step; // this is the
prime. The step should be 2P, but only half the bits are stored
PRIME halfMultiple; // this is the
current multiple, _without_ the last bit
value(PRIME prime): step(prime), halfMultiple((prime*prime) >> 1) {}
bool operator<(PRIME halfLimit) const
{
return halfMultiple < halfLimit;
}
void next(StoreType * store)
{
static constexpr size_t wsm1 =
wordshift - 1;
auto index = halfMultiple >> wsm1;
auto mask = (StoreType)1 <<
(halfMultiple & wordmask);
*(store + index) |= mask;
halfMultiple += step;
}
};
std::vector<value> values;
values.reserve(std::distance(begin, end));
for (auto prime = begin; prime != end; ++prime)
{
auto p = *prime;
values.emplace_back(p);
}

size_t limit = 0;
do
{
limit = std::min(limit + cachesize, initSize +
repSize + 1);
PRIME halfMaxPrime = (limit * 16 *
sizeof(StoreType)) / 2 + 1;
for (auto & value: values)
{
while (value < halfMaxPrime)
value.next(mStorePtr);
}
}
while (limit < initSize + repSize);

return *this;
}

// duplicates the data up to the amount needed for a prime newSize
void resize(PRIME newSize)
{
size_t sizeWords = this->index(newSize) + 1;
assert(sizeWords > (initSize + repSize)); // not
allowed to shrink!
mBuffer.resize(sizeWords);
mStorePtr = *mBuffer;

auto copySource = mStorePtr + initSize;
do
{
auto copySize = std::min(sizeWords - repSize - initSize, repSize);
auto dest = copySource + repSize;
memcpy(dest, copySource, copySize * sizeof(StoreType));
repSize += copySize;
} while ((initSize + repSize) < sizeWords);
}

// returns true if this mask thinks value is prime.
bool get(PRIME value) const
{
assert(value < sizeof(StoreType) * CHAR_BIT * 2 * mBuffer.size());
auto ret =
(value <= 3)
|| ((value & 1) &&
(mStorePtr[this->index(value)] & this->mask(value)) == 0);

return ret;
}

// Get the beginning of the bitmap.
// Incrementing this iterator can continue indefinitely,
regardless of the bitmap size.
MaskIterator begin() const
{
return MaskIterator(this);
}

size_t repsize() const
{
return repSize;
}

size_t size() const
{
return initSize + repSize;
}

// prints a collection to stderr
void dump(size_t limit = 0) const
{
std::cerr << std::dec << repSize;
std::cerr << std::hex;

if (limit == 0) limit = initSize + repSize;

auto iter = begin();
for (auto i = 0; i < limit; ++i)
{
// cast prevents attempting to print uint8_t as
a char
std::cerr << ',' << std::setfill('0') << std::setw(sizeof(StoreType) * 2) << uint64_t(mStorePtr[i]);
}
std::cerr << std::endl;
std::cerr << std::dec << repSize;

for (auto p = 1; p < limit * 16 * sizeof(StoreType); ++p)
{
if (get(p))
{
std::cerr << ',' << p;
}
}
std::cerr << std::endl;
}
};

int main(int argc, char**argv)
{
// find out if the user asked for a different number of primes
PRIME PrimeCount = 1e9;
if (argc >= 2)
{
std::istringstream arg1(argv[1]);
std::string crud;
arg1 >> PrimeCount >> crud;
if (!crud.empty() || PrimeCount < 3)
{
double isitfloat;
arg1.seekg(0, arg1.beg);
crud.clear();
arg1 >> isitfloat >> crud;
if (!crud.empty() || isitfloat < 3)
{
std::cerr << "Invalid first argument
\"" << argv[1] << "\" Should be decimal number of primes required\n";
exit(42);
}
else PrimeCount = isitfloat;
}
}

std::string outName;
if (argc >= 3)
{
outName = argv[2];
}

Stopwatch s; s.start();

Masker primes;
PRIME nextPrime;
{
nextPrime = 3;
Masker tiny(nextPrime);
std::vector<Masker> smallMasks;
smallMasks.emplace_back(tiny);

// Make a mask for the really small primes, 3 5 7 11
size_t count = 1; // allows for the 3
for ( ; count < tinyCount; ++count)
{
do
{
nextPrime += 2;
} while (!tiny.get(nextPrime));
tiny = Masker(tiny, nextPrime);
}

// that masker for three will be correct up until 5
squared,
// the first non-prime whose smallest root is not 2 or 3
assert (nextPrime < 25 && "If this triggers tinyCount
is too big and the code will give wrong results");

// this limit is too small, but is easy to calculate
and big enough
auto limit = nextPrime*nextPrime;

smallMasks.clear();
for (; count < smallCount; ++count)
{
do
{
nextPrime += 2;
} while (!tiny.get(nextPrime));
smallMasks.emplace_back(nextPrime);
}
assert(nextPrime <= limit && "If this triggers
smallCount is too big and the code will give wrong results");

// Make a single masker for all the small primes. Don't
make it bigger than needed.
auto small = Masker(smallMasks.begin(),
smallMasks.end(), PrimeCount / (2 * CHAR_BIT * sizeof(StoreType)) + 1);
s.lap(__LINE__);

// This is the first step that takes an appreciable
time. 2.5s for primes up to 1e11 on my machine.
primes = Masker(tiny, small, PrimeCount / (2 * CHAR_BIT
* sizeof(StoreType)) + 1);
s.lap(__LINE__);
}

// if the buffer isn't big enough yet expand it by duplicating
early entries.
if (primes.size() < PrimeCount / (2 * CHAR_BIT *
sizeof(StoreType)) + 1)
{
primes.resize(PrimeCount);
s.lap(__LINE__);
}

do
{
std::vector<PRIME> quickies;
PRIME quickMaxPrime = std::min(nextPrime*nextPrime, PRIME(sqrt(PrimeCount) + 1));
do
{
do
{
nextPrime += 2;
} while (!primes.get(nextPrime));
quickies.emplace_back(nextPrime);
} while (nextPrime < quickMaxPrime);
s.lap(__LINE__);

// this function adds all the quickies into the mask,
but does all of them in
// one area of memory before moving onto the next. The
area of memory,
// and the list of primes, both fit into the L1 cache.
// You may need to adjust bigCache for best performance.
// This step takes a while - two lots of 20s for qe11
primes.
primes.andequals(bigCache, quickies.begin(),
quickies.end());
s.lap(__LINE__);
} while (nextPrime*nextPrime < PrimeCount);

s.stop();
std::cerr << primes.size() << "," << s.laps().back().second << std::endl;
for (auto l: s.laps()) std::cerr << l.first << "," << l.second
<< std::endl;

if (outName.length())
{
std::ofstream outfile(outName);
outfile << "2\n";
for (PRIME prime = 3; prime < PrimeCount; prime += 2)
{
if (primes.get(prime))
{
outfile << prime << '\n';
}
}
}

return 0;
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andrey Tarasevich@21:1/5 to red floyd on Sat Sep 28 15:21:29 2024

On 08/22/24 5:00 PM, red floyd wrote:

You can skip multiples of three, by starting with 7, and then
alternately adding 4 then 2.

... by starting with 5 and then alternately adding 2 then 4, if you want
to be follow the general idea of wheel factorization.

Or by starting with 7 and then adding cyclically { 4, 2, 4, 2, 4, 6, 2, 6 }.

And so forth...

--
Best regards,
Andrey

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Keith Thompson on Wed Oct 2 11:44:50 2024

On 28/09/2024 21:49, Keith Thompson wrote:

Tim Rentsch <[email protected]> writes:

Vir Campestris <[email protected]d> writes:

[...]

I tried to post a short followup, but I must have done something
wrong because nothing went out. Long story short, I've been
pulled away from looking at this further due to other tasks
demanding my attention, and also because the Sieve of Aiken
looks more promising (having been reminded recently of
primegen by Daniel Bernstein).

Typo: Sieve of Atkin.

https://en.wikipedia.org/wiki/Sieve_of_Atkin

In any case, thank you for the back and forth, it's been fun.

Well, the holiday was fun (the Zeppelin museum on the Bodensee was
especially interesting for geeks) but my wife picked up what we think
was COVID in one of the hotels so she was a bit **** towards the end.

Then she gave it to me :(

However...

I now have a working version storing all primes as (prime/30
mask(prime%30) and tuned up.

The disappointing thing is that it's not until you get to seriously big
numbers - something in the 1e9 range - that this is faster than the code
I posted before. And even then it's not _much_ faster. About 20% at 1e11.

Over in comp.lang.c this came up too, and there's a link there to the
Sieve of Atkin

<https://en.wikipedia.org/wiki/Sieve_of_Atkin>

and to a program

<http://cr.yp.to/primegen.html>

which implements it.

I'm pleased to be able to report that that program, written for a 1998
CPU, is slower than mine on my modern CPU.

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to All on Sat Sep 28 03:46:21 2024

Vir Campestris <[email protected]d> writes:

[...]

I tried to post a short followup, but I must have done something
wrong because nothing went out. Long story short, I've been
pulled away from looking at this further due to other tasks
demanding my attention, and also because the Sieve of Aiken
looks more promising (having been reminded recently of
primegen by Daniel Bernstein).

In any case, thank you for the back and forth, it's been fun.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Mon Oct 7 08:41:21 2024

Vir Campestris <[email protected]d> writes:

On 28/09/2024 21:49, Keith Thompson wrote:

Tim Rentsch <[email protected]> writes:

Vir Campestris <[email protected]d> writes:

[...]

I tried to post a short followup, but I must have done something
wrong because nothing went out. Long story short, I've been
pulled away from looking at this further due to other tasks
demanding my attention, and also because the Sieve of Aiken
looks more promising (having been reminded recently of
primegen by Daniel Bernstein).

Typo: Sieve of Atkin.

https://en.wikipedia.org/wiki/Sieve_of_Atkin

In any case, thank you for the back and forth, it's been fun.

[...]

I now have a working version storing all primes as (prime/30
mask(prime%30) and tuned up.

The disappointing thing is that it's not until you get to seriously
big numbers - something in the 1e9 range - that this is faster than
the code I posted before. And even then it's not _much_ faster. About
20% at 1e11.

This reaction, more broadly, has been rather surprising. The whole
point of using a sieve is that it is asymptotically faster. Who
cares about how fast primes under a billion, or ten billion, or
fifty billion, can be computed? It's much more interesting to ask
how high a limit can be searched in a day or a week of computing.
(I don't mean just your comment, but comments from other folks
bragging about how fast they can find all 32-bit primes.) I ran
my earlier programs up to limits of 3 trillion or so.

Over in comp.lang.c this came up too, and there's a link there to the
Sieve of Atkin

<https://en.wikipedia.org/wiki/Sieve_of_Atkin>

and to a program

<http://cr.yp.to/primegen.html>

which implements it.

I saw this link but didn't play around with the program.

I'm pleased to be able to report that that program, written for a 1998
CPU, is slower than mine on my modern CPU.

What has been interesting is that these days prime-finding programs
really need to be tailored to the hardware they will be run on.
Somehow that seems like a step backwards. Fifty years ago programmers
always thought about low-level hardware characteristics when writing
programs. Over time the fraction of time spent worrying about such
things has gotten smaller and smaller, to the point now where it is
almost never important. But wanting to write a fast prime-finding
program has taken us back to the glory days of yesteryear. ;)

Incidentally, I found a program primesieve that's available under
Ubuntu Linux. On a whim I had it find primes less than 2**49,
which took slightly more than 12 hours elapsed time (running on 12
cores). If computing primes up to N, a nice ratio to compute is

(number of primes less than N) * log N / N

which converges to 1, but very slowly. For primes less than 2**49
I got 1.03135087927594382.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Vir Campestris@21:1/5 to Tim Rentsch on Sun Oct 20 12:44:21 2024

On 07/10/2024 16:41, Tim Rentsch wrote:

This reaction, more broadly, has been rather surprising. The whole
point of using a sieve is that it is asymptotically faster. Who
cares about how fast primes under a billion, or ten billion, or
fifty billion, can be computed? It's much more interesting to ask
how high a limit can be searched in a day or a week of computing.
(I don't mean just your comment, but comments from other folks
bragging about how fast they can find all 32-bit primes.) I ran
my earlier programs up to limits of 3 trillion or so.

My programs all require the sieved prime map to fit in RAM. Which in my
system limits me to about 10^11 primes.

It has occurred to me to extend it for much larger numbers, by paging
out to disc files. But I've wasted far too much time on this already!

Andy

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Vir Campestris on Mon Nov 4 03:56:28 2024

Vir Campestris <[email protected]d> writes:

On 07/10/2024 16:41, Tim Rentsch wrote:

This reaction, more broadly, has been rather surprising. The whole
point of using a sieve is that it is asymptotically faster. Who
cares about how fast primes under a billion, or ten billion, or
fifty billion, can be computed? It's much more interesting to ask
how high a limit can be searched in a day or a week of computing.
(I don't mean just your comment, but comments from other folks
bragging about how fast they can find all 32-bit primes.) I ran
my earlier programs up to limits of 3 trillion or so.

My programs all require the sieved prime map to fit in RAM. Which in
my system limits me to about 10^11 primes.

It has occurred to me to extend it for much larger numbers, by paging
out to disc files. But I've wasted far too much time on this already!

I propose a new challenge: compute all primes up to a large number
(e.g., 10e12) with a program that has a small memory footprint (and
no storing intermediate values on disk, etc).

I wrote a small program in C, with a memory footprint well under
one megabyte, and found all primes under 10240000000000 (all it
did was count them, which I checked using primesieve). About 150
lines of fairly routine C code.

If you want a sanity check there are 354080024725 primes less
than 10240000000000.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	80:51:56
Calls:	12,451
Calls today:	1
Files:	15,194
Messages:	6,537,735

OT: Re: Sieve of Erastosthenes optimized to the max

Who's Online

Recent Visitors

System Info