• OT: Re: Sieve of Erastosthenes optimized to the max

    From Vir Campestris@21:1/5 to Tim Rentsch on Thu Jul 25 12:46:02 2024
    On 23/07/2024 15:34, Tim Rentsch wrote:
    Vir Campestris <[email protected]d> writes:

    [...]

    I was hoping for some feedback, even if very brief,
    before I continue.

    Sorry, I was slow. It's really odd, I'm retired, and I ought to have
    lots of time for things like this... and yet I seem to be busy all the time.

    On 20/07/2024 15:41, Tim Rentsch wrote:
    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:

    I got the code. [...]


    I checked. The modulus operations (%) are easy to check for.

    There are two for each prime number. Not for each multiple of it, but
    for the primes themselves.

    And there's one in the "get" function that checks to see if a number
    is prime. [...]

    And one divide for each modulus.

    I found searching my code for the '%' character was quite easy.
    '/', however, is in all my comments, so I get lots of false positives.


    Both divide and mod can be simplified if we use a different
    representation for numbers. The idea is to separate the
    number into a mod 30 part and divided by 30 part. An
    arbitrary number N can be split into an ordered pair

    N/30 , N%30

    stored in, let's say, an 8 byte quantity with seven bytes
    for N/30 and one byte for N%30. Let me call these two
    parts i and a for one number N, and j and b for a second
    number M. Then two form N*M we need to find


    I suspect the cost of extracting the divided value from the 7 bytes will
    be prohibitive. You may find it's best to have one table for the N/30
    parts (big entries, word aligned) and another for the N%30 part (small
    entries, byte aligned). BICBW.

    (i*30+a) * (j*30+b)

    which is

    (i*30*j*30) + (i*30*b) + (j*30*a) + a*b

    Of course we want to take the product and express it in
    the form (product/30, product%30), which can be done by

    30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

    The two numbers a and b are both under 30, so a*b/30 and
    a*b%30 can be done by lookup in a small array.

    30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

    Now all operations can be done using just multiplies and
    table lookups. Dividing by 30 is just taking the first
    component; remainder 30 is just taking the second component.

    One further refinement: for numbers relatively prime to 2,
    3, and 5, there are only 8 possibles, so the modulo 30
    component can be reduced to just three bits. That makes the
    tables smaller, at a cost of needing a table lookup to find
    and 'a' or 'b' part.

    Does this all make sense?

    Yes, it does. It's not a direction I was thinking of at all, although I
    too had table lookups in mind.

    What I had considered is an extrapolation from my original code.

    For each prime my original code starts with p^2, and sets the bit. It
    than adds p*2, and sets the bit for that. Since I wasn't storing even
    numbers every prime is odd, which means p^2 is odd. p^2+p would be even,
    so I skip that and go on to p^2+p*2, and so on.

    Given the mod30 storage I should be able to make a table with 8 entries,
    each containing a step and a mask. The step is the number of bytes in
    the big table to step to for the next bit to set; the mask is the bit to
    set.

    For 37, for example, the first bits we need to set are these:

    Mul'ple value /30 Step %30
    37 1369 45 19
    41 1517 50 5 17
    47 1739 57 7 29
    49 1813 60 3 13
    53 1961 65 5 11
    59 2183 72 7 23
    61 2257 75 3 7

    67 2479 82 7 19
    71 2627 87 5 17
    77 2849 94 7 29
    79 2923 97 3 13
    83 3071 102 5 11
    89 3293 109 7 23
    91 3367 112 3 7


    You'll see the pattern; there is a repeat 8 lines long of both the Step
    (which is the delta in the word we need to access) and the %30 value.

    So what the code should do is start at p^2, and build this table. It'll
    be different for every prime, and will be quite expensive to build. But
    once the table is built it can add the step onto the current table
    index, then set the bit corresponding to the %30 value. The table should
    of course contain only the step and the bitmask corresponding to the %30
    value.

    (I used 37 in this example BTW, because I wanted a value more than 30,
    and with 31 the /30 column increments by 1 each time I added a prime)

    For small values, as we discussed, it's going to be faster to copy the
    small table for that prime than to apply the mask to each word of the
    store individually.

    I had also considered looking for a mod-div value greater than 30.

    Bonita's original code stored a bit for every prime. We can call that a compression ratio of 1. I store the odd only, so the compression ratio
    is 2. By storing mod30 you increase that to 3.75. But there's a cost on
    modern computers because it implies lots of byte level access to store.

    If on the other hand we store mod(2*3*5*7*11), which is 2310, we get 480 candidates. 480 isn't far off 512 which is a nice size for us to handle,
    and gives us an even better compression ratio of just over 4.5. But of
    course all those tables with 8 or 30 entries will be getting a bit big
    and painful.

    I might even get around to writing some of it. Be warned, this is a time
    trap!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Sat Aug 10 07:07:07 2024
    Vir Campestris <[email protected]d> writes:

    On 23/07/2024 15:34, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    [...]

    I was hoping for some feedback, even if very brief,
    before I continue.

    Sorry, I was slow. It's really odd, I'm retired, and I ought to
    have lots of time for things like this... and yet I seem to be
    busy all the time.

    No worries, I understand.

    I'm sorry to be so long in responding. Among other things
    there was some sort of snafu with my news server, resulting
    in hundreds of lost and/or duplicated messages, and that
    took a while to sort out.

    On 20/07/2024 15:41, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:
    [...]
    Both divide and mod can be simplified if we use a different
    representation for numbers. The idea is to separate the
    number into a mod 30 part and divided by 30 part. An
    arbitrary number N can be split into an ordered pair

    N/30 , N%30

    stored in, let's say, an 8 byte quantity with seven bytes
    for N/30 and one byte for N%30. Let me call these two
    parts i and a for one number N, and j and b for a second
    number M. Then two form N*M we need to find

    I suspect the cost of extracting the divided value from the 7
    bytes will be prohibitive. You may find it's best to have one
    table for the N/30 parts (big entries, word aligned) and another
    for the N%30 part (small entries, byte aligned). BICBW.

    Two tables, absolutely. What does BICBW stand for?

    (i*30+a) * (j*30+b)

    which is

    (i*30*j*30) + (i*30*b) + (j*30*a) + a*b

    Of course we want to take the product and express it in
    the form (product/30, product%30), which can be done by

    30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

    The two numbers a and b are both under 30, so a*b/30 and
    a*b%30 can be done by lookup in a small array.

    30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

    Now all operations can be done using just multiplies and
    table lookups. Dividing by 30 is just taking the first
    component; remainder 30 is just taking the second component.

    One further refinement: for numbers relatively prime to 2,
    3, and 5, there are only 8 possibles, so the modulo 30
    component can be reduced to just three bits. That makes the
    tables smaller, at a cost of needing a table lookup to find
    and 'a' or 'b' part.

    Does this all make sense?

    Yes, it does. It's not a direction I was thinking of at all,
    although I too had table lookups in mind.

    What I had considered is an extrapolation from my original code.

    I read the comments below in conjunction with getting a better
    understanding of your original code, which is part of what took
    me so long before responding. Before I get to the stuff below
    (which I think I will save for a second reply), here is some code
    expanding on what I said above. It should compile cleanly as is.
    However, its purpose is to illustrate what I said above.

    typedef unsigned long NN, Index, Count, Mask;
    typedef unsigned char Byte;

    static void remove_multiples( Index, Count, Index );

    static Byte bytes[1000000000];
    static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

    Count
    sieve( NN limit ){
    Count r = 0;
    Index i;
    Count shift;
    Index i_limit = (limit + 29)/30;

    bytes[0] = 1;
    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i];
    for( shift = 0; shift < 8; shift++ ){
    NN p = i*30 + mods[shift];
    if( bits & 1<<shift ) continue;
    if( p >= limit ) continue;
    r++; // count how many primes found
    // could print the prime 'p' here if desired
    if( p*p < limit ) remove_multiples( i, shift, i_limit );
    }
    }

    return r;
    }

    static Byte shift_for[30] = {
    9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
    };

    static Byte ab_div_30[8][8] = {
    { 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
    { 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
    { 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
    { 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
    { 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
    { 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
    { 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
    { 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
    };

    static Byte ab_mod_30[8][8] = {
    { 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
    { 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
    { 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
    { 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
    { 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
    { 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
    { 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
    { 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
    };

    void
    remove_multiples( Index i, Count shift_a, Index i_limit ){
    Index j, k;
    NN a = mods[shift_a], b, c;
    Count shift;

    for( shift = shift_a; shift < 8; shift++ ){
    j = i;
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }

    for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
    for( shift = 0; shift < 8; shift++ ){
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }
    }
    }

    In both sieve() and remove_multiples(), there is an inner loop
    that ranges over the possible shift values. (There is also an
    unnested loop in remove_multiples() that ranges over a subset of
    shift values, but for now please ignore that.)

    Since these loops always perform exactly eight iterations, they
    can be completely unrolled without expanding the code too much.
    Here is a revised version of the outer loop in sieve() (actually
    the code below is not yet ready for compilation, because some
    things have changed):

    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i] ^ 0xFF;
    if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
    if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
    if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
    if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
    if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
    if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
    if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
    if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
    }

    We changed the interface to remove_multiples() to pass along the
    index value 'i' and also the value for 'a' rather than the value
    calculated for the prime 'p'. Since the loop has been unrolled,
    the values for the 'a' arguments are all constants (and of course
    all the mask values are constants).

    After making a similar change to the remove_multiples() loop, the
    values for the 'b's are likewise all constants. If the optimizer
    is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
    into constants. Also the expressions 'i*b' and 'j*a' all turn
    into multiplication by a constant. Furthermore, to determine the
    appropriate bit value to 'or' in, an eight-choice conditional
    expression that tests each of the eight possible residues can
    similarly be optimized away so the masks to 'or' in are all
    constants. Do you see how that all works?

    The point of all this unrolling is to get rid of all the shifting
    and table lookups. The only variables involved are the loop
    variables 'i' and 'j'; everything else has been turned into a
    constant. Granted, the generated code has been expanded a great
    deal (basically a factor of 8*8 == 64), but the pieces are now
    much simpler so the result is still manageable.

    I wanted to be sure you understand how the original idea is
    transformed into a better performing version before posting
    something more refined. I get that your time is at a premium,
    still you might trying going through the exercise of unrolling
    the loops and writing a set_product() routine (or macro) to get a
    sense of where I'm going here.

    Hope to send response to send part sometime soon.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to All on Sat Aug 10 17:24:53 2024
    Vir Campestris <[email protected]d> writes:

    This post is my second response to your comments, more
    specifically to one part of your comments.

    On 20/07/2024 15:41, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:

    [...]

    I had also considered looking for a mod-div value greater than 30.

    Bonita's original code stored a bit for every prime. We can call
    that a compression ratio of 1. I store the odd only, so the
    compression ratio is 2. By storing mod30 you increase that to
    3.75. But there's a cost on modern computers because it implies
    lots of byte level access to store.

    If on the other hand we store mod(2*3*5*7*11), which is 2310, we
    get 480 candidates. 480 isn't far off 512 which is a nice size
    for us to handle, and gives us an even better compression ratio of
    just over 4.5. But of course all those tables with 8 or 30
    entries will be getting a bit big and painful.

    I tried using a mod 240 encoding, so the units are 64-bit elements,
    to see if the more natural sized access would help. The result ran
    slower than the mod 30 code.

    The idea of using more primes seems like a natural extension, and
    maybe it can work. Certainly it can get greater compression, which
    is a plus. I had looked before at using a mod 210 encoding (which
    is 30 * 7). My sense is that, unless the extra compression is
    essential, the added code complexity will hurt performance more
    than it helps. But I never got to the point of actually coding
    it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to All on Sun Aug 11 00:00:50 2024
    Vir Campestris <[email protected]d> writes:

    This posting is my third (and I think last) response to your
    last set of comments.

    On 20/07/2024 15:41, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:
    [...]

    What I had considered is an extrapolation from my original code.

    About the original code.. One part of it took me a while to
    figure out. I would summarize it as follows. Rather than
    striking all multiples of a given prime before going on to the
    next one, you put a limit on the memory of where the strikes
    can occur to keep those bytes in the L1 cache. So several or
    many primes are processed at once, as long as the products are
    inside the target area (which I think is on the order of half a
    megabyte). I haven't done any careful measurements, but it
    does appear as though this approach provides a significant
    performance gain. (I don't have any estimates for how much.)

    A cost of using this method is it takes some memory to keep
    track of where any non-finished primes are in their multiple
    striking. That cost can be relevant if amount of memory needed
    starts to be a significant portion of the L1 cache size.

    For each prime my original code starts with p^2, and sets the bit.
    It than adds p*2, and sets the bit for that. Since I wasn't
    storing even numbers every prime is odd, which means p^2 is odd.
    p^2+p would be even, so I skip that and go on to p^2+p*2, and so
    on.

    Given the mod30 storage I should be able to make a table with 8
    entries, each containing a step and a mask. The step is the
    number of bytes in the big table to step to for the next bit to
    set; the mask is the bit to set.

    For 37, for example, the first bits we need to set are these:

    Mul'ple value /30 Step %30
    37 1369 45 19
    41 1517 50 5 17
    47 1739 57 7 29
    49 1813 60 3 13
    53 1961 65 5 11
    59 2183 72 7 23
    61 2257 75 3 7

    67 2479 82 7 19
    71 2627 87 5 17
    77 2849 94 7 29
    79 2923 97 3 13
    83 3071 102 5 11
    89 3293 109 7 23
    91 3367 112 3 7


    You'll see the pattern; there is a repeat 8 lines long of both
    the Step (which is the delta in the word we need to access) and
    the %30 value.

    Right. There is a regular pattern of deltas, corresponding to
    a kind of strength reduction of doing the multiplies.

    So what the code should do is start at p^2, and build this table.
    It'll be different for every prime, and will be quite expensive to
    build. But once the table is built it can add the step onto the
    current table index, then set the bit corresponding to the %30
    value. The table should of course contain only the step and the
    bitmask corresponding to the %30 value.

    Note that the %30 tables can be shared. There are 8 of them, one
    for each of the eight residues mod 30 of the original prime.

    (I used 37 in this example BTW, because I wanted a value more than
    30, and with 31 the /30 column increments by 1 each time I added a
    prime)

    I get a different result for starting with 31 but that's not
    important, I expect one of us made a simple calculation mistake.

    If we are looking at a prime p = 30*i + a, then the first
    delta for the /30 column is

    (30*i+a) * (30*i+b) / 30 - (30*i+a) * (30*i+a) / 30

    which is

    (30*i*i + i*b + i*a + a*b/30) -
    (30*i*i + i*a + i*a + a*a/30)

    which is

    i*(b-a) + (a*b/30 - a*a/30)

    when we get to 8 elements later, the /30 column is

    (30*i+a) * (30*(i+1)+b) / 30 - (30*i+a) * (30*(i+1)+a) / 30

    which is

    (30*i*i + i + i*b + i*a + a + a*b/30) -
    (30*i*i + i + i*a + i*a + a + a*a/30)

    which is again

    i*(b-a) + (a*b/30 - a*a/30)

    that is, the same delta as the one eight steps before. This
    formula shows us how to calculate the deltas, with the help of one
    table for the (a*b'/30 - a*b/30) values, and one table for the
    (b'-b) values, which need to be multiplied by the appropriate i
    value (namely, p/30).

    So I think the delta calculation can be written

    i*delta_b[x] + delta_bb[x]

    where x goes in a circular loop over 8 values [ x0 ... x7 ].

    All the above assuming I haven't made an algebraic mistakes, which
    certainly is not guaranteed.

    Besides showing how to calculate the deltas, the last formula shows
    how to reduce the amount of memory needed to save the state of
    processing a given prime. Namely, all that is needed is the
    running value of what byte holds the product bit, the index i of
    the original prime, and the value for x (three bits) along with
    something that shows what range it cycles through (another three
    bits). It should be easy to hold all that state in two 64-bit
    words.

    If instead we tried to store a table with 8 entries for each prime
    that is still being processed, that would use a lot more memory and
    eat into our L1 cache memory budget. At some point we have to
    wonder if it would be cheaper to do more recalculation but need
    less memory to keep track of where we are. Not a simple problem.

    Forgive me if the above seems a bit scattered. I started with a
    rough roadmap and just dived in and began writing. Feel free to
    ask about anything that isn't clear or doesn't make sense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Mon Aug 12 15:32:33 2024
    (replying to two messages at once)

    On 10/08/2024 15:07, Tim Rentsch wrote:
    I suspect the cost of extracting the divided value from the 7
    bytes will be prohibitive. You may find it's best to have one
    table for the N/30 parts (big entries, word aligned) and another
    for the N%30 part (small entries, byte aligned). BICBW.
    Two tables, absolutely. What does BICBW stand for?

    But I Could Be Wrong. I thought that one was fairly well known.

    On 11/08/2024 08:00, Tim Rentsch wrote:
    Note that the %30 tables can be shared. There are 8 of them, one
    for each of the eight residues mod 30 of the original prime.

    I started coding that up. Then it occurred to me that there wasn't going
    to be much saving by storing a pointer (8 bytes on my machine) to the
    correct 8-byte long table. I suppose I could store a byte array index
    instead, but I've gone off in other directions.

    You've picked up my cache management tweak.

    I'll read through your code. But it might not be today. We've got the
    hottest day of the year here, and I don't have aircon in my garden
    office. In winter the computer and me are enough to keep it warm, but
    right now it's getting warmer by the minute.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Thu Aug 15 17:52:10 2024
    I'm going to come back to this slightly earlier version, and add
    comments inline.

    On 10/08/2024 15:07, Tim Rentsch wrote:
    Vir Campestris <[email protected]d> writes:

    On 23/07/2024 15:34, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    [...]

    I was hoping for some feedback, even if very brief,
    before I continue.

    Sorry, I was slow. It's really odd, I'm retired, and I ought to
    have lots of time for things like this... and yet I seem to be
    busy all the time.

    No worries, I understand.

    I'm sorry to be so long in responding. Among other things
    there was some sort of snafu with my news server, resulting
    in hundreds of lost and/or duplicated messages, and that
    took a while to sort out.


    I think that was Eternal September - he had an outage.

    On 20/07/2024 15:41, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:
    [...]
    Both divide and mod can be simplified if we use a different
    representation for numbers. The idea is to separate the
    number into a mod 30 part and divided by 30 part. An
    arbitrary number N can be split into an ordered pair

    N/30 , N%30

    stored in, let's say, an 8 byte quantity with seven bytes
    for N/30 and one byte for N%30. Let me call these two
    parts i and a for one number N, and j and b for a second
    number M. Then two form N*M we need to find

    I suspect the cost of extracting the divided value from the 7
    bytes will be prohibitive. You may find it's best to have one
    table for the N/30 parts (big entries, word aligned) and another
    for the N%30 part (small entries, byte aligned). BICBW.

    Two tables, absolutely. What does BICBW stand for?

    (i*30+a) * (j*30+b)

    which is

    (i*30*j*30) + (i*30*b) + (j*30*a) + a*b

    Of course we want to take the product and express it in
    the form (product/30, product%30), which can be done by

    30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

    The two numbers a and b are both under 30, so a*b/30 and
    a*b%30 can be done by lookup in a small array.

    30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

    Now all operations can be done using just multiplies and
    table lookups. Dividing by 30 is just taking the first
    component; remainder 30 is just taking the second component.

    One further refinement: for numbers relatively prime to 2,
    3, and 5, there are only 8 possibles, so the modulo 30
    component can be reduced to just three bits. That makes the
    tables smaller, at a cost of needing a table lookup to find
    and 'a' or 'b' part.

    Does this all make sense?

    Yes, it does. It's not a direction I was thinking of at all,
    although I too had table lookups in mind.

    What I had considered is an extrapolation from my original code.

    I read the comments below in conjunction with getting a better
    understanding of your original code, which is part of what took
    me so long before responding. Before I get to the stuff below
    (which I think I will save for a second reply), here is some code
    expanding on what I said above. It should compile cleanly as is.
    However, its purpose is to illustrate what I said above.

    typedef unsigned long NN, Index, Count, Mask;
    typedef unsigned char Byte;

    I've been using uint64_t, not unsigned long. When I started C unsigned
    long was 32 bits; I'm never quite sure what you'll get.

    static void remove_multiples( Index, Count, Index );

    static Byte bytes[1000000000];
    static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

    Count
    sieve( NN limit ){
    Count r = 0;
    Index i;
    Count shift;
    Index i_limit = (limit + 29)/30;

    bytes[0] = 1;
    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i];
    for( shift = 0; shift < 8; shift++ ){
    NN p = i*30 + mods[shift];
    if( bits & 1<<shift ) continue;
    if( p >= limit ) continue;
    r++; // count how many primes found
    // could print the prime 'p' here if desired
    if( p*p < limit ) remove_multiples( i, shift, i_limit );
    }
    }

    return r;
    }

    static Byte shift_for[30] = {
    9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
    };

    static Byte ab_div_30[8][8] = {
    { 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
    { 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
    { 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
    { 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
    { 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
    { 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
    { 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
    { 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
    };

    static Byte ab_mod_30[8][8] = {
    { 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
    { 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
    { 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
    { 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
    { 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
    { 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
    { 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
    { 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
    };


    constexpr on those tables might give the optimiser something else to
    play with.

    void
    remove_multiples( Index i, Count shift_a, Index i_limit ){

    So i is (prime/30) and shift_a is the bit in the mask byte -
    corresponding to (prime%30)?

    Index j, k;
    NN a = mods[shift_a], b, c;
    Count shift;

    for( shift = shift_a; shift < 8; shift++ ){
    j = i;
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }

    for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
    for( shift = 0; shift < 8; shift++ ){
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }
    }
    }

    In both sieve() and remove_multiples(), there is an inner loop
    that ranges over the possible shift values. (There is also an
    unnested loop in remove_multiples() that ranges over a subset of
    shift values, but for now please ignore that.)

    Since these loops always perform exactly eight iterations, they
    can be completely unrolled without expanding the code too much.
    Here is a revised version of the outer loop in sieve() (actually
    the code below is not yet ready for compilation, because some
    things have changed):

    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i] ^ 0xFF;
    if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
    if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
    if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
    if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
    if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
    if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
    if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
    if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
    }

    Before unrolling always benchmark. (Modern compilers are pretty smart,
    and sometimes will unroll for you. Modern processors with branch
    prediction are pretty good about working out when to jump and when not
    to. And with speculative execution they sometimes run both paths from a branch... I could go on!)

    We changed the interface to remove_multiples() to pass along the
    index value 'i' and also the value for 'a' rather than the value
    calculated for the prime 'p'. Since the loop has been unrolled,
    the values for the 'a' arguments are all constants (and of course
    all the mask values are constants).

    After making a similar change to the remove_multiples() loop, the
    values for the 'b's are likewise all constants. If the optimizer
    is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
    into constants. Also the expressions 'i*b' and 'j*a' all turn
    into multiplication by a constant. Furthermore, to determine the
    appropriate bit value to 'or' in, an eight-choice conditional
    expression that tests each of the eight possible residues can
    similarly be optimized away so the masks to 'or' in are all
    constants. Do you see how that all works?

    The point of all this unrolling is to get rid of all the shifting
    and table lookups. The only variables involved are the loop
    variables 'i' and 'j'; everything else has been turned into a
    constant. Granted, the generated code has been expanded a great
    deal (basically a factor of 8*8 == 64), but the pieces are now
    much simpler so the result is still manageable.

    And now a few minutes later as I read that I understand the reason for
    the unrolling better. But my point still stands - benchmark it.

    Of course the decision on whether to unroll or not might depend on the
    target CPU!

    I wanted to be sure you understand how the original idea is
    transformed into a better performing version before posting
    something more refined. I get that your time is at a premium,
    still you might trying going through the exercise of unrolling
    the loops and writing a set_product() routine (or macro) to get a
    sense of where I'm going here.

    Hope to send response to send part sometime soon.

    You seem to have spotted a pattern in the steps that I didn't - I looked
    at the mask values, and decided it wasn't worth working out which of the
    8 possible sequences to use - remembering which was too expensive. But
    it's also (I think, I haven't coded it) to work out whether you need to
    step 2, 4, 8 times the prime. And that would save more memory.

    Another thought BTW - whether it's worth storing with a larger modulo -
    say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
    whether the steps between primes should take that into consideration.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Fri Aug 16 07:48:07 2024
    Vir Campestris <[email protected]d> writes:

    (replying to two messages at once)

    On 10/08/2024 15:07, Tim Rentsch wrote:

    I suspect the cost of extracting the divided value from the 7
    bytes will be prohibitive. You may find it's best to have one
    table for the N/30 parts (big entries, word aligned) and another
    for the N%30 part (small entries, byte aligned). BICBW.

    Two tables, absolutely. What does BICBW stand for?

    But I Could Be Wrong. I thought that one was fairly well known.

    Ah. No wonder I didn't know it. ;)

    On 11/08/2024 08:00, Tim Rentsch wrote:

    Note that the %30 tables can be shared. There are 8 of them, one
    for each of the eight residues mod 30 of the original prime.

    I started coding that up. Then it occurred to me that there wasn't
    going to be much saving by storing a pointer (8 bytes on my machine)
    to the correct 8-byte long table. I suppose I could store a byte array
    index instead, but I've gone off in other directions.

    Right, I was meaning keeping track of the index, not a pointer.

    You've picked up my cache management tweak.

    It's an interesting idea, although I'm not sure how well it
    integrates with what I've done.

    I'll read through your code. But it might not be today. We've got the hottest day of the year here, and I don't have aircon in my garden
    office. In winter the computer and me are enough to keep it warm, but
    right now it's getting warmer by the minute.

    My sympathies. I too find it hard to work when it's too hot.

    I see you have started looking at my code... I'll take a look.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Fri Aug 16 08:40:37 2024
    Vir Campestris <[email protected]d> writes:

    I'm going to come back to this slightly earlier version, and add
    comments inline.

    Okay good deal. I'm taking out some parts that look like they
    are no longer relevant.

    On 10/08/2024 15:07, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 20/07/2024 15:41, Tim Rentsch wrote:

    Vir Campestris <[email protected]d> writes:

    On 02/07/2024 07:20, Tim Rentsch wrote:

    [...]

    Both divide and mod can be simplified if we use a different
    representation for numbers. The idea is to separate the
    number into a mod 30 part and divided by 30 part. An
    arbitrary number N can be split into an ordered pair

    N/30 , N%30

    stored in, let's say, an 8 byte quantity with seven bytes
    for N/30 and one byte for N%30. Let me call these two
    parts i and a for one number N, and j and b for a second
    number M. Then two form N*M we need to find

    (i*30+a) * (j*30+b)

    which is

    (i*30*j*30) + (i*30*b) + (j*30*a) + a*b

    Of course we want to take the product and express it in
    the form (product/30, product%30), which can be done by

    30*i*j + i*b + j*a + (a*b)/30, (a*b)%30

    The two numbers a and b are both under 30, so a*b/30 and
    a*b%30 can be done by lookup in a small array.

    30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]

    Now all operations can be done using just multiplies and
    table lookups. Dividing by 30 is just taking the first
    component; remainder 30 is just taking the second component.

    One further refinement: for numbers relatively prime to 2,
    3, and 5, there are only 8 possibles, so the modulo 30
    component can be reduced to just three bits. That makes the
    tables smaller, at a cost of needing a table lookup to find
    and 'a' or 'b' part.

    Does this all make sense?

    Yes, it does. It's not a direction I was thinking of at all,
    although I too had table lookups in mind.

    What I had considered is an extrapolation from my original code.

    I read the comments below in conjunction with getting a better
    understanding of your original code, which is part of what took
    me so long before responding. Before I get to the stuff below
    (which I think I will save for a second reply), here is some code
    expanding on what I said above. It should compile cleanly as is.
    However, its purpose is to illustrate what I said above.

    typedef unsigned long NN, Index, Count, Mask;
    typedef unsigned char Byte;

    I've been using uint64_t, not unsigned long. When I started C
    unsigned long was 32 bits; I'm never quite sure what you'll get.

    I'm coming around to the point of view that size_t is the best
    choice for these types. In any case though the point of using a
    typedef is so the more meaningful names can be used but still
    changed easily where they are defined.

    static void remove_multiples( Index, Count, Index );

    static Byte bytes[1000000000];
    static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };

    Count
    sieve( NN limit ){
    Count r = 0;
    Index i;
    Count shift;
    Index i_limit = (limit + 29)/30;

    bytes[0] = 1;
    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i];
    for( shift = 0; shift < 8; shift++ ){
    NN p = i*30 + mods[shift];
    if( bits & 1<<shift ) continue;
    if( p >= limit ) continue;
    r++; // count how many primes found
    // could print the prime 'p' here if desired
    if( p*p < limit ) remove_multiples( i, shift, i_limit ); >> }
    }

    return r;
    }

    static Byte shift_for[30] = {
    9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7, >> };

    static Byte ab_div_30[8][8] = {
    { 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 }, >> { 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 }, >> { 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 }, >> { 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 }, >> { 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 }, >> { 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 }, >> { 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 }, >> { 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 }, >> };

    static Byte ab_mod_30[8][8] = {
    { 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 }, >> { 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 }, >> { 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 }, >> { 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 }, >> { 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 }, >> { 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 }, >> { 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 }, >> { 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 }, >> };

    constexpr on those tables might give the optimiser something else to
    play with.

    All of these tables are going to disappear in the next revision.

    void
    remove_multiples( Index i, Count shift_a, Index i_limit ){

    So i is (prime/30) and shift_a is the bit in the mask byte -
    corresponding to (prime%30)?

    Right. The value 'a' is 'mods[shift_a]', and the value 'shift_a' is 'shift_for[a]'. Again though we will be dispensing with shift_for[]
    and put in code that can be compiled to give constant values.

    Index j, k;
    NN a = mods[shift_a], b, c;
    Count shift;

    for( shift = shift_a; shift < 8; shift++ ){
    j = i;
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }

    for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
    for( shift = 0; shift < 8; shift++ ){
    b = mods[shift];
    k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
    c = ab_mod_30[ shift_a ][ shift ];
    if( k >= i_limit ) continue;
    bytes[k] |= 1 << shift_for[c];
    }
    }
    }

    In both sieve() and remove_multiples(), there is an inner loop
    that ranges over the possible shift values. (There is also an
    unnested loop in remove_multiples() that ranges over a subset of
    shift values, but for now please ignore that.)

    Since these loops always perform exactly eight iterations, they
    can be completely unrolled without expanding the code too much.
    Here is a revised version of the outer loop in sieve() (actually
    the code below is not yet ready for compilation, because some
    things have changed):

    for( i = 0; i < i_limit; i++ ){
    Byte bits = bytes[i] ^ 0xFF;
    if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
    if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
    if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
    if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
    if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
    if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
    if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
    if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
    }

    Before unrolling always benchmark. (Modern compilers are pretty smart,
    and sometimes will unroll for you. Modern processors with branch
    prediction are pretty good about working out when to jump and when not
    to. And with speculative execution they sometimes run both paths from
    a branch... I could go on!)

    I understand the advice. Keep in mind though that the code I'm
    showing (and also true in the next version) is presented as a way of
    explaining how I got to where I got. I didn't start with the loop
    version and unroll it; everything was unrolled right from the get
    go. And there are more revisions to come, even after the next one
    (to be shown below).

    We changed the interface to remove_multiples() to pass along the
    index value 'i' and also the value for 'a' rather than the value
    calculated for the prime 'p'. Since the loop has been unrolled,
    the values for the 'a' arguments are all constants (and of course
    all the mask values are constants).

    After making a similar change to the remove_multiples() loop, the
    values for the 'b's are likewise all constants. If the optimizer
    is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
    into constants. Also the expressions 'i*b' and 'j*a' all turn
    into multiplication by a constant. Furthermore, to determine the
    appropriate bit value to 'or' in, an eight-choice conditional
    expression that tests each of the eight possible residues can
    similarly be optimized away so the masks to 'or' in are all
    constants. Do you see how that all works?

    The point of all this unrolling is to get rid of all the shifting
    and table lookups. The only variables involved are the loop
    variables 'i' and 'j'; everything else has been turned into a
    constant. Granted, the generated code has been expanded a great
    deal (basically a factor of 8*8 == 64), but the pieces are now
    much simpler so the result is still manageable.

    And now a few minutes later as I read that I understand the reason
    for the unrolling better. But my point still stands - benchmark
    it.

    Of course the decision on whether to unroll or not might depend on
    the target CPU!

    For this approach to work it's crucial to unroll completely to make
    sure all the divides and remainders, and also the mask choices, get
    optimized out. I suppose it's possible that some architecture out
    there could be faster without all constant folding, but it seems
    unlikely, and in any case I made an early decision to write the code
    using this approach, which has worked out well.

    I wanted to be sure you understand how the original idea is
    transformed into a better performing version before posting
    something more refined. I get that your time is at a premium,
    still you might trying going through the exercise of unrolling
    the loops and writing a set_product() routine (or macro) to get a
    sense of where I'm going here.

    Hope to send response to send part sometime soon.

    You seem to have spotted a pattern in the steps that I didn't - I
    looked at the mask values, and decided it wasn't worth working out
    which of the 8 possible sequences to use - remembering which was too expensive. But it's also (I think, I haven't coded it) to work out
    whether you need to step 2, 4, 8 times the prime. And that would save
    more memory.

    The possible steps for mod30 are 2 or 4 or 6 times the prime.

    Another thought BTW - whether it's worth storing with a larger modulo
    - say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
    whether the steps between primes should take that into consideration.

    My guess is that the possible steps are still just 2 or 4 or 6
    (times the prime), but I haven't checked that.

    Now here is revised code where the unrolling has been done, and also
    uses optimizable mask production:

    void
    mod30_sieve_x( Index N, NN limit, Index i_limit ){
    Index i;

    bytes[0] = 1;
    for( i = 0; i < i_limit; i++ ){
    Bits cbits = bytes[i];
    Bits pbits = cbits ^ -1;
    if( pbits & 0001 ) remove_multiples_x( N, i, 1, pbits );
    if( pbits & 0002 ) remove_multiples_x( N, i, 7, pbits );
    if( pbits & 0004 ) remove_multiples_x( N, i, 11, pbits );
    if( pbits & 0010 ) remove_multiples_x( N, i, 13, pbits );
    if( pbits & 0020 ) remove_multiples_x( N, i, 17, pbits );
    if( pbits & 0040 ) remove_multiples_x( N, i, 19, pbits );
    if( pbits & 0100 ) remove_multiples_x( N, i, 23, pbits );
    if( pbits & 0200 ) remove_multiples_x( N, i, 29, pbits );
    }
    }


    #define set_product_x( i, a, j, b ) ( \
    bytes[ i*j*30 + i*b + j*a + a*b/30 ] |= BIT_FOR( (a*b%30) ) \
    )

    #define BIT_FOR( m ) ( \
    m == 1 ? 0001 : m == 7 ? 0002 : m == 11 ? 0004 : m == 13 ? 0010 : \
    m == 17 ? 0020 : m == 19 ? 0040 : m == 23 ? 0100 : m == 29 ? 0200 : \
    0 \ )

    void
    remove_multiples_x( Index N, Index i, NN a, Bits ibits ){
    NN p = 30*i + a;
    Index j;
    Index jn = (N*30/p + 29)/30;

    switch( a ){
    case 1: set_product_x( i, a, i, 1 );
    case 7: set_product_x( i, a, i, 7 );
    case 11: set_product_x( i, a, i, 11 );
    case 13: set_product_x( i, a, i, 13 );
    case 17: set_product_x( i, a, i, 17 );
    case 19: set_product_x( i, a, i, 19 );
    case 23: set_product_x( i, a, i, 23 );
    case 29: set_product_x( i, a, i, 29 );
    }

    for( j = i+1; j < jn; j++ ){
    set_product_x( i, a, j, 1 );
    set_product_x( i, a, j, 7 );
    set_product_x( i, a, j, 11 );
    set_product_x( i, a, j, 13 );
    set_product_x( i, a, j, 17 );
    set_product_x( i, a, j, 19 );
    set_product_x( i, a, j, 23 );
    set_product_x( i, a, j, 29 );
    }
    }


    Some comments about the upper bounds of the loops:

    First, assume that 'i_limit' has been computed in conjunction with
    the determination of the parameter N so that it is exactly right.

    Second, the initializing expression for 'jn' is safe but may miss
    part of one more value of j. There are different ways to take care
    of that, but it's a minor detail so I'm skipping over it for now.
    Of course feel free to think about how that could be done if you
    want to, and if not then no worries, I expect to say more about that
    in a later followup.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Tim Rentsch on Sun Aug 18 19:52:00 2024
    Tim Rentsch <[email protected]> writes:

    Vir Campestris <[email protected]d> writes:

    [...]

    You seem to have spotted a pattern in the steps that I didn't - I
    looked at the mask values, and decided it wasn't worth working out
    which of the 8 possible sequences to use - remembering which was
    too expensive. But it's also (I think, I haven't coded it) to work
    out whether you need to step 2, 4, 8 times the prime. And that
    would save more memory.

    The possible steps for mod30 are 2 or 4 or 6 times the prime.

    Another thought BTW - whether it's worth storing with a larger
    modulo - say 2*3*5*7 - not just 2*3*5 is a different decision from
    deciding whether the steps between primes should take that into
    consideration.

    My guess is that the possible steps are still just 2 or 4 or 6
    (times the prime), but I haven't checked that.

    I guessed wrong. Considered mod 210, there are

    15 steps of 2
    15 steps of 4
    14 steps of 6
    2 steps of 8
    2 steps of 10

    (for a total of 48 residues) due to knocking out multiples of 7.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Mon Aug 19 21:23:22 2024
    On 16/08/2024 18:35, Bonita Montero wrote:
    <snip>
    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    It's not just storage you save, it's also computation.

    That program I published up-thread is almost as fast as yours - within
    10% of the elapsed time - while only using a single core. The amount of
    CPU used is much lower.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Mon Aug 19 21:34:19 2024
    On 16/08/2024 18:35, Bonita Montero wrote:
    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    BTW I was just checking outputs.

    66049
    67591
    69133
    69647
    71189
    72217
    72731
    75301
    78899
    79927
    80441
    81469
    85067
    86609
    89179
    89693
    90721
    92263
    94319
    95861
    97403
    98431
    99973

    all show in the output for your program, but not mine. I think you'll
    find they are products of 257 and the next few primes.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to All on Mon Aug 19 21:08:11 2024
    So I'm a little late, but here's my effort to use the modulo 30 trick.

    Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x, 64GB RAM

    g++ -O3 -std=c++17
    5761455 primess less than 100 million in 0.182269s
    50847534 primes less than 1 billion in 2.841167s
    455052511 primes less than 10billion in 53.009133s


    -- CUT HERE ---

    #include <iostream>
    #include <chrono>
    #include <stdint.h>
    #include <vector>
    #include <tuple>
    #include <string>
    #include <iomanip>

    using inttype_t = uint64_t;
    using settype_t = unsigned char;

    class Sieve {
    static constexpr inttype_t MOD_VALUE { 30ull };
    // 1, 7, 11, 13, 17, 19, 23, 29
    static constexpr settype_t masks[] = {
    0x00, 0x01, 0x00, 0x00, 0x00, 0x00, // 1
    0x00, 0x02, 0x00, 0x00, 0x00, 0x04, // 7, 11
    0x00, 0x08, 0x00, 0x00, 0x00, 0x10, // 13, 17
    0x00, 0x20, 0x00, 0x00, 0x00, 0x40, // 19, 23
    0x00, 0x00, 0x00, 0x00, 0x00, 0x80 // 29
    };
    inttype_t max_val;
    std::vector<settype_t> sieve;
    inttype_t nprimes;

    using lookup_t = std::tuple<inttype_t, settype_t>;

    inline lookup_t lookup(inttype_t val)
    {
    return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);
    }


    public:
    inline bool is_prime(inttype_t val)
    {
    const auto [index, mask] = lookup(val);
    return static_cast<bool>(sieve[index] & mask);
    }

    inline void mark_prime(inttype_t val)
    {
    ++nprimes;
    const inttype_t incr = val + val;
    for (inttype_t j = val + incr; j <= max_val; j += incr)
    {
    const auto [index, mask] = lookup(j);
    sieve[index] &= ~mask;
    }
    }

    inttype_t get_prime_count() { return nprimes; }

    Sieve(inttype_t max_)
    : max_val(max_)
    , sieve((max_ / MOD_VALUE) + 1, ~settype_t())
    , nprimes(3)
    {
    }
    };

    int main(int argc, char *argv[])
    {
    std::vector<std::string> args(argv, argv+argc);
    constexpr inttype_t default_max = 100ull;
    inttype_t max_val = default_max;
    if (argc > 1)
    {
    max_val = std::stoull(args[1]);
    if (max_val == 0)
    max_val = default_max;
    }

    Sieve sieve(max_val);

    const auto start = std::chrono::steady_clock::now();
    for (inttype_t j = 7 ; j <= max_val ; j += 2)
    {
    if (sieve.is_prime(j))
    {
    sieve.mark_prime(j);
    }
    }
    const auto finish = std::chrono::steady_clock::now();
    const auto diff =
    std::chrono::duration_cast<std::chrono::microseconds>(finish -
    start);
    constexpr uint64_t us_per_sec = 1'000'000;
    auto us = diff.count();

    std::cout << sieve.get_prime_count()
    << " primes found less than "
    << max_val
    << " in "
    << (us / us_per_sec)
    << "."
    << std::setw(6)
    << std::setfill('0')
    << (us % us_per_sec)
    << "s\n";
    return 0;
    }
    -- CUT HERE ---

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Tue Aug 20 17:55:33 2024
    On 20/08/2024 16:24, Bonita Montero wrote:
    Am 20.08.2024 um 17:21 schrieb Bonita Montero:
    Am 19.08.2024 um 22:23 schrieb Vir Campestris:
    On 16/08/2024 18:35, Bonita Montero wrote:
    <snip>
    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    It's not just storage you save, it's also computation.

    That program I published up-thread is almost as fast as yours -
    within 10% of the elapsed time - while only using a single core. The
    amount of CPU used is much lower.

    Andy

    On my AMD 7990X all primes up to 2 ^ 32 are calculated with my code
    with a 65W-setting of the CPU in 0.168s. The total CPU-time is 3.688s.
    With your code all primes up to 2 ^ 32 take nearly exactly four seconds
    with a single thread. So the overall computation time with my algorithm
    is about seconds less.

    And if I run my code with a single core:

        C:\Users\Boni\Documents\Source\bitmapSieve>timep "x64\Release-clang++\bitmapSieve.exe" 0x100000000 "" 1
        real   1.795s
        user   1.766s
        sys    0.000s
        cylces 8.011.804.500

    Thats less than half the computation time.

    That's really odd. I too have an AMD - although my AMD Ryzen 5 3400G is
    chosen for low power, not speed. Surely it can't be because I use Linux,
    not Windows?

    I'm leaning towards thinking you may have a point on the mod30 code.
    There are more operations on the innermost loop with mod30, although the
    loop goes around fewer times. It also means that the store is forced to
    be byte wide - I find that my original odd-only code is significantly
    faster - about 30% - when the store is 64 bit wide rather than byte.

    Speed of writing the primes out? I don't care. The only reason I put it
    in there was to allow me to check the output. If I did care I'd be
    putting a more efficient enumeration function too.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to red floyd on Tue Aug 20 21:14:30 2024
    On 20/08/2024 05:08, red floyd wrote:
    So I'm a little late, but here's my effort to use the modulo 30 trick.

    Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x, 64GB RAM

    g++ -O3 -std=c++17
    5761455 primess less than 100 million in 0.182269s
    50847534 primes less than 1 billion in 2.841167s
    455052511 primes less than 10billion in 53.009133s

    It's slow.

    Looking quickly at the (remarkably small) code I see line 30:

    return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);

    That mod and div are being called for every single time you mark a prime
    in the bitmap.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Tue Aug 20 20:50:53 2024
    Bonita Montero <[email protected]> writes:
    Am 19.08.2024 um 22:34 schrieb Vir Campestris:


    I wrote a small program

    Most people would use the standard command line
    tools to do this (e.g. sort, uniq and diff).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Scott Lurndal on Thu Aug 22 17:30:40 2024
    On 20/08/2024 21:50, Scott Lurndal wrote:
    Bonita Montero <[email protected]> writes:
    Am 19.08.2024 um 22:34 schrieb Vir Campestris:


    I wrote a small program

    Most people would use the standard command line
    tools to do this (e.g. sort, uniq and diff).

    She's on Windows :P

    I just tried again to check my sanity. And decided neither of us are
    crazy. It's sensitive to the number of primes you ask for.

    $ ./Bonita 100 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 1000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 10000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 100000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 1000000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 10000000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 100000000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 1000000000 Bonita.txt && grep -w 66049 Bonita.txt
    $ ./Bonita 10000000000 Bonita.txt && grep -w 66049 Bonita.txt
    66049
    $ ./Bonita 100000000000 Bonita.txt && grep -w 66049 Bonita.txt
    66049
    $


    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Thu Aug 22 21:47:14 2024
    On 22/08/2024 17:38, Bonita Montero wrote:

    I did "findstr /x "66049" primes.txt" on my output and it didn't
    find that number with 1E10 as its maximum.

    Eppur si muove.

    I have no idea what is going on here. I even checked that I get the same
    result using clang instead of g++.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Thu Aug 22 21:56:32 2024
    On 16/08/2024 18:35, Bonita Montero wrote:
    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    I've been running some experiments.

    Skipping evens only is nice and simple on computation; that's good.

    Skipping something else requires table lookups. I've knocked up some
    code that uses the correct table for skipping primes (but not for mask
    bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

    Only the last one is faster than just skipping evens. And it's not much
    faster at the price of much more complexity, and a lot more store use.

    My hack will only work with a prime which is 1 modulo(2*3*5*7*11) and it
    would take a lot more code to make it work for all the other modulo
    values. And even more if I was to make it work out the bitmaps.

    I think we've done this to death now (and still with no C++ content!).

    I'm happy with my algorithm:
    - Work out a bitmask for some small primes, only as far as is needed to
    get repeats (the repeat length is the product of the primes)
    - Work out bitmasks for the next half a dozen. Each of those will have a
    repeat the length of the prime.
    - Combine those together into the output bitmap. That requires fewer
    operations than doing each one into the output bitmap
    - Process the larger primes. Do the output bitmap in blocks for cache friendliness
    Note that the larger the prime the less time it takes.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to Vir Campestris on Thu Aug 22 17:00:42 2024
    On 8/22/2024 1:56 PM, Vir Campestris wrote:
    On 16/08/2024 18:35, Bonita Montero wrote:
    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    I've been running some experiments.

    Skipping evens only is nice and simple on computation; that's good.

    Skipping something else requires table lookups. I've knocked up some
    code that uses the correct table for skipping primes (but not for mask
    bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

    You can skip multiples of three, by starting with 7, and then
    alternately adding 4 then 2.

    e.g.

    incr = 2;
    for (val = 7 ; val < MAX_PRIME_TO_CHECK ; val += incr)
    {
    check_for_prime(val);
    incr = 6 - incr;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to red floyd on Mon Aug 26 08:31:31 2024
    red floyd <[email protected]d> writes:

    So I'm a little late, but here's my effort to use the modulo 30
    trick.

    Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5 5600x,
    64GB RAM

    g++ -O3 -std=c++17
    5761455 primess less than 100 million in 0.182269s
    50847534 primes less than 1 billion in 2.841167s
    455052511 primes less than 10billion in 53.009133s

    I appreciate both the effort and your posting of results.

    [..program..]

    There are several ways that the program is doing noticeably more
    work than it needs to. I haven't done any measurements but I
    believe this extra work accounts for the biggest part of its
    relative slowness. Here is a short simple program to illustrate
    some ways that your program could be sped up.

    (The short simple program is the rest of this posting.)

    #include <stdio.h>
    #include <stdlib.h>


    typedef unsigned long long Index, Count, Bits, NN, N235;


    static const unsigned char mods[8] = { 1, 7, 11, 13, 17, 19, 23, 29, }; static unsigned char bytes[ 1ULL * 50 * 1000 * 1000 * 1000 ];


    static void mod30_sieve( NN );
    static NN product_NN( N235, N235 );
    static _Bool is_marked_composite( N235 );
    static void mark_composite( NN );
    static Count unmarked_less_than( NN );


    int
    main( int argc, char *argv[] ){
    NN ceiling = argc > 1 ? strtoull( argv[1], 0, 10 ) : 1000000000;

    setbuf( stdout, 0 );
    if( ceiling >= 30 * sizeof bytes ){
    printf( " urk.. limit must be less than %zu\n", 30 * sizeof bytes );
    exit( 0 );
    }

    printf( " determining primes less than %llu ...\n", ceiling );
    mod30_sieve( ceiling );
    printf( " ... found %llu primes.\n", unmarked_less_than( ceiling ) );

    return 0;
    }


    void
    mod30_sieve( NN ceiling ){
    N235 i, j;
    NN ijNN;

    for( bytes[0] = 1, i = 0; product_NN( i, i ) < ceiling; i++ ){
    if( is_marked_composite( i ) ) continue;

    for( j = i; ijNN = product_NN( i, j ), ijNN < ceiling; j++ ){
    mark_composite( ijNN );
    }
    }
    }

    NN
    product_NN( N235 x, N235 y ){
    NN xnn = (x>>3)*30 + mods[x&7];
    NN ynn = (y>>3)*30 + mods[y&7];

    return xnn * ynn;
    }


    _Bool
    is_marked_composite( N235 t ){
    return bytes[t>>3] & 1<<(t&7);
    }

    void
    mark_composite( NN nn ){
    static const unsigned char masks[] = {
    [1]= 1, [7]= 2, [11]= 4, [13]= 8, [17]= 16, [19]= 32, [23]=64, [29]=128,
    };
    bytes[ nn/30 ] |= masks[ nn%30 ];
    }


    Count
    unmarked_less_than( NN ceiling ){
    Index i = ceiling/30;
    NN extra = ceiling - i*30;
    Count r = (ceiling > 2) + (ceiling > 3) + (ceiling > 5);

    for( Index j = 0; j < 8 && mods[j] < extra; j++ ){
    r += bytes[i]>>j &1 ^1;
    }

    while( i && i-- ){
    for( Bits b = bytes[i] ^0xFF; b; b >>= 1 ) r += b&1;
    }

    return r;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 09:35:55 2024
    Vir Campestris <[email protected]d> writes:

    On 20/08/2024 05:08, red floyd wrote:

    So I'm a little late, but here's my effort to use the modulo
    30 trick.

    Using g++ 12.4.0, Cygwin under Windows 11 22631, Ryzen 5
    5600x, 64GB RAM

    g++ -O3 -std=c++17
    5761455 primess less than 100 million in 0.182269s
    50847534 primes less than 1 billion in 2.841167s
    455052511 primes less than 10billion in 53.009133s

    It's slow.

    Looking quickly at the (remarkably small) code I see line 30:

    return std::make_tuple(val / MOD_VALUE, masks[val % MOD_VALUE]);

    That mod and div are being called for every single time you
    mark a prime in the bitmap.

    That is a cost but I'm pretty sure it's not the most important
    aspect.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to red floyd on Mon Aug 26 10:59:33 2024
    red floyd <[email protected]d> writes:

    On 8/22/2024 1:56 PM, Vir Campestris wrote:

    On 16/08/2024 18:35, Bonita Montero wrote:

    But basically I don't think it is a good idea to skip numbers exept
    multiples of two. With the three you save a sixth of memory, with
    the five you save a 15-th and at the end you get about 20% less
    storage (1 / (2 * 3) + 1 / (2 * 3 * 5) + 1 / (2 * 3 * 5 * 7) ...)
    for a lot of computation. That's the point where I dropped this
    idea and I think this extra computation is higher than the time
    for the saved memory loads.

    I've been running some experiments.

    Skipping evens only is nice and simple on computation; that's good.

    Skipping something else requires table lookups. I've knocked up some
    code that uses the correct table for skipping primes (but not for
    mask bit selection) and run it for 2*3*5, 2*3*5*7, and 2*3*5*7*11.

    You can skip multiples of three, by starting with 7, and then
    alternately adding 4 then 2.

    e.g.

    incr = 2;
    for (val = 7 ; val < MAX_PRIME_TO_CHECK ; val += incr)
    {
    check_for_prime(val);
    incr = 6 - incr;
    }

    As a matter of style I would normally fold the initialization and
    update of 'incr' into the first and third expressions of the for()
    loop control.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 12:08:11 2024
    Vir Campestris <[email protected]d> writes:

    On 20/08/2024 16:24, Bonita Montero wrote:

    Am 20.08.2024 um 17:21 schrieb Bonita Montero:

    Am 19.08.2024 um 22:23 schrieb Vir Campestris:

    On 16/08/2024 18:35, Bonita Montero wrote:
    <snip>

    But basically I don't think it is a good idea to skip numbers
    exept multiples of two. [...] I think this extra computation
    is higher than the time for the saved memory loads.

    [...]

    I'm leaning towards thinking you may have a point on the mod30
    code. There are more operations on the innermost loop with mod30,
    although the loop goes around fewer times. It also means that the
    store is forced to be byte wide - I find that my original odd-only
    code is significantly faster - about 30% - when the store is 64
    bit wide rather than byte. [...]

    One motivation for choosing a mod30 representation is being able to
    compute more primes -- almost twice as many. Any machine with 64GB
    of memory should be able to compute primes up to 1.5 trillion.
    Even if an odds-only representation runs faster for more limited
    sizes (and I'm not yet convinced that it does), it doesn't matter
    if it's faster, because it doesn't do the job. I was computing all
    32-bit primes 20 years ago, back when 32-bit machines were a thing.
    The challenge for today's world is correspondingly higher.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Aug 26 12:47:56 2024
    Vir Campestris <[email protected]d> writes:

    On 16/08/2024 18:35, Bonita Montero wrote:

    But basically I don't think it is a good idea to skip numbers
    exept multiples of two. [...]

    I've been running some experiments.

    Skipping evens only is nice and simple on computation; that's
    good.

    Skipping something else requires table lookups. [...]

    It doesn't have to. One point of the code I've been posting is
    that table lookups, including mask productions, are eliminated;
    everything gets optimized away so what's left is all compile-time
    constants.

    I'm happy with my algorithm:
    - Work out a bitmask for some small primes, only as far as is
    needed to get repeats (the repeat length is the product of the
    primes)
    - Work out bitmasks for the next half a dozen. Each of those will
    have a repeat the length of the prime.
    - Combine those together into the output bitmap. That requires
    fewer operations than doing each one into the output bitmap
    - Process the larger primes. Do the output bitmap in blocks for
    cache friendliness

    My experience:

    * Special casing primes between 7 and 29 helps (and because it's
    the first step the results can initialize the map);

    * Special purpose code where primes are taken in pairs, for the
    set of primes more than 29 and less than 360, with each pair
    being folded into the initialized map, helps (incidentally,
    that is 31 pairs, or 62 primes total);

    * The combination of those two steps gives a saving of about 35%;

    * Doing striping (the term I use for working in cache-sized
    blocks) seems not to help much, so I don't use striping except
    incidentally in the prime-pairs step;

    * Forcing the function for the main second-level loop to be
    inlined saves another 15% or so;

    * I haven't tried to measure how the speed-up tricks scale with
    overall number of primes to calculate. My guess is that they
    are close to linear but I don't have any actual measurements.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Bonita Montero on Sun Sep 1 21:23:04 2024
    On 27/08/2024 05:09, Bonita Montero wrote:
    Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

    One motivation for choosing a mod30 representation is being able to
    compute more primes -- almost twice as many.  Any machine with 64GB
    of memory should be able to compute primes up to 1.5 trillion.

    You don't need to hold all primes in memory. Just calculate the primes
    up to the square root and then you can calculate every segment up to
    the maximum from that.

    So you compute all the primes up to the square root of a larger number.

    I was cleaning up my odds-only code with a view to posting it, when I
    had an idea.

    Take 101 (because I can write things more easily).

    I want to mark

    10201,10302,10403,10504,10605,10706...

    With odds only that becomes
    10201,10403,10605,10807,11009,11211...

    and as the loop increment is a constant it is nice and fast. With mod30
    it isn't a constant, and it hurts to work out the increment.

    Except...

    If I mark 10201,13231,16261,19291,22321
    with an increment of 30*p,

    then go back to 10201, add 2p, then increment that by 30
    10201 13231 16261
    and do that for the other 6 of the 8, my inner loop has a constant
    increment, and my code is compatible with the mod30 store.

    That might be quicker. I also have some thoughts over storing the prime
    as two parts. It's 30m+r, where m is modulus and r remainder. r maps one-for-one into the byte mask, and m is the index.

    I'm going to waste lots more time on this I can see!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Sun Sep 1 20:40:59 2024
    Vir Campestris <[email protected]d> writes:

    On 27/08/2024 05:09, Bonita Montero wrote:

    Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

    One motivation for choosing a mod30 representation is being able to
    compute more primes -- almost twice as many. Any machine with 64GB
    of memory should be able to compute primes up to 1.5 trillion.

    You don't need to hold all primes in memory. Just calculate the primes
    up to the square root and then you can calculate every segment up to
    the maximum from that.

    This is a brainless statement (and incidentally illustrates why I
    was motivated not to read postings from BM). In fact no primes need
    be pre-calculated to determine whether any given number is prime.
    The point of using a sieve is the sieve method is much faster than
    not using one. For example, suppose we want to determine all primes
    less than a trillion. Taking the approach of pre-computing only
    those primes less than a million (the square root) and then testing
    numbers individually takes more than 100 times as many operations as
    using a sieve. Furthermore the operations used are more expensive
    for the non-sieve approach - simply setting a bit in the case of a
    sieve, versus computing a remainder in the non-sieve case.

    So you compute all the primes up to the square root of a larger number.

    I was cleaning up my odds-only code with a view to posting it, when I
    had an idea.

    Take 101 (because I can write things more easily).

    I want to mark

    10201,10302,10403,10504,10605,10706...

    With odds only that becomes
    10201,10403,10605,10807,11009,11211...

    Right. Marking of multiples needs to start only at p*p, not
    at p+2.

    and as the loop increment is a constant it is nice and fast. With
    mod30 it isn't a constant, and it hurts to work out the increment.

    Except...

    If I mark 10201,13231,16261,19291,22321
    with an increment of 30*p,

    then go back to 10201, add 2p, then increment that by 30
    10201 13231 16261
    and do that for the other 6 of the 8, my inner loop has a constant
    increment, and my code is compatible with the mod30 store.

    Yes, reordering the loops this way might very well make things
    work better. It also has the property, I believe, that the
    bit to set is the same in each of the eight outer loops, and
    so can be computed only once for each of the eight outer loops.
    Very good!

    That might be quicker. I also have some thoughts over storing the
    prime as two parts. It's 30m+r, where m is modulus and r remainder. r
    maps one-for-one into the byte mask, and m is the index.

    This is what I've been saying all along!

    I'm going to waste lots more time on this I can see!

    I'm interested to see what you come up with.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Tue Sep 3 17:45:32 2024
    On 02/09/2024 04:40, Tim Rentsch wrote:
    Vir Campestris <[email protected]d> writes:

    On 27/08/2024 05:09, Bonita Montero wrote:

    Am 26.08.2024 um 21:08 schrieb Tim Rentsch:

    One motivation for choosing a mod30 representation is being able to
    compute more primes -- almost twice as many. Any machine with 64GB
    of memory should be able to compute primes up to 1.5 trillion.

    You don't need to hold all primes in memory. Just calculate the primes
    up to the square root and then you can calculate every segment up to
    the maximum from that.

    This is a brainless statement (and incidentally illustrates why I
    was motivated not to read postings from BM). In fact no primes need
    be pre-calculated to determine whether any given number is prime.
    The point of using a sieve is the sieve method is much faster than
    not using one. For example, suppose we want to determine all primes
    less than a trillion. Taking the approach of pre-computing only
    those primes less than a million (the square root) and then testing
    numbers individually takes more than 100 times as many operations as
    using a sieve. Furthermore the operations used are more expensive
    for the non-sieve approach - simply setting a bit in the case of a
    sieve, versus computing a remainder in the non-sieve case.


    Umm. I assumed you'd take the primes you had, then sieve them into
    several blocks to exceed your memory size.

    So you compute all the primes up to the square root of a larger number.

    I was cleaning up my odds-only code with a view to posting it, when I
    had an idea.

    Take 101 (because I can write things more easily).

    I want to mark

    10201,10302,10403,10504,10605,10706...

    With odds only that becomes
    10201,10403,10605,10807,11009,11211...

    Right. Marking of multiples needs to start only at p*p, not
    at p+2.

    and as the loop increment is a constant it is nice and fast. With
    mod30 it isn't a constant, and it hurts to work out the increment.

    Except...

    If I mark 10201,13231,16261,19291,22321
    with an increment of 30*p,

    then go back to 10201, add 2p, then increment that by 30
    10201 13231 16261
    and do that for the other 6 of the 8, my inner loop has a constant
    increment, and my code is compatible with the mod30 store.

    Yes, reordering the loops this way might very well make things
    work better. It also has the property, I believe, that the
    bit to set is the same in each of the eight outer loops, and
    so can be computed only once for each of the eight outer loops.
    Very good!

    It ought to be fast. Perhaps I'll have something one day. But not before
    I go on holiday, so it will be a few weeks.

    That might be quicker. I also have some thoughts over storing the
    prime as two parts. It's 30m+r, where m is modulus and r remainder. r
    maps one-for-one into the byte mask, and m is the index.

    This is what I've been saying all along!

    Ah, sorry.
    I'm going to waste lots more time on this I can see!

    I'm interested to see what you come up with.

    The code that follows is *NOT* doing anything with mod30, it's just a
    cleaned up, not using include files, version of my odds only program.

    The code that writes the primes out is not optimised. Not even slightly.
    And it's way slower than Bonita's version.

    But the code that does the calculation is a lot faster:

    $ time ./Bonita 100000000000 && time ./usenet 100000000000

    real 1m46.452s
    user 2m53.666s
    sys 0m32.260s

    781250001,40.4744
    594,0.0709076
    598,2.60332
    620,2.60697
    628,20.4567
    620,20.4572
    628,40.4744
    4294967295,40.4744

    real 0m40.723s
    user 0m39.104s
    sys 0m1.604s

    Andy
    --
    /*
    Programme to calculate primes using the Sieve or Eratosthenes
    Copyright (C) 2024 the person known as Vir Campestris

    This program is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program. If not, see <https://www.gnu.org/licenses/>.
    */

    #include <algorithm>
    #include <cassert>
    #include <chrono>
    #include <climits>
    #include <cmath>
    #include <csignal>
    #include <cstring>
    #include <fstream>
    #include <iomanip>
    #include <iostream>
    #include <list>
    #include <memory>
    #include <new>
    #include <thread>
    #include <vector>

    // If these trigger you have a really odd architecture. static_assert(sizeof(uint8_t) == 1);
    static_assert(CHAR_BIT == 8);

    // The type of a prime number.
    // size_t isn't big enough on a 32-bit machine with 2GB memory allocated
    and 16 primes per byte.
    typedef uint64_t PRIME;

    // This is the type of the word I operate in. The code will run OK with
    uin8_t or uint32_t,
    // but *on my computer* it's slower.
    typedef uint64_t StoreType;

    // tuneable constants
    constexpr size_t tinyCount = 5;
    constexpr size_t smallCount = 10;
    constexpr size_t bigCache = 16384;

    // this is a convenience class for timing.
    class Stopwatch {
    typedef std::chrono::steady_clock CLOCK;
    typedef std::pair<unsigned, std::chrono::time_point<CLOCK> > LAP;
    typedef std::vector<LAP> LAPS;
    LAPS mLaps;
    public:
    typedef std::vector<std::pair<unsigned, double>> LAPTIMES;
    Stopwatch() {}

    // record the start of an interval
    void start()
    {
    mLaps.clear();
    mLaps.emplace_back(std::make_pair(0, CLOCK::now()));
    }

    void lap(unsigned label)
    {
    mLaps.emplace_back(std::make_pair(label, CLOCK::now()));
    }

    // record the end of an interval
    void stop()
    {
    mLaps.emplace_back(std::make_pair(-1, CLOCK::now()));
    }

    // report the difference between the last start and stop
    double delta() const
    {
    assert(mLaps.size() >= 2);
    assert(mLaps.front().first == 0);
    assert(mLaps.back().first == -1);
    return std::chrono::duration_cast<std::chrono::nanoseconds>(mLaps.back().second
    - mLaps.front().second).count() / (double)1e9;
    }

    LAPTIMES const laps() const
    {
    LAPTIMES ret;
    assert(mLaps.size() >= 2);
    ret.reserve(mLaps.size() - 1);
    auto lap = mLaps.begin();
    auto start = lap->second;

    for(++lap; lap != mLaps.end(); ++lap)
    {
    ret.emplace_back(
    std::make_pair(
    lap->first,

    std::chrono::duration_cast<std::chrono::nanoseconds>(lap->second - start).count() / (double)1e9));

    }
    return ret;
    }
    };

    // Internal class used to store the mask data for one or more
    // primes rather than all primes. This has an initial block,
    // followed by a block which contains data which is repeated
    // for all higher values.
    // Example: StoreType== uint8_t, prime == 3, the mask should be
    // 90 meaning 1,3,5,7,11,13 are prime but not 9, 15
    // 24,49,92 covers odds 17-63
    // 24,49,92 covers 65-111, with an identical mask
    // As the mask is identical we only need to store the
    // non-repeat, plus one copy of the repeated block and data to
    // show where it starts and ends. Note that for big primes the
    // initial block may be more than one StoreType.
    // As there are no common factors between any prime and the
    // number of bits in StoreType the repeat is the size of the
    // prime.
    // Note also that a masker for two primes will have an
    // initial block which is as big as the larger of the sources,
    // and a repeat which is the product of the sources.
    class Masker final
    {
    // The length of the initial block.
    size_t initSize;

    // The size of the repeated part. The entire store size is the
    sum of these two.
    size_t repSize;

    // Buffer management local class. I started using std::vector
    for the store,
    // but I found it was slow for large sizes as it insists on
    calling the constructor
    // for each item in turn. And there may be billions of them.
    class Buffer final
    {
    StoreType* mpBuffer; // base of the data referred to
    size_t mSize; // size of the buffer in
    StoreType-s.

    public:
    // Normal constructor
    Buffer(size_t size = 0) : mpBuffer(nullptr), mSize(size)
    {
    if (size > 0)
    {
    mpBuffer = static_cast<StoreType*>(std::malloc(size * sizeof(StoreType)));

    if (!mpBuffer)
    throw std::bad_alloc();
    }
    }

    Buffer(Buffer&) = delete;

    // Move constructor
    Buffer(Buffer&& source)
    : mpBuffer(source.mpBuffer), mSize(source.mSize)
    {
    source.mSize = 0;
    source.mpBuffer = nullptr;
    };

    Buffer& operator=(Buffer&) = delete;

    // Move assignment
    Buffer& operator=(Buffer&& source)
    {
    if (mpBuffer)
    std::free(mpBuffer);
    mpBuffer = source.mpBuffer;
    mSize = source.mSize;
    source.mSize = 0;
    source.mpBuffer = nullptr;
    return *this;
    }

    void resize(size_t newSize)
    {
    mpBuffer = static_cast<StoreType*>(std::realloc(mpBuffer, newSize *
    sizeof(StoreType)));
    if (!mpBuffer)
    throw std::bad_alloc();
    mSize = newSize;
    }

    // Get the data start
    inline StoreType* operator*() const
    {
    return mpBuffer;
    }

    // Get the owned size
    inline size_t const & size() const
    {
    return mSize;
    }

    // clean up.
    ~Buffer()
    {
    if (mpBuffer)
    std::free(mpBuffer);
    }
    } mBuffer;

    // Raw pointer to the store for speed.
    StoreType * mStorePtr;

    // shift from the prime value to the index in mStore
    static constexpr size_t wordshift = sizeof(StoreType) == 1 ? 4
    : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
    static_assert(wordshift);

    // Mask for the bits in the store that select a prime.
    static constexpr size_t wordmask = (StoreType)-1 >>
    (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

    // convert a prime to the index of a word in the store
    static inline size_t index(PRIME const & prime)
    {
    return prime >> wordshift;
    }

    // get a mask representing the bit for a prime in a word
    static inline StoreType mask(PRIME const & prime)
    {
    return (StoreType)1 << ((prime >> 1) & wordmask);
    }


    public:
    // This class allows us to iterate through a Masker
    indefinitely, retrieving words,
    // automatically wrapping back to the beginning of the repat
    part when the end is met.
    class MaskIterator
    {
    StoreType const * index, *repStart, *repEnd;
    public:
    MaskIterator(Masker const * origin)
    : index(origin->mStorePtr)
    , repStart(index + origin->initSize)
    , repEnd(repStart + origin->repSize)
    {}

    inline StoreType const & operator*() const
    {
    return *index;
    }

    inline StoreType next() // returns value with iterator post-increment
    {
    auto & retval = *index;
    if (++index >= repEnd)
    index = repStart;
    return retval;
    }
    };

    // Default constructor. Makes a dummy mask for overwriting.
    Masker()
    : initSize(0)
    , repSize(1)
    , mBuffer(1)
    , mStorePtr(*mBuffer)
    {
    *mStorePtr = 0; // nothing marked unprime
    }

    // Move constructor (destroys source)
    Masker(Masker&& source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(std::move(source.mBuffer))
    , mStorePtr(*mBuffer)
    {
    }

    // Copy constructor (preserves source)
    Masker(Masker& source)
    : initSize(source.initSize)
    , repSize(source.repSize)
    , mBuffer(initSize + repSize)
    , mStorePtr(*mBuffer)
    {
    memcpy(*mBuffer, *source.mBuffer, mBuffer.size() * sizeof(StoreType));
    }

    // move assignment (destroys source)
    Masker& operator=(Masker&& source)
    {
    initSize = source.initSize;
    repSize = source.repSize;
    mStorePtr = source.mStorePtr;
    mBuffer = std::move(source.mBuffer);
    return *this;
    }

    // Construct for a single prime
    Masker(PRIME prime)
    : initSize(this->index(prime)+1)
    , repSize(prime)
    , mBuffer(initSize + prime)
    , mStorePtr(*mBuffer)
    {
    // Pre-zero the buffer, because for bigger primes we
    don't write every word.
    // This is fast enough not to care about excess writes.
    memset(mStorePtr, 0, mBuffer.size() * sizeof(StoreType));

    // The max prime we'll actually store.
    // *2 because we don't store evens.
    PRIME const maxPrime = (initSize + repSize) * CHAR_BIT
    * sizeof(StoreType) * 2;

    // Filter all the bits. Note that although we
    // don't strictly need to filter from prime*3,
    // but only from prime**2, doing from *3 makes
    // the init block smaller.
    PRIME const prime2 = prime * 2;

    for (PRIME value = prime * 3; value < maxPrime; value
    += prime2)
    {
    mStorePtr[this->index(value)] |= this->mask(value);
    }
    }

    // Construct from two others, making a mask that excludes all
    numbers
    // marked not prime in either of them. The output init block
    // will be the largest from either of the inputs; the repeat
    block size will be the
    // product of the input repeat sizes.
    // The output could get quite large - 3.7E12 for all primes up
    to 37
    Masker(const Masker& left, const Masker& right, size_t
    storeLimit = 0)
    : initSize(std::max(left.initSize, right.initSize))
    , repSize (left.repSize * right.repSize)
    , mBuffer()
    {
    if (storeLimit)
    repSize = std::min(initSize + repSize,
    storeLimit) - initSize;

    auto storeSize = initSize + repSize;

    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;

    // get iterators to the two inputs. These automatically
    wrap
    // when their repsize is reached.
    auto li = left.begin();
    auto ri = right.begin();

    for (size_t word = 0; word < storeSize; ++word)
    {
    mStorePtr[word] = li.next() | ri.next();
    }
    }

    // Construct from several others, making a mask that excludes
    all numbers
    // marked not prime in any of them. The output init block
    // will be the largest from any of the inputs; the repeat size
    will be the
    // product of all the input repeat sizes.
    // The output could get quite large - 3.7E12 for all primes up
    to 37
    // the type iterator should be an iterator into a collection of Masker-s.
    template <typename iterator> Masker(const iterator& begin,
    const iterator& end, size_t storeLimit = 0)
    : initSize(0)
    , repSize(1)
    , mBuffer() // empty for now
    {
    // Iterate over the inputs. We will
    // * Determine the maximum init size
    // * Determine the product of all the repeat sizes.
    // * Record the number of primes we represent.
    size_t nInputs = std::distance(begin, end);
    std::vector<MaskIterator> iterators;
    iterators.reserve(nInputs);

    for (auto i = begin; i != end; ++i)
    {
    initSize = std::max(initSize, i->initSize);
    repSize *= i->repSize;
    iterators.push_back(i->begin());
    }
    if (storeLimit)
    repSize = std::min(initSize + repSize,
    storeLimit) - initSize;
    auto storeSize = initSize + repSize;
    // Actually construct the store with the desired size
    mBuffer = storeSize;
    mStorePtr = *mBuffer;

    // take the last one off (most efficient to remove)
    // and use it as the initial mask value
    auto last = iterators.back();
    iterators.pop_back();
    for (auto word = 0; word < storeSize; ++word)
    {
    StoreType mask = last.next();
    for(auto &i: iterators)
    {
    mask |= i.next();
    }
    mStorePtr[word] = mask;
    }
    }

    template <typename iterator> Masker& andequals(size_t
    cachesize, iterator begin, iterator end)
    {
    static constexpr size_t wordshift = sizeof(StoreType)
    == 1 ? 4 : sizeof(StoreType) == 4 ? 6 : sizeof(StoreType) == 8 ? 7 : 0;
    static constexpr size_t wordmask = (StoreType)-1 >> (sizeof(StoreType) * CHAR_BIT + 1 - wordshift);

    struct value
    {
    PRIME step; // this is the
    prime. The step should be 2P, but only half the bits are stored
    PRIME halfMultiple; // this is the
    current multiple, _without_ the last bit
    value(PRIME prime): step(prime), halfMultiple((prime*prime) >> 1) {}
    bool operator<(PRIME halfLimit) const
    {
    return halfMultiple < halfLimit;
    }
    void next(StoreType * store)
    {
    static constexpr size_t wsm1 =
    wordshift - 1;
    auto index = halfMultiple >> wsm1;
    auto mask = (StoreType)1 <<
    (halfMultiple & wordmask);
    *(store + index) |= mask;
    halfMultiple += step;
    }
    };
    std::vector<value> values;
    values.reserve(std::distance(begin, end));
    for (auto prime = begin; prime != end; ++prime)
    {
    auto p = *prime;
    values.emplace_back(p);
    }

    size_t limit = 0;
    do
    {
    limit = std::min(limit + cachesize, initSize +
    repSize + 1);
    PRIME halfMaxPrime = (limit * 16 *
    sizeof(StoreType)) / 2 + 1;
    for (auto & value: values)
    {
    while (value < halfMaxPrime)
    value.next(mStorePtr);
    }
    }
    while (limit < initSize + repSize);

    return *this;
    }

    // duplicates the data up to the amount needed for a prime newSize
    void resize(PRIME newSize)
    {
    size_t sizeWords = this->index(newSize) + 1;
    assert(sizeWords > (initSize + repSize)); // not
    allowed to shrink!
    mBuffer.resize(sizeWords);
    mStorePtr = *mBuffer;

    auto copySource = mStorePtr + initSize;
    do
    {
    auto copySize = std::min(sizeWords - repSize - initSize, repSize);
    auto dest = copySource + repSize;
    memcpy(dest, copySource, copySize * sizeof(StoreType));
    repSize += copySize;
    } while ((initSize + repSize) < sizeWords);
    }

    // returns true if this mask thinks value is prime.
    bool get(PRIME value) const
    {
    assert(value < sizeof(StoreType) * CHAR_BIT * 2 * mBuffer.size());
    auto ret =
    (value <= 3)
    || ((value & 1) &&
    (mStorePtr[this->index(value)] & this->mask(value)) == 0);

    return ret;
    }

    // Get the beginning of the bitmap.
    // Incrementing this iterator can continue indefinitely,
    regardless of the bitmap size.
    MaskIterator begin() const
    {
    return MaskIterator(this);
    }

    size_t repsize() const
    {
    return repSize;
    }

    size_t size() const
    {
    return initSize + repSize;
    }

    // prints a collection to stderr
    void dump(size_t limit = 0) const
    {
    std::cerr << std::dec << repSize;
    std::cerr << std::hex;

    if (limit == 0) limit = initSize + repSize;

    auto iter = begin();
    for (auto i = 0; i < limit; ++i)
    {
    // cast prevents attempting to print uint8_t as
    a char
    std::cerr << ',' << std::setfill('0') << std::setw(sizeof(StoreType) * 2) << uint64_t(mStorePtr[i]);
    }
    std::cerr << std::endl;
    std::cerr << std::dec << repSize;

    for (auto p = 1; p < limit * 16 * sizeof(StoreType); ++p)
    {
    if (get(p))
    {
    std::cerr << ',' << p;
    }
    }
    std::cerr << std::endl;
    }
    };

    int main(int argc, char**argv)
    {
    // find out if the user asked for a different number of primes
    PRIME PrimeCount = 1e9;
    if (argc >= 2)
    {
    std::istringstream arg1(argv[1]);
    std::string crud;
    arg1 >> PrimeCount >> crud;
    if (!crud.empty() || PrimeCount < 3)
    {
    double isitfloat;
    arg1.seekg(0, arg1.beg);
    crud.clear();
    arg1 >> isitfloat >> crud;
    if (!crud.empty() || isitfloat < 3)
    {
    std::cerr << "Invalid first argument
    \"" << argv[1] << "\" Should be decimal number of primes required\n";
    exit(42);
    }
    else PrimeCount = isitfloat;
    }
    }

    std::string outName;
    if (argc >= 3)
    {
    outName = argv[2];
    }

    Stopwatch s; s.start();

    Masker primes;
    PRIME nextPrime;
    {
    nextPrime = 3;
    Masker tiny(nextPrime);
    std::vector<Masker> smallMasks;
    smallMasks.emplace_back(tiny);

    // Make a mask for the really small primes, 3 5 7 11
    size_t count = 1; // allows for the 3
    for ( ; count < tinyCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    tiny = Masker(tiny, nextPrime);
    }

    // that masker for three will be correct up until 5
    squared,
    // the first non-prime whose smallest root is not 2 or 3
    assert (nextPrime < 25 && "If this triggers tinyCount
    is too big and the code will give wrong results");

    // this limit is too small, but is easy to calculate
    and big enough
    auto limit = nextPrime*nextPrime;

    smallMasks.clear();
    for (; count < smallCount; ++count)
    {
    do
    {
    nextPrime += 2;
    } while (!tiny.get(nextPrime));
    smallMasks.emplace_back(nextPrime);
    }
    assert(nextPrime <= limit && "If this triggers
    smallCount is too big and the code will give wrong results");

    // Make a single masker for all the small primes. Don't
    make it bigger than needed.
    auto small = Masker(smallMasks.begin(),
    smallMasks.end(), PrimeCount / (2 * CHAR_BIT * sizeof(StoreType)) + 1);
    s.lap(__LINE__);

    // This is the first step that takes an appreciable
    time. 2.5s for primes up to 1e11 on my machine.
    primes = Masker(tiny, small, PrimeCount / (2 * CHAR_BIT
    * sizeof(StoreType)) + 1);
    s.lap(__LINE__);
    }

    // if the buffer isn't big enough yet expand it by duplicating
    early entries.
    if (primes.size() < PrimeCount / (2 * CHAR_BIT *
    sizeof(StoreType)) + 1)
    {
    primes.resize(PrimeCount);
    s.lap(__LINE__);
    }

    do
    {
    std::vector<PRIME> quickies;
    PRIME quickMaxPrime = std::min(nextPrime*nextPrime, PRIME(sqrt(PrimeCount) + 1));
    do
    {
    do
    {
    nextPrime += 2;
    } while (!primes.get(nextPrime));
    quickies.emplace_back(nextPrime);
    } while (nextPrime < quickMaxPrime);
    s.lap(__LINE__);

    // this function adds all the quickies into the mask,
    but does all of them in
    // one area of memory before moving onto the next. The
    area of memory,
    // and the list of primes, both fit into the L1 cache.
    // You may need to adjust bigCache for best performance.
    // This step takes a while - two lots of 20s for qe11
    primes.
    primes.andequals(bigCache, quickies.begin(),
    quickies.end());
    s.lap(__LINE__);
    } while (nextPrime*nextPrime < PrimeCount);

    s.stop();
    std::cerr << primes.size() << "," << s.laps().back().second << std::endl;
    for (auto l: s.laps()) std::cerr << l.first << "," << l.second
    << std::endl;

    if (outName.length())
    {
    std::ofstream outfile(outName);
    outfile << "2\n";
    for (PRIME prime = 3; prime < PrimeCount; prime += 2)
    {
    if (primes.get(prime))
    {
    outfile << prime << '\n';
    }
    }
    }

    return 0;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andrey Tarasevich@21:1/5 to red floyd on Sat Sep 28 15:21:29 2024
    On 08/22/24 5:00 PM, red floyd wrote:
    You can skip multiples of three, by starting with 7, and then
    alternately adding 4 then 2.

    ... by starting with 5 and then alternately adding 2 then 4, if you want
    to be follow the general idea of wheel factorization.

    Or by starting with 7 and then adding cyclically { 4, 2, 4, 2, 4, 6, 2, 6 }.

    And so forth...

    --
    Best regards,
    Andrey

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Keith Thompson on Wed Oct 2 11:44:50 2024
    On 28/09/2024 21:49, Keith Thompson wrote:
    Tim Rentsch <[email protected]> writes:
    Vir Campestris <[email protected]d> writes:

    [...]

    I tried to post a short followup, but I must have done something
    wrong because nothing went out. Long story short, I've been
    pulled away from looking at this further due to other tasks
    demanding my attention, and also because the Sieve of Aiken
    looks more promising (having been reminded recently of
    primegen by Daniel Bernstein).

    Typo: Sieve of Atkin.

    https://en.wikipedia.org/wiki/Sieve_of_Atkin

    In any case, thank you for the back and forth, it's been fun.

    Well, the holiday was fun (the Zeppelin museum on the Bodensee was
    especially interesting for geeks) but my wife picked up what we think
    was COVID in one of the hotels so she was a bit **** towards the end.

    Then she gave it to me :(

    However...

    I now have a working version storing all primes as (prime/30
    mask(prime%30) and tuned up.

    The disappointing thing is that it's not until you get to seriously big
    numbers - something in the 1e9 range - that this is faster than the code
    I posted before. And even then it's not _much_ faster. About 20% at 1e11.

    Over in comp.lang.c this came up too, and there's a link there to the
    Sieve of Atkin

    <https://en.wikipedia.org/wiki/Sieve_of_Atkin>

    and to a program

    <http://cr.yp.to/primegen.html>

    which implements it.

    I'm pleased to be able to report that that program, written for a 1998
    CPU, is slower than mine on my modern CPU.

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to All on Sat Sep 28 03:46:21 2024
    Vir Campestris <[email protected]d> writes:

    [...]

    I tried to post a short followup, but I must have done something
    wrong because nothing went out. Long story short, I've been
    pulled away from looking at this further due to other tasks
    demanding my attention, and also because the Sieve of Aiken
    looks more promising (having been reminded recently of
    primegen by Daniel Bernstein).

    In any case, thank you for the back and forth, it's been fun.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Oct 7 08:41:21 2024
    Vir Campestris <[email protected]d> writes:

    On 28/09/2024 21:49, Keith Thompson wrote:

    Tim Rentsch <[email protected]> writes:

    Vir Campestris <[email protected]d> writes:

    [...]

    I tried to post a short followup, but I must have done something
    wrong because nothing went out. Long story short, I've been
    pulled away from looking at this further due to other tasks
    demanding my attention, and also because the Sieve of Aiken
    looks more promising (having been reminded recently of
    primegen by Daniel Bernstein).

    Typo: Sieve of Atkin.

    https://en.wikipedia.org/wiki/Sieve_of_Atkin

    In any case, thank you for the back and forth, it's been fun.

    [...]

    I now have a working version storing all primes as (prime/30
    mask(prime%30) and tuned up.

    The disappointing thing is that it's not until you get to seriously
    big numbers - something in the 1e9 range - that this is faster than
    the code I posted before. And even then it's not _much_ faster. About
    20% at 1e11.

    This reaction, more broadly, has been rather surprising. The whole
    point of using a sieve is that it is asymptotically faster. Who
    cares about how fast primes under a billion, or ten billion, or
    fifty billion, can be computed? It's much more interesting to ask
    how high a limit can be searched in a day or a week of computing.
    (I don't mean just your comment, but comments from other folks
    bragging about how fast they can find all 32-bit primes.) I ran
    my earlier programs up to limits of 3 trillion or so.

    Over in comp.lang.c this came up too, and there's a link there to the
    Sieve of Atkin

    <https://en.wikipedia.org/wiki/Sieve_of_Atkin>

    and to a program

    <http://cr.yp.to/primegen.html>

    which implements it.

    I saw this link but didn't play around with the program.

    I'm pleased to be able to report that that program, written for a 1998
    CPU, is slower than mine on my modern CPU.

    What has been interesting is that these days prime-finding programs
    really need to be tailored to the hardware they will be run on.
    Somehow that seems like a step backwards. Fifty years ago programmers
    always thought about low-level hardware characteristics when writing
    programs. Over time the fraction of time spent worrying about such
    things has gotten smaller and smaller, to the point now where it is
    almost never important. But wanting to write a fast prime-finding
    program has taken us back to the glory days of yesteryear. ;)

    Incidentally, I found a program primesieve that's available under
    Ubuntu Linux. On a whim I had it find primes less than 2**49,
    which took slightly more than 12 hours elapsed time (running on 12
    cores). If computing primes up to N, a nice ratio to compute is

    (number of primes less than N) * log N / N

    which converges to 1, but very slowly. For primes less than 2**49
    I got 1.03135087927594382.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Vir Campestris@21:1/5 to Tim Rentsch on Sun Oct 20 12:44:21 2024
    On 07/10/2024 16:41, Tim Rentsch wrote:
    This reaction, more broadly, has been rather surprising. The whole
    point of using a sieve is that it is asymptotically faster. Who
    cares about how fast primes under a billion, or ten billion, or
    fifty billion, can be computed? It's much more interesting to ask
    how high a limit can be searched in a day or a week of computing.
    (I don't mean just your comment, but comments from other folks
    bragging about how fast they can find all 32-bit primes.) I ran
    my earlier programs up to limits of 3 trillion or so.

    My programs all require the sieved prime map to fit in RAM. Which in my
    system limits me to about 10^11 primes.

    It has occurred to me to extend it for much larger numbers, by paging
    out to disc files. But I've wasted far too much time on this already!

    Andy

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Vir Campestris on Mon Nov 4 03:56:28 2024
    Vir Campestris <[email protected]d> writes:

    On 07/10/2024 16:41, Tim Rentsch wrote:

    This reaction, more broadly, has been rather surprising. The whole
    point of using a sieve is that it is asymptotically faster. Who
    cares about how fast primes under a billion, or ten billion, or
    fifty billion, can be computed? It's much more interesting to ask
    how high a limit can be searched in a day or a week of computing.
    (I don't mean just your comment, but comments from other folks
    bragging about how fast they can find all 32-bit primes.) I ran
    my earlier programs up to limits of 3 trillion or so.

    My programs all require the sieved prime map to fit in RAM. Which in
    my system limits me to about 10^11 primes.

    It has occurred to me to extend it for much larger numbers, by paging
    out to disc files. But I've wasted far too much time on this already!

    I propose a new challenge: compute all primes up to a large number
    (e.g., 10e12) with a program that has a small memory footprint (and
    no storing intermediate values on disk, etc).

    I wrote a small program in C, with a memory footprint well under
    one megabyte, and found all primes under 10240000000000 (all it
    did was count them, which I checked using primesieve). About 150
    lines of fairly routine C code.

    If you want a sanity check there are 354080024725 primes less
    than 10240000000000.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)