• constexpr is really very smart!

    From Student Project@21:1/5 to All on Sun Dec 15 20:20:42 2024
    The constexpr is really very smart because it can speed up algorithms
    1000 times according to Dave, Microsoft retired engineer. He has proved it
    by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like in his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Student Project on Mon Dec 16 11:28:08 2024
    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up algorithms
    1000 times according to Dave, Microsoft retired engineer. He has
    proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like in
    his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.


    I didn't see the video (I never see that type of videos), but 270
    microseconds sound astonishingly slow for fib(35).

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    int main(int argz, char** argv)
    {
    if (argz < 2) {
    printf("Go away!\n");
    return 1;
    }

    char* endp;
    long n = strtol(argv[1], &endp, 0);
    if (endp == argv[1]) {
    printf("%s is not a number. Go away!\n", argv[1]);
    return 1;
    }

    unsigned long long t0 = _rdtsc();
    long long f = fib(n);
    unsigned long long t1 = _rdtsc();
    printf("fib(%ld)=%lld. %lld clock cycles.\n", n, f, t1 - t0);
    return 0;
    }

    $ gcc -s -O2 -Wall fib.c -o fib
    $ ./fib 35
    fib(35)=9227465. 212 clock cycles.

    212 cycles on this 11 y.o. CPU means 62 nanoseconds. Even that time, I
    would guess, is mostly a measurement overhead.
    $ ./fib 92
    fib(92)=7540113804746346429. 281 clock cycles.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Mon Dec 16 11:18:46 2024
    On 16/12/2024 10:28, Michael S wrote:
    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up algorithms
    1000 times according to Dave, Microsoft retired engineer. He has
    proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like in
    his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.


    I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }


    There are a dozen different ways to calculate Fibonacci numbers, ranging
    from very fast to very slow - but demonstrating different things. Even
    your program is going to be very slow compared to just using φⁿ / √5.

    My guess is that the OP is referring to some kind of naïve recursive
    Fibonacci implementation. If the function is declared "constexpr" and
    the values used are compile-time constants, then maybe the use of
    "constexpr" leads to more of it being pre-calculated at compile time.

    The OP is right that "constexpr" can be very "smart" - but I'm not sure
    this is the best calculation for demonstrating that. However, I have
    not watched the video either, and maybe it's well presented.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Mon Dec 16 13:23:15 2024
    On Mon, 16 Dec 2024 11:18:46 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 10:28, Michael S wrote:
    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up
    algorithms 1000 times according to Dave, Microsoft retired
    engineer. He has proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like
    in his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.


    I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }


    There are a dozen different ways to calculate Fibonacci numbers,
    ranging from very fast to very slow - but demonstrating different
    things. Even your program is going to be very slow compared to just
    using φⁿ / √5.


    My news reader is not sure about equation above.
    You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
    It works well with double precision math up to n=75. After that it
    would need high precision fp library, so probably [much] slower than
    simple loop above at least up to n=92 where we run out of range of int64 result.

    My guess is that the OP is referring to some kind of naïve recursive Fibonacci implementation.

    Something like that ?
    static long long slow_fib(long n)
    {
    if (n < 1)
    return 0;
    if (n == 1)
    return 1;
    return slow_fib(n-2)+slow_fib(n-1);
    }

    Yes, 270 usec could be considered fast relatively to that. slow_fib(35)
    takes 20 msec on my old PC.

    If the function is declared "constexpr"
    and the values used are compile-time constants, then maybe the use of "constexpr" leads to more of it being pre-calculated at compile time.


    But then, how long would it take to compile?
    I'd guess for n=35 it's not too bad, but what about n=92?
    Slow algorithm is a slow algorithm. Doing it in compile time can only exaggerate the slowness.

    The OP is right that "constexpr" can be very "smart" - but I'm not
    sure this is the best calculation for demonstrating that. However, I
    have not watched the video either, and maybe it's well presented.


    If it is video and it is about programming then it can not be good.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Mon Dec 16 13:43:11 2024
    On 16/12/2024 12:23, Michael S wrote:
    On Mon, 16 Dec 2024 11:18:46 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 10:28, Michael S wrote:
    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up
    algorithms 1000 times according to Dave, Microsoft retired
    engineer. He has proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like
    in his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.


    I didn't see the video (I never see that type of videos), but 270
    microseconds sound astonishingly slow for fib(35).

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }


    There are a dozen different ways to calculate Fibonacci numbers,
    ranging from very fast to very slow - but demonstrating different
    things. Even your program is going to be very slow compared to just
    using φⁿ / √5.


    My news reader is not sure about equation above.
    You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))

    Yes.

    It works well with double precision math up to n=75. After that it
    would need high precision fp library, so probably [much] slower than
    simple loop above at least up to n=92 where we run out of range of int64 result.

    Sure, there comes a limit to what works here. Doubles don't have the
    precision range of 64-bit integers. (You may also see inaccuracies
    build up with floating point even before the range of the mantissa
    bits.) Similarly, there is a limit to how high you can get with 64-bit integers. But certainly multi-precision integer arithmetic will be a
    lot faster than multi-precision floating point.

    There's also a big difference if you want to generate the sequence, or
    just the n'th number, or if you are only interested in the size of the
    n'th number, or some other property.



    My guess is that the OP is referring to some kind of naïve recursive
    Fibonacci implementation.

    Something like that ?
    static long long slow_fib(long n)
    {
    if (n < 1)
    return 0;
    if (n == 1)
    return 1;
    return slow_fib(n-2)+slow_fib(n-1);
    }

    Yes, 270 usec could be considered fast relatively to that. slow_fib(35)
    takes 20 msec on my old PC.


    That's what I was guessing, yes. Calculating Fibonacci numbers is often
    used as an example for showing how to write recursive functions with a structure like the one above, then looking at different techniques for improving them - such as using a recursive function that takes (n, a, b)
    and returns (n - 1, b, a + b), or perhaps memoization. (You could
    probably teach a term's worth of a Haskell course based entirely on formulations for calculating Fibonacci functions!)

    It can also be an interesting exercise to look at the speed of
    calculating Fibonacci numbers for different sizes. I expect your linear integer function will be fastest for small n, then powers of phi will be
    faster for a bit, then you'd switch back to integers of progressively
    larger but fixed sizes using the recursive formula for double n :

    fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
    fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

    As you get even bigger, you would then want to use fancier ways to do
    the multiplication - I don't know if those can be used in a way that
    interacts well with formulas for the Fibonacci numbers.

    I guess that can be left as an exercise for the OP !


    If the function is declared "constexpr"
    and the values used are compile-time constants, then maybe the use of
    "constexpr" leads to more of it being pre-calculated at compile time.


    But then, how long would it take to compile?
    I'd guess for n=35 it's not too bad, but what about n=92?
    Slow algorithm is a slow algorithm. Doing it in compile time can only exaggerate the slowness.


    My experience is that gcc will pre-calculate for a certain depth of
    recursion here, then give up and generate run-time code. But you would
    not have to pre-calculate or expand very far before it made a fair
    difference to the run-time of slow_fib(35).

    The OP is right that "constexpr" can be very "smart" - but I'm not
    sure this is the best calculation for demonstrating that. However, I
    have not watched the video either, and maybe it's well presented.


    If it is video and it is about programming then it can not be good.


    :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to David Brown on Tue Dec 17 11:08:02 2024
    On Mon, 16 Dec 2024 13:43:11 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 12:23, Michael S wrote:
    On Mon, 16 Dec 2024 11:18:46 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 10:28, Michael S wrote:



    My guess is that the OP is referring to some kind of naïve
    recursive Fibonacci implementation.

    Something like that ?
    static long long slow_fib(long n)
    {
    if (n < 1)
    return 0;
    if (n == 1)
    return 1;
    return slow_fib(n-2)+slow_fib(n-1);
    }

    Yes, 270 usec could be considered fast relatively to that.
    slow_fib(35) takes 20 msec on my old PC.


    That's what I was guessing, yes. Calculating Fibonacci numbers is
    often used as an example for showing how to write recursive functions
    with a structure like the one above, then looking at different
    techniques for improving them - such as using a recursive function
    that takes (n, a, b) and returns (n - 1, b, a + b), or perhaps
    memoization. (You could probably teach a term's worth of a Haskell
    course based entirely on formulations for calculating Fibonacci
    functions!)

    It can also be an interesting exercise to look at the speed of
    calculating Fibonacci numbers for different sizes. I expect your
    linear integer function will be fastest for small n, then powers of
    phi will be faster for a bit, then you'd switch back to integers of progressively larger but fixed sizes using the recursive formula for
    double n :

    fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
    fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

    As you get even bigger, you would then want to use fancier ways to do
    the multiplication - I don't know if those can be used in a way that interacts well with formulas for the Fibonacci numbers.


    There exist precise methods of integer multiplication with complexity
    of O(N*Log(N)) where N is number of digits. Personally, I never had a
    need for them, but I believe that they work.

    I guess that can be left as an exercise for the OP !


    Since OP's ideas of "fast" are very humble, he is likely to be
    satisfied with the very first step above naive:

    static long long less_slow_fib(long n)
    {
    if (n < 3)
    return n < 1 ? 0 : 1;
    return less_slow_fib(n-2)*2+less_slow_fib(n-3);
    }

    On my old PC is calculates fib(35) in 24 usec. That is more than
    10 times faster than his idea of fast.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Michael S on Tue Dec 17 11:10:45 2024
    On 17/12/2024 10:08, Michael S wrote:
    On Mon, 16 Dec 2024 13:43:11 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 12:23, Michael S wrote:
    On Mon, 16 Dec 2024 11:18:46 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 10:28, Michael S wrote:



    My guess is that the OP is referring to some kind of naïve
    recursive Fibonacci implementation.

    Something like that ?
    static long long slow_fib(long n)
    {
    if (n < 1)
    return 0;
    if (n == 1)
    return 1;
    return slow_fib(n-2)+slow_fib(n-1);
    }

    Yes, 270 usec could be considered fast relatively to that.
    slow_fib(35) takes 20 msec on my old PC.


    That's what I was guessing, yes. Calculating Fibonacci numbers is
    often used as an example for showing how to write recursive functions
    with a structure like the one above, then looking at different
    techniques for improving them - such as using a recursive function
    that takes (n, a, b) and returns (n - 1, b, a + b), or perhaps
    memoization. (You could probably teach a term's worth of a Haskell
    course based entirely on formulations for calculating Fibonacci
    functions!)

    It can also be an interesting exercise to look at the speed of
    calculating Fibonacci numbers for different sizes. I expect your
    linear integer function will be fastest for small n, then powers of
    phi will be faster for a bit, then you'd switch back to integers of
    progressively larger but fixed sizes using the recursive formula for
    double n :

    fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
    fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

    As you get even bigger, you would then want to use fancier ways to do
    the multiplication - I don't know if those can be used in a way that
    interacts well with formulas for the Fibonacci numbers.


    There exist precise methods of integer multiplication with complexity
    of O(N*Log(N)) where N is number of digits. Personally, I never had a
    need for them, but I believe that they work.

    I've never needed them either. In fact, in the case of the O(n . log n) algorithm, /nobody/ has needed it - the algorithm is only an improvement
    over higher big-O complexity algorithms for at least 2 ** 1729 ** 12
    digits. That means you need 39 decimal digits just to express the
    number of digits in the numbers you are going to multiply. Maybe the
    factor of 1729 will be reduced somewhat in the future, but I really
    don't see it being added to the GMP library any time soon :-)

    <https://mattermodeling.stackexchange.com/questions/1355/did-the-2019-discovery-of-on-logn-multiplication-have-a-practical-outcome>


    There are other algorithms that have practical uses when dealing with
    /really/ big numbers. But I think it will be very rare to need anything
    beyond Karatsuba, which is reasonably easy to understand - it splits the numbers into two halves, does a bit of re-arrangement, and ends up with
    3 smaller multiplications and some additions and subtractions instead of
    the traditional 4 smaller multiplications.

    (You'll find Karatsuba in RSA algorithms, but no algorithms beyond that.)



    I guess that can be left as an exercise for the OP !


    Since OP's ideas of "fast" are very humble, he is likely to be
    satisfied with the very first step above naive:

    static long long less_slow_fib(long n)
    {
    if (n < 3)
    return n < 1 ? 0 : 1;
    return less_slow_fib(n-2)*2+less_slow_fib(n-3);
    }

    On my old PC is calculates fib(35) in 24 usec. That is more than
    10 times faster than his idea of fast.



    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Dec 17 12:54:19 2024
    Michael S <[email protected]> writes:

    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up
    algorithms 1000 times according to Dave, Microsoft retired
    engineer. He has proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like
    in his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.

    I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

    Slow for the problem, but not slow for the algorithm. The
    point of the video was to compare relative speeds of an
    algorithm under two different compilation schemes (with and
    without constexpr), not to compare absolute speeds to solve
    the problem of computing fibonacci numbers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    My fastest fibonacci code uses recursion. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 01:33:42 2024
    On Tue, 17 Dec 2024 12:54:19 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up
    algorithms 1000 times according to Dave, Microsoft retired
    engineer. He has proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like
    in his example. It was almost instant at the blink of the eyes.

    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 270
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 257
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 171
    D:\CmdLine\C_Cpp\Chrono02>program
    Fibonacci_c: 9227465
    Time Taken: 176

    Amazing.

    I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

    Slow for the problem, but not slow for the algorithm. The
    point of the video was to compare relative speeds of an
    algorithm under two different compilation schemes (with and
    without constexpr), not to compare absolute speeds to solve
    the problem of computing fibonacci numbers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }




    From BigO perspective this code looks like it's still O(n) so no better
    than simple loop.
    According to my measurement gear, in range 0 to 92 there are few points
    where it is faster than simple loop, but in majority of cases it is
    slower.


    My fastest fibonacci code uses recursion. :)


    There is 1st atempts at recursive code with better than exponential BigO

    static long long fib(long n)
    {
    if (n < 3)
    return n < 1 ? 0 : 1;
    long a = n/2;
    long b = n-a;
    return fib(a)*fib(b-1)+fib(a+1)*fib(b);
    }

    O(n*n) - not exponential but still pretty bad.
    A small obvious modification (specialization) turns it into O(n):

    static long long fib(long n)
    {
    if (n < 3)
    return n < 1 ? 0 : 1;
    long a = n/2;
    long b = n-a;
    if (a == b) {
    long long f0 = fib(a-1);
    long long f1 = fib(a);
    long long f2 = f0+f1;
    return (f0+f2)*f1;
    } else { // b == a+1
    long long f0 = fib(a);
    long long f1 = fib(a+1);
    return f0*f0+f1*f1;
    }
    }

    Still, despite being the same as yours and as a simple loop in BigO
    sence, it is several times slower in absolute speed.

    I am pretty sure that better variant exists, but for tonight it's
    enough.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Student Project@21:1/5 to Keith Thompson on Wed Dec 18 04:45:20 2024
    On 17/12/2024 20:20, Keith Thompson wrote:
    Student Project <[email protected]d> writes:
    The constexpr is really very smart because it can speed up algorithms
    1000 times according to Dave, Microsoft retired engineer. He has proved it >> by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like in his
    example. It was almost instant at the blink of the eyes.

    Can you post the relevant code?

    [...]


    Code in the boxed area below: (The purpose is to demonstrate constexpr NOT the recursive algorithm. The video is very clear about this).


    It is the same code as in the video. G++ gives you the best result after changing one line to:

    /*constexpr*/ int result_c = fibonacci_c(num);

    G++ result is (multiple runs) - All timings in milliseconds: D:\CmdLine\C_Cpp\Chrono05>program
    Fibonacci_c 9227465
    Time taken: 9
    Fibonacci 9227465
    Time taken: 289

    D:\CmdLine\C_Cpp\Chrono05>program
    Fibonacci_c 9227465
    Time taken: 4
    Fibonacci 9227465
    Time taken: 276

    D:\CmdLine\C_Cpp\Chrono05>program
    Fibonacci_c 9227465
    Time taken: 8
    Fibonacci 9227465
    Time taken: 284

    D:\CmdLine\C_Cpp\Chrono05>program
    Fibonacci_c 9227465
    Time taken: 5
    Fibonacci 9227465
    Time taken: 276

    clang++ and Visual studio are the slowest. I don't know why.



    <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>
    #include <iostream>
    #include <chrono>

    int fibonacci(int n)
    {
    if (n <= 1)
    return n;
    return fibonacci(n - 2) + fibonacci(n - 1);
    }

    constexpr int fibonacci_c(int n)
    {
    if (n <= 1)
    return n;
    return fibonacci_c(n - 2) + fibonacci_c(n - 1);
    }

    int main(void)
    {
    // using namespace std::literals::chrono_literals;
    auto start = std::chrono::high_resolution_clock::now();
    constexpr int num = 35;
    /*constexpr*/ int result_c = fibonacci_c(num);

    std::cout << "Fibonacci_c " << result_c << "\n";
    std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
    - start).count() << "\n";

    start = std::chrono::high_resolution_clock::now();
    int result = fibonacci(num);
    std::cout << "Fibonacci " << result << "\n";
    std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
    - start).count() << "\n";

    }

    <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Wed Dec 18 13:51:31 2024
    On Wed, 18 Dec 2024 01:33:42 +0200
    Michael S <[email protected]> wrote:


    I am pretty sure that better variant exists, but for tonight it's
    enough.


    Morning fibs.

    Recursive:

    struct pair_t {
    long long a,b;
    };

    // return .a=fib(n), .b=fib(n+1)
    static struct pair_t fib2(long n)
    {
    if (n <= 0) {
    struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
    return ret;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long m = (n-1) >> 1;
    struct pair_t ret = fib2(m);
    long long a = ret.a, b = ret.b;
    long long c = a + b;
    if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
    ret.a = (a + c)*b;
    ret.b = b*b + c*c;
    } else { // (m,m+1) => (m*2+1,(m+1)*2)
    ret.a = a*a + b*b;
    ret.b = (a + c)*b;
    }
    return ret;
    }

    static long long fib(long n)
    {
    struct pair_t x = fib2(n-1);
    return x.b;
    }

    Iterative:

    static long long fib(long n)
    {
    if (n <= 0)
    return 0;
    // find MS bit
    unsigned long tmp = n;
    unsigned long bit = 1;
    while (tmp > 1) {
    bit += bit;
    tmp /= 2;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long long a = 0, b = 1;
    while (bit > 1) {
    long long c = a + b; // fib(n+1)
    bit /= 2;
    if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
    c += a;
    a = a*a + b*b;
    b = c*b;
    } else { // (n-1,n) => (n*2,n*2+1)
    a = (a + c)*b;
    b = b*b + c*c;
    }
    }
    return b;
    }


    Both variants above are O(log(n)).
    In practice for n in range 0 to 92 my measurement gear does not detect differences in speed vs simple loop. If anything, simple loop is a
    little faster.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 03:26:15 2024
    Michael S <[email protected]> writes:

    On Mon, 16 Dec 2024 11:18:46 +0100
    David Brown <[email protected]> wrote:

    On 16/12/2024 10:28, Michael S wrote:

    On Sun, 15 Dec 2024 20:20:42 +0000
    Student Project <[email protected]d> wrote:

    The constexpr is really very smart because it can speed up
    algorithms 1000 times according to Dave, Microsoft retired
    engineer. He has proved it by creating this video:

    <https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

    On my computer it took 270 microseconds to calculate fib(35) like
    in his example. It was almost instant at the blink of the eyes.

    [...]

    Amazing.

    I didn't see the video (I never see that type of videos), but 270
    microseconds sound astonishingly slow for fib(35).

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    There are a dozen different ways to calculate Fibonacci numbers,
    ranging from very fast to very slow - but demonstrating different
    things. Even your program is going to be very slow compared to
    just using
    [garbled; presumably ceil(pow((1+sqrt(5))/2, n)/sqrt(5))]

    No, it won't. The well-known formula for fibonacci numbers is
    mathematically exact but numerically poor; it gets wrong answers
    even for outputs well under 64 bits. It isn't all that fast either.

    My news reader is not sure about equation above.
    You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
    It works well with double precision math up to nu. After that it
    would need high precision fp library, so probably [much] slower
    than simple loop above at least up to n' where we run out of range
    of int64 result.

    Right. The simple loop is faster.

    My guess is that the OP is referring to some kind of naive
    recursive Fibonacci implementation.

    Something like that ?
    static long long slow_fib(long n)
    {
    if (n < 1)
    return 0;
    if (n == 1)
    return 1;
    return slow_fib(n-2)+slow_fib(n-1);
    }

    Yes, 270 usec could be considered fast relatively to that.
    slow_fib(35) takes 20 msec on my old PC.

    If the function is declared "constexpr" and the values used are
    compile-time constants, then maybe the use of "constexpr" leads
    to more of it being pre-calculated at compile time.

    But then, how long would it take to compile?
    I'd guess for n5 it's not too bad, but what about n'?
    Slow algorithm is a slow algorithm. Doing it in compile time can
    only exaggerate the slowness.

    That isn't right. A computation done at compile time isn't
    obliged to use the same algorithm; it is obliged only to get the
    same results that the code would. There is a simple way to take
    advantage of a function like the one shown above, but constexpr,
    to get a pre-compiled result much faster than simply plodding
    through a simulation. (Hint: it's a pure function whose output
    depends only on its input, and only a small number of inputs are
    needed for the overall result.)

    The OP is right that "constexpr" can be very "smart" - but I'm
    not sure this is the best calculation for demonstrating that.
    However, I have not watched the video either, and maybe it's well
    presented.

    If it is video and it is about programming then it can not be good.

    I think the video is not bad. Its primary purpose is to talk about
    uses of constexpr that are not completely simple. I've already
    played around with such uses so the video didn't have much to say to
    me, but for someone who hasn't done that there is a good chance it
    will help them go up the learning curve. At this stage of my
    programming education I don't get much benefit from tutorial-type
    content, but I think there are plenty of people who do.

    If I may express a personal view, while I don't begrudge the
    existence of tutorial material, I do find it irksome that there
    is very little material that addresses a more advanced audience,
    people who have a lot of background and experience but simply
    lack knowledge in a particular area. I'm tired of slogging
    through mush that talks to its readers like they are all still
    learning the basics. Being ignorant is not the same as being
    uneducated (and really not the same as being stupid.. ick).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 14:39:22 2024
    On Wed, 18 Dec 2024 03:26:15 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:


    My news reader is not sure about equation above.
    You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
    It works well with double precision math up to nu. After that it
    would need high precision fp library, so probably [much] slower
    than simple loop above at least up to n' where we run out of range
    of int64 result.


    It seems like your newsreader don't like combinations of characters that
    I considered benign. Like n=75 and n=92
    In case you want to see what was an original, it is here: https://www.novabbs.com/devel/article-flat.php?id=6042&group=comp.lang.c%2B%2B#6042

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Student Project on Wed Dec 18 14:17:40 2024
    On Wed, 18 Dec 2024 04:45:20 +0000
    Student Project <[email protected]d> wrote:


    <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>
    #include <iostream>
    #include <chrono>

    int fibonacci(int n)
    {
    if (n <= 1)
    return n;
    return fibonacci(n - 2) + fibonacci(n - 1);
    }

    constexpr int fibonacci_c(int n)
    {
    if (n <= 1)
    return n;
    return fibonacci_c(n - 2) + fibonacci_c(n - 1);
    }

    int main(void)
    {
    // using namespace std::literals::chrono_literals;
    auto start = std::chrono::high_resolution_clock::now();
    constexpr int num = 35;
    /*constexpr*/ int result_c = fibonacci_c(num);

    std::cout << "Fibonacci_c " << result_c << "\n";
    std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
    - start).count() << "\n";

    start = std::chrono::high_resolution_clock::now();
    int result = fibonacci(num);
    std::cout << "Fibonacci " << result << "\n";
    std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
    - start).count() << "\n";

    }

    <+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


    It is likely that at least part of the time you are measuring cout print
    time instead of calculations time you are interested in.
    Also, if you are using gcc on Windows then there is a significant chance
    that implementation of high_resolution_clock is broken. steady_clock is
    more reliable.

    Try following:
    volatile dummy = 0;
    auto t00 = std::chrono::steady_clock::now();// first call can be slow
    dummy = 1;
    auto t0 = std::chrono::steady_clock::now();
    dummy = 2;
    auto result_c = fibonacci_c(num);
    dummy = 3;
    auto t1 = std::chrono::steady_clock::now();
    dummy = 4;
    auto result = fibonacci(num);
    dummy = 5;
    auto t2 = std::chrono::steady_clock::now();
    dummy = 3;

    etc...

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 07:42:39 2024
    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 03:26:15 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:


    My news reader is not sure about equation above.
    You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
    It works well with double precision math up to nu. After that it
    would need high precision fp library, so probably [much] slower
    than simple loop above at least up to n' where we run out of range
    of int64 result.


    It seems like your newsreader don't like combinations of characters that
    I considered benign. Like n=75 and n=92
    In case you want to see what was an original, it is here:
    [link]

    Yes, that is peculiar. I should look into that. Thank you
    for bringing it to my attention.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 09:45:38 2024
    Michael S <[email protected]> writes:

    On Tue, 17 Dec 2024 12:54:19 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    [...]
    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    From BigO perspective this code looks like it's still O(n) so no
    better than simple loop.

    It is O(n) but it has a smaller constant.

    According to my measurement gear, in range 0 to 92 there are few
    points where it is faster than simple loop, but in majority of
    cases it is slower.

    I'm at a loss to understand how this could happen. In my own
    measurements, the code shown above runs faster than a simple loop in
    all cases above n > 12, more than twice as fast when n > 17, more
    than three times as fast when n > 42, and going up from there. What
    might account for these radically different results?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 22:00:06 2024
    On Wed, 18 Dec 2024 09:45:38 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Tue, 17 Dec 2024 12:54:19 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    [...]
    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    From BigO perspective this code looks like it's still O(n) so no
    better than simple loop.

    It is O(n) but it has a smaller constant.

    According to my measurement gear, in range 0 to 92 there are few
    points where it is faster than simple loop, but in majority of
    cases it is slower.

    I'm at a loss to understand how this could happen. In my own
    measurements, the code shown above runs faster than a simple loop in
    all cases above n > 12, more than twice as fast when n > 17, more
    than three times as fast when n > 42, and going up from there. What
    might account for these radically different results?


    May be, number of repetitions?
    I run the test only once. That gives relative advantage to smaller
    code which is less sensitive to cold ICache and to cold branch
    predictors.
    Also, because I am running it only once, my test is not so good in
    seeing differences between various "good" algorithms.
    But I feel that running test once is more realistic.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Thu Dec 19 00:40:32 2024
    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 01:33:42 +0200
    Michael S <[email protected]> wrote:

    I am pretty sure that better variant exists, but for tonight it's
    enough.

    Morning fibs.

    Recursive:

    struct pair_t {
    long long a,b;
    };

    // return .a=fib(n), .b=fib(n+1)
    static struct pair_t fib2(long n)
    {
    if (n <= 0) {
    struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
    return ret;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long m = (n-1) >> 1;
    struct pair_t ret = fib2(m);
    long long a = ret.a, b = ret.b;
    long long c = a + b;
    if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
    ret.a = (a + c)*b;
    ret.b = b*b + c*c;
    } else { // (m,m+1) => (m*2+1,(m+1)*2)
    ret.a = a*a + b*b;
    ret.b = (a + c)*b;
    }
    return ret;
    }

    static long long fib(long n)
    {
    struct pair_t x = fib2(n-1);
    return x.b;
    }

    I may have inadvertently misled you. Here is a simple linear
    formulation that uses recursion:

    typedef unsigned long long NN;

    static NN fibonacci_3( NN, NN, unsigned );

    NN
    fibonacci( unsigned n ){
    return fibonacci_3( 1, 0, n );
    }

    NN
    fibonacci_3( NN a, NN b, unsigned n ){
    return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
    }

    Can you apply this idea to your logarithmic scheme?


    Iterative:

    static long long fib(long n)
    {
    if (n <= 0)
    return 0;
    // find MS bit
    unsigned long tmp = n;
    unsigned long bit = 1;
    while (tmp > 1) {
    bit += bit;
    tmp /= 2;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long long a = 0, b = 1;
    while (bit > 1) {
    long long c = a + b; // fib(n+1)
    bit /= 2;
    if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
    c += a;
    a = a*a + b*b;
    b = c*b;
    } else { // (n-1,n) => (n*2,n*2+1)
    a = (a + c)*b;
    b = b*b + c*c;
    }
    }
    return b;
    }


    Both variants above are O(log(n)).
    In practice for n in range 0 to 92 my measurement gear does not detect differences in speed vs simple loop. If anything, simple loop is a
    little faster.

    Yes, the log(n) can be beaten by simpler schemes for small n.

    I wrote a (recursive) log(n) fibonacci in python. It was able
    to compute fibonacci( 10000000 ) in just under 3 seconds.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Thu Dec 19 00:57:20 2024
    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 09:45:38 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Tue, 17 Dec 2024 12:54:19 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    [...]

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    From BigO perspective this code looks like it's still O(n) so no
    better than simple loop.

    It is O(n) but it has a smaller constant.

    According to my measurement gear, in range 0 to 92 there are few
    points where it is faster than simple loop, but in majority of
    cases it is slower.

    I'm at a loss to understand how this could happen. In my own
    measurements, the code shown above runs faster than a simple loop in
    all cases above n > 12, more than twice as fast when n > 17, more
    than three times as fast when n > 42, and going up from there. What
    might account for these radically different results?

    May be, number of repetitions?
    I run the test only once. That gives relative advantage to smaller
    code which is less sensitive to cold ICache and to cold branch
    predictors.

    That's an interesting idea. Can you also run a measurement where
    the code is run inside loops? I think it would be instructive
    to compare the results under the two approaches.

    (The server I use seems not to have the necessary support to
    measure very fast code segments that are run only once.)

    Also, because I am running it only once, my test is not so good in
    seeing differences between various "good" algorithms.
    But I feel that running test once is more realistic.

    I think an argument could be made that repeatedly running the code
    in a loop gives a more relevant result. Any short code segment that
    is run only a handful of times will never have a significant impact
    on overall program performance. It's better to optimize under the
    assumption that the code will be "hot": if it is hot then the
    optimization will be worthwhile, and if it isn't hot then the loss
    will be way down in the noise and not worth worrying about.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Student Project@21:1/5 to Michael S on Thu Dec 19 19:14:19 2024
    On 18/12/2024 11:51, Michael S wrote:
    Iterative:

    static long long fib(long n)
    {
    if (n <= 0)
    return 0;
    // find MS bit
    unsigned long tmp = n;
    unsigned long bit = 1;
    while (tmp > 1) {
    bit += bit;
    tmp /= 2;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long long a = 0, b = 1;
    while (bit > 1) {
    long long c = a + b; // fib(n+1)
    bit /= 2;
    if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
    c += a;
    a = a*a + b*b;
    b = c*b;
    } else { // (n-1,n) => (n*2,n*2+1)
    a = (a + c)*b;
    b = b*b + c*c;
    }
    }
    return b;
    }

    Here is my test result of this iterative function:

    D:\CmdLine\C_Cpp\Chrono06>program
    Testing fibonacci function
    fib(92) = 7540113804746346429
    Time taken by fib: 0 microseconds

    D:\CmdLine\C_Cpp\Chrono06>program
    Testing fibonacci function
    fib(92) = 7540113804746346429
    Time taken by fib: 0 microseconds

    D:\CmdLine\C_Cpp\Chrono06>program
    Testing fibonacci function
    fib(92) = 7540113804746346429
    Time taken by fib: 0 microseconds

    D:\CmdLine\C_Cpp\Chrono06>program
    Testing fibonacci function
    fib(92) = 7540113804746346429
    Time taken by fib: 0 microseconds

    The full program to run in G++ is:

    <+++++++++++++++++++++++++++++++++++++++++++++>

    #include <iostream>
    #include <chrono>

    using namespace std;
    using namespace std::chrono;

    long long fib(long n)
    {
    if (n <= 0)
    return 0;
    // find MS bit
    unsigned long tmp = n;
    unsigned long bit = 1;
    while (tmp > 1)
    {
    bit += bit;
    tmp /= 2;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long long a = 0, b = 1;
    while (bit > 1)
    {
    long long c = a + b; // fib(n+1)
    bit /= 2;
    if ((n & bit) == 0)
    { // (n-1,n) => (n*2-1,n*2)
    c += a;
    a = a * a + b * b;
    b = c * b;
    }
    else
    { // (n-1,n) => (n*2,n*2+1)
    a = (a + c) * b;
    b = b * b + c * c;
    }
    }
    return b;
    }

    int main(void)
    {
    cout << "Testing fibonacci function\n";
    auto start = high_resolution_clock::now();
    constexpr int num = 92;
    long long result = fib (num);
    auto stop = high_resolution_clock::now();

    auto duration = duration_cast<microseconds>(stop - start);
    cout << "fib(" << num << ") = " << result << "\n";
    cout << "Time taken by fib: " << duration.count() << " microseconds\n";

    }

    <+++++++++++++++++++++++++++++++++++++++++++++>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Thu Dec 19 23:45:49 2024
    On Thu, 19 Dec 2024 00:57:20 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 09:45:38 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Tue, 17 Dec 2024 12:54:19 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    [...]

    #include <stdio.h>
    #include <stdlib.h>
    #include <intrin.h>

    static long long fib(long n)
    {
    if (fib <= 0)
    return 0;
    long long f0 = 0, f1 = 1;
    for (long i = 1; i < n; ++i) {
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    return f1;
    }

    Here is my second fastest fibonacci calculation code (for
    relatively small inputs):

    typedef long unsigned long ULL;

    ULL
    fibonacci( unsigned n ){
    ULL b = n&1, a = b^1;

    if( n & 2 ) a += b, b += a;
    if( n & 4 ) a += b, b += a, a += b, b += a;
    if( n & 8 ){
    ULL na = 13*a+21*b, nb = 21*a+34*b;
    a = na, b = nb;
    }

    n >>= 4;
    while( n-- ){
    ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
    a = na, b = nb;
    }

    return b;
    }

    From BigO perspective this code looks like it's still O(n) so no
    better than simple loop.

    It is O(n) but it has a smaller constant.

    According to my measurement gear, in range 0 to 92 there are few
    points where it is faster than simple loop, but in majority of
    cases it is slower.

    I'm at a loss to understand how this could happen. In my own
    measurements, the code shown above runs faster than a simple loop
    in all cases above n > 12, more than twice as fast when n > 17,
    more than three times as fast when n > 42, and going up from
    there. What might account for these radically different results?

    May be, number of repetitions?
    I run the test only once. That gives relative advantage to smaller
    code which is less sensitive to cold ICache and to cold branch
    predictors.

    That's an interesting idea. Can you also run a measurement where
    the code is run inside loops? I think it would be instructive
    to compare the results under the two approaches.

    (The server I use seems not to have the necessary support to
    measure very fast code segments that are run only once.)

    Also, because I am running it only once, my test is not so good in
    seeing differences between various "good" algorithms.
    But I feel that running test once is more realistic.

    I think an argument could be made that repeatedly running the code
    in a loop gives a more relevant result. Any short code segment that
    is run only a handful of times will never have a significant impact
    on overall program performance. It's better to optimize under the
    assumption that the code will be "hot": if it is hot then the
    optimization will be worthwhile, and if it isn't hot then the loss
    will be way down in the noise and not worth worrying about.


    I feel that running fib(n) with the same n in loop too unrealistic.
    So I decided to run, at least, with different values of n.
    Ended up spending about a hour just to build a test bench.

    The answer is - in a loop of more than dozen iterations your code is
    indeed faster. Esp. so for hundred or more iterations.

    Here is my test bench.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <intrin.h>

    long long fib(long n);
    volatile long long dummy;

    int main(int argz, char** argv)
    {
    if (argz < 2) {
    printf("Go away!\n");
    return 1;
    }

    char* endp;
    long n = strtol(argv[1], &endp, 0);
    if (endp == argv[1]) {
    printf("%s is not a number. Go away!\n", argv[1]);
    return 1;
    }

    if (argz == 2) {
    unsigned long long t0 = _rdtsc();
    long long f = fib(n);
    unsigned long long t1 = _rdtsc();
    printf("fib(%ld)=%lld. %lld clock cycles.\n", n, f, t1 - t0);
    return 0;
    }

    // argz > 2
    if (argv[2][0]=='t') {
    long long f0 = 0, f1 = 1;
    for (long i = 0; i <= n; ++i) {
    long long res = fib(i);
    if (res != f0) {
    printf("Mistake at n = %ld. %lld instead of %lld\n"
    , i, res, f0);
    return 1;
    }
    long long f2 = f0 + f1;
    f0 = f1;
    f1 = f2;
    }
    printf("o.k.\n");
    return 0;
    }

    long m = strtol(argv[2], &endp, 0);
    if (endp == argv[2]) {
    printf("%s is not a number. Go away!\n", argv[2]);
    return 1;
    }

    if (n > m) {
    long tmp = n; n = m; m = tmp;
    }

    unsigned long dn = m - n + 1;
    if (dn > 1000000) {
    printf(
    "Range [%ld..%ld] is wider than 1M. Unsupported. I am sorry.\n"
    , n, m);
    return 1;
    }

    size_t n_iter = 10;
    if (argz > 3) { // get non-default number of iterations
    unsigned long long val = strtoull(argv[3], &endp, 0);
    if (endp == argv[3]) {
    printf("%s is not a number. Go away!\n", argv[3]);
    return 1;
    }
    if (val < 1 || val > 10000000) {
    printf(
    "number of iterations %s is out of range "
    "[1..10,000,000]. Go away!\n", argv[3]);
    return 1;
    }
    n_iter = val;
    }

    long* an = malloc(n_iter*sizeof(long));
    if (an) {
    // prepare array of random arguments
    srand(1);
    for (size_t i = 0; i < n_iter; ++i)
    an[i] = (long)((((uint64_t)(rand() & 0x7fff) * dn) >> 15) + n);

    dummy = 0;
    unsigned long long t0 = _rdtsc();
    for (size_t i = 0; i < n_iter; ++i)
    dummy += fib(an[i]);
    unsigned long long t1 = _rdtsc();
    dummy = 0;

    printf(
    "n in range [%ld..%ld]. %zu iterations."
    " %llu clocks. %llu clocks/iter\n"
    ,n, m, n_iter, t1-t0, (t1-t0)/n_iter);

    free(an);
    } else {
    perror("malloc()");
    return 1;
    }

    return 0;
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Fri Dec 20 02:17:03 2024
    On Thu, 19 Dec 2024 00:40:32 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 01:33:42 +0200
    Michael S <[email protected]> wrote:

    I am pretty sure that better variant exists, but for tonight it's
    enough.

    Morning fibs.

    Recursive:

    struct pair_t {
    long long a,b;
    };

    // return .a=fib(n), .b=fib(n+1)
    static struct pair_t fib2(long n)
    {
    if (n <= 0) {
    struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
    return ret;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    long m = (n-1) >> 1;
    struct pair_t ret = fib2(m);
    long long a = ret.a, b = ret.b;
    long long c = a + b;
    if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
    ret.a = (a + c)*b;
    ret.b = b*b + c*c;
    } else { // (m,m+1) => (m*2+1,(m+1)*2)
    ret.a = a*a + b*b;
    ret.b = (a + c)*b;
    }
    return ret;
    }

    static long long fib(long n)
    {
    struct pair_t x = fib2(n-1);
    return x.b;
    }

    I may have inadvertently misled you. Here is a simple linear
    formulation that uses recursion:

    typedef unsigned long long NN;

    static NN fibonacci_3( NN, NN, unsigned );

    NN
    fibonacci( unsigned n ){
    return fibonacci_3( 1, 0, n );
    }

    NN
    fibonacci_3( NN a, NN b, unsigned n ){
    return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
    }

    Can you apply this idea to your logarithmic scheme?



    Not really. But I can cheat.
    Below is slightly modified iterative algorithm coded as recursion.

    static
    long long fib_3(long long a, long long b, unsigned long rev_n) {
    if (rev_n == 1)
    return b;

    long long c = a + b;
    if ((rev_n & 1) == 0)
    return fib_3(a*a+b*b, (a+c)*b, rev_n/2);
    else
    return fib_3((a+c)*b, b*b+c*c, rev_n/2);
    }

    static
    long long fib_bitrev(unsigned long a, unsigned long b) {
    return a==1 ? fib_3(0, 1, b) : fib_bitrev( a / 2, b * 2 + a % 2);
    }

    long long fib(long n) {
    return n <= 0 ? 0 : fib_bitrev(n, 1);
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Fri Dec 20 14:31:54 2024
    Michael S <[email protected]> writes:

    On Thu, 19 Dec 2024 00:57:20 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 09:45:38 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:
    [...]
    According to my measurement gear, in range 0 to 92 there are few
    points where [the code I posted] is faster than simple loop, but
    in majority of cases it is slower.

    I'm at a loss to understand how this could happen. In my own
    measurements, the code shown above runs faster than a simple loop
    in all cases above n > 12, more than twice as fast when n > 17,
    more than three times as fast when n > 42, and going up from
    there. What might account for these radically different results?

    May be, number of repetitions?
    I run the test only once. That gives relative advantage to smaller
    code which is less sensitive to cold ICache and to cold branch
    predictors.

    That's an interesting idea. Can you also run a measurement where
    the code is run inside loops? I think it would be instructive
    to compare the results under the two approaches.
    [...]

    I feel that running fib(n) with the same n in loop too unrealistic.
    So I decided to run, at least, with different values of n.

    I agree, that is a better choice for a measurement test load.

    Ended up spending about a hour just to build a test bench.

    The answer is - in a loop of more than dozen iterations your code is
    indeed faster. Esp. so for hundred or more iterations.

    I'm happy to learn that my instincts were validated here. And I
    also learned something, about the effects of code warmup. Very
    interesting.

    Here is my test bench. [...]

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.

    By the way, I send you an email earlier today. Did that get to
    you?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Fri Dec 20 14:59:07 2024
    Michael S <[email protected]> writes:

    On Thu, 19 Dec 2024 00:40:32 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Wed, 18 Dec 2024 01:33:42 +0200
    Michael S <[email protected]> wrote:

    [a recursive logarithmic formulation]

    I may have inadvertently misled you. Here is a simple linear
    formulation that uses recursion:

    typedef unsigned long long NN;

    static NN fibonacci_3( NN, NN, unsigned );

    NN
    fibonacci( unsigned n ){
    return fibonacci_3( 1, 0, n );
    }

    NN
    fibonacci_3( NN a, NN b, unsigned n ){
    return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
    }

    Can you apply this idea to your logarithmic scheme?

    Not really. But I can cheat.
    Below is slightly modified iterative algorithm coded as recursion.

    I don't think of your code below as cheating at all. It resembles
    an earlier recursive version that I did, and is pretty much the
    sort of thing I had in mind.

    static
    long long fib_3(long long a, long long b, unsigned long rev_n) {
    if (rev_n == 1)
    return b;

    long long c = a + b;
    if ((rev_n & 1) == 0)
    return fib_3(a*a+b*b, (a+c)*b, rev_n/2);
    else
    return fib_3((a+c)*b, b*b+c*c, rev_n/2);
    }

    The factoring with 'c' is a nice improvement. Thank you for
    that.

    static
    long long fib_bitrev(unsigned long a, unsigned long b) {
    return a==1 ? fib_3(0, 1, b) : fib_bitrev( a / 2, b * 2 + a % 2);
    }

    long long fib(long n) {
    return n <= 0 ? 0 : fib_bitrev(n, 1);
    }

    The intermediate function fib_bitrev() reverses the bits so they
    can be processed from high to low. I did that by passing two
    arguments, a mask and the value of n, and shifting the mask.
    Combining that with your 'c' expression factoring, I got this:

    typedef unsigned long long NN;

    static unsigned high_bit( unsigned );
    static NN lfib4( NN, NN, unsigned, unsigned );

    NN
    logarithmic_fibonacci( unsigned n ){
    return lfib4( 1, 0, high_bit( n ), n );
    }

    unsigned
    high_bit( unsigned n ){
    n |= n>>1;
    n |= n>>2;
    n |= n>>4;;
    n |= n>>8;;
    n |= n>>16;;
    return n ^ n>>1;
    }

    NN
    lfib4( NN a, NN b, unsigned m, unsigned n ){
    NN c = a+b;
    return
    m & n ? lfib4( (a+c)*b, b*b+c*c, m>>1, n ) :
    m ? lfib4( a*a+b*b, (a+c)*b, m>>1, n ) :
    /*****/ b;
    }

    I ran a python version to compute fibonacci( 10000000 ). This code
    ran 30% faster than the previous version, which I would be is
    almost entirely due to the expression factoring with 'c'. Great!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Sun Dec 22 00:14:56 2024
    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:


    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.


    Any x86-64 Linux should have _rdtsc() as inline function (or built-in)
    defined in x86intrin.h. No installation required.

    By the way, I send you an email earlier today. Did that get to
    you?

    Yes, I got it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Sun Dec 22 00:07:08 2024
    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:


    I ran a python version to compute fibonacci( 10000000 ). This code
    ran 30% faster than the previous version, which I would be is
    almost entirely due to the expression factoring with 'c'. Great!


    This one does fib(10M) in 89 msec on my very old home PC.

    #include "gmp.h"

    void fib(mpz_t rop, long n)
    {
    if (n <= 0) {
    mpz_set_ui(rop, 0);
    return;
    }

    // find MS bit
    unsigned long tmp = n;
    unsigned long bit = 1;
    while (tmp > 1) {
    bit += bit;
    tmp /= 2;
    }
    // for n > 0
    // fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
    // fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
    mpz_t a, b, aa, bb;
    mpz_init_set_si(a, 0);
    mpz_init_set_si(b, 1);
    mpz_init(aa);
    mpz_init(bb);
    while (bit > 1) {
    mpz_mul(aa, a, a);
    mpz_mul(bb, b, b);
    mpz_add(aa, aa, bb); // aa = a*a+b*b

    mpz_add(a, a, a);
    mpz_add(a, a, b);
    mpz_mul(bb, a, b); // bb = (a*2+b)*b

    bit /= 2;
    if (n & bit) {
    mpz_add(aa, aa, bb);
    mpz_swap(aa, bb);
    }
    mpz_swap(a, aa);
    mpz_swap(b, bb);
    }
    mpz_swap(rop, b);
    mpz_clear(a);
    mpz_clear(b);
    mpz_clear(aa);
    mpz_clear(bb);
    }

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From James Kuyper@21:1/5 to Michael S on Sat Dec 21 17:55:20 2024
    On 12/21/24 17:14, Michael S wrote:
    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.


    Any x86-64 Linux should have _rdtsc() as inline function (or built-in) defined in x86intrin.h. No installation required.

    On my x86-64 Linux system, that header (indirectly) declares __rdtsc(),
    but not _rdtsc.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to James Kuyper on Sun Dec 22 02:22:14 2024
    On Sat, 21 Dec 2024 17:55:20 -0500
    James Kuyper <[email protected]> wrote:

    On 12/21/24 17:14, Michael S wrote:
    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.


    Any x86-64 Linux should have _rdtsc() as inline function (or
    built-in) defined in x86intrin.h. No installation required.

    On my x86-64 Linux system, that header (indirectly) declares
    __rdtsc(), but not _rdtsc.



    Thank you.
    I'd guess I knew it once, but because I don't run microbenchmark on
    Linux all that often, had forgotten.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 10:33:57 2024
    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.

    Any x86-64 Linux should have _rdtsc() as inline function (or built-in) defined in x86intrin.h. No installation required.

    Okay. As it turns out knowing that doesn't help me, but thank
    you for the info. I'm taking this as a sign from heaven that
    I should be spending my energies somewhere other than measuring
    performance.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 10:25:59 2024
    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:

    I ran a python version to compute fibonacci( 10000000 ). This code
    ran 30% faster than the previous version, which I would be is
    almost entirely due to the expression factoring with 'c'. Great!

    This one does fib(10M) in 89 msec on my very old home PC.

    [code]

    You are the speed king!

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Dec 23 00:05:03 2024
    On Sun, 22 Dec 2024 10:33:57 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.

    Any x86-64 Linux should have _rdtsc() as inline function (or
    built-in) defined in x86intrin.h. No installation required.

    Okay. As it turns out knowing that doesn't help me, but thank
    you for the info. I'm taking this as a sign from heaven that
    I should be spending my energies somewhere other than measuring
    performance.


    Did you see correction of James Kuyper?
    Under Linux it is called __rdtsc().

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 17:17:41 2024
    Michael S <[email protected]> writes:

    On Sun, 22 Dec 2024 10:33:57 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:31:54 -0800
    Tim Rentsch <[email protected]> wrote:

    Thank you for that. Sadly I am not able to run it because my
    test environment is lacking _rdtsc(), and I haven't been able
    to find out how to install it.

    Any x86-64 Linux should have _rdtsc() as inline function (or
    built-in) defined in x86intrin.h. No installation required.

    Okay. As it turns out knowing that doesn't help me, but thank
    you for the info. I'm taking this as a sign from heaven that
    I should be spending my energies somewhere other than measuring
    performance.

    Did you see correction of James Kuyper?
    Under Linux it is called __rdtsc().

    Yes I saw that.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Mon Dec 23 17:47:08 2024
    On Sun, 22 Dec 2024 10:25:59 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:

    I ran a python version to compute fibonacci( 10000000 ). This code
    ran 30% faster than the previous version, which I would be is
    almost entirely due to the expression factoring with 'c'. Great!

    This one does fib(10M) in 89 msec on my very old home PC.

    [code]

    You are the speed king!

    But approximately the same code in python is MUCH slower.
    2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
    2.0 sec on 5.5 y.o. mall server.

    I don't really understand why.
    I expected that nearly all time is spend in multiplication of few huge
    numbers, probably over 75% of time in last couple of steps (6
    multiplications). And I was under impression that internally Python
    uses the same GMP library that I used in C. So I expected the same
    speed, more or less. May be, 1.5x slowdown because of less efficient
    memory handling in Python. 35x difference is a big surprise.


    Here is my python routine.

    def fib(n):
    if n < 1:
    return 0

    tmp = n
    msb = 1
    while tmp > 1:
    msb += msb
    tmp //= 2

    f0=0; f1=1;
    while msb > 1:
    ff0 = f0*f0 + f1*f1
    f1 = (f0+f0+f1)*f1
    f0 = ff0
    msb //= 2
    if msb & n:
    f0 = f1
    f1 += ff0
    return f1

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Tue Dec 24 05:21:42 2024
    Michael S <[email protected]> writes:

    On Sun, 22 Dec 2024 10:25:59 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:

    I ran a python version to compute fibonacci( 10000000 ). This code
    ran 30% faster than the previous version, which I would be is
    almost entirely due to the expression factoring with 'c'. Great!

    This one does fib(10M) in 89 msec on my very old home PC.

    [code]

    You are the speed king!

    But approximately the same code in python is MUCH slower.
    2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
    2.0 sec on 5.5 y.o. mall server.

    That matches my experience with doing this in python.

    I don't really understand why.
    I expected that nearly all time is spend in multiplication of few huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally Python
    uses the same GMP library that I used in C. So I expected the same
    speed, more or less. May be, 1.5x slowdown because of less efficient
    memory handling in Python. 35x difference is a big surprise.

    Yes, I have no explanation either. Your reasoning looks sound to
    me.

    Here is my python routine.

    def fib(n):
    if n < 1:
    return 0

    tmp = n
    msb = 1
    while tmp > 1:
    msb += msb
    tmp //= 2

    f0=0; f1=1;
    while msb > 1:
    ff0 = f0*f0 + f1*f1
    f1 = (f0+f0+f1)*f1
    f0 = ff0
    msb //= 2
    if msb & n:
    f0 = f1
    f1 += ff0
    return f1

    Here is my latest python code.

    def fibonacci( n ) :
    return ff2( 1, 0, high_mask( 1, n ), n )

    def high_mask( m, n ) :
    return m>>1 if m > n else high_mask( m<<1, n )

    def ff2( a, b, m, n ) :
    c = a+b
    if m & n : return ff2( (a+c)*b, b*b+c*c, m>>1, n )
    if m : return ff2( a*a+b*b, (a+c)*b, m>>1, n )
    return b

    Performance trials of these two definitions gave nearly identical
    results, within 0.4 percent.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Tim Rentsch on Wed Dec 25 11:05:05 2024
    On Tue, 24 Dec 2024 05:21:42 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 22 Dec 2024 10:25:59 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:

    I ran a python version to compute fibonacci( 10000000 ). This
    code ran 30% faster than the previous version, which I would be
    is almost entirely due to the expression factoring with 'c'.
    Great!

    This one does fib(10M) in 89 msec on my very old home PC.

    [code]

    You are the speed king!

    But approximately the same code in python is MUCH slower.
    2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
    2.0 sec on 5.5 y.o. mall server.

    That matches my experience with doing this in python.

    I don't really understand why.
    I expected that nearly all time is spend in multiplication of few
    huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally Python
    uses the same GMP library that I used in C. So I expected the same
    speed, more or less. May be, 1.5x slowdown because of less
    efficient memory handling in Python. 35x difference is a big
    surprise.

    Yes, I have no explanation either. Your reasoning looks sound to
    me.


    I did few more time measurements both with GMP and with Python. It
    clearly showed that they use not just different implementations of
    bignum multiplication, but different algorithms.
    GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
    slower than that.
    Python multiplication scale much worse, approximately as (N**1.55).

    And indeed after binging more I had seen the following discussion on
    reddit: https://www.reddit.com/r/Python/comments/7h62ul/python_largeinteger_computation_speed_a_comparison/

    A relevant quote:
    "Python seems to use two algorithms for bignum multiplication: "the
    O(N**2) school algorithm" and Karatsuba if both operands contain more
    than 70 decimal digits. (Your example reaches 70 decimal digits already
    after the 5th(?) squaring, and after the 22nd there are ~17million
    decimal digits(?). The wast majority of the time is spent on the last
    one or two multiplications i.e. Karatsuba.)

    Meanwhile GMP implements seven multiplication algorithms: The same(?)
    "school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
    select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of
    decimal digits, with e.g. FFT method being used after "somewhere
    between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
    it would probably be used for the relevant steps in your example."


    It seem that that threshold for FFT method mentioned in the quote
    applies to number of limbs (i.e. 64-bit groups of bits in our specific
    case) in the product rather than in the operands. Also it is possible
    that in the newer versions of GMP the threshold is a lower than it
    was 7 years ago.

    I wonder, why python uses its own thing instead of GMP.
    Incompatible licenses or plain NIH ?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Michael S@21:1/5 to Michael S on Wed Dec 25 13:33:08 2024
    On Wed, 25 Dec 2024 11:05:05 +0200
    Michael S <[email protected]> wrote:

    On Tue, 24 Dec 2024 05:21:42 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Sun, 22 Dec 2024 10:25:59 -0800
    Tim Rentsch <[email protected]> wrote:

    Michael S <[email protected]> writes:

    On Fri, 20 Dec 2024 14:59:07 -0800
    Tim Rentsch <[email protected]> wrote:

    I ran a python version to compute fibonacci( 10000000 ). This
    code ran 30% faster than the previous version, which I would be
    is almost entirely due to the expression factoring with 'c'.
    Great!

    This one does fib(10M) in 89 msec on my very old home PC.

    [code]

    You are the speed king!

    But approximately the same code in python is MUCH slower.
    2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
    2.0 sec on 5.5 y.o. mall server.

    That matches my experience with doing this in python.

    I don't really understand why.
    I expected that nearly all time is spend in multiplication of few
    huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally
    Python uses the same GMP library that I used in C. So I expected
    the same speed, more or less. May be, 1.5x slowdown because of
    less efficient memory handling in Python. 35x difference is a big surprise.

    Yes, I have no explanation either. Your reasoning looks sound to
    me.


    I did few more time measurements both with GMP and with Python. It
    clearly showed that they use not just different implementations of
    bignum multiplication, but different algorithms.
    GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
    slower than that.
    Python multiplication scale much worse, approximately as (N**1.55).

    And indeed after binging more I had seen the following discussion on
    reddit: https://www.reddit.com/r/Python/comments/7h62ul/python_largeinteger_computation_speed_a_comparison/

    A relevant quote:
    "Python seems to use two algorithms for bignum multiplication: "the
    O(N**2) school algorithm" and Karatsuba if both operands contain more
    than 70 decimal digits. (Your example reaches 70 decimal digits
    already after the 5th(?) squaring, and after the 22nd there are
    ~17million decimal digits(?). The wast majority of the time is spent
    on the last one or two multiplications i.e. Karatsuba.)

    Meanwhile GMP implements seven multiplication algorithms: The same(?) "school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
    select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of decimal digits, with e.g. FFT method being used after "somewhere
    between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
    it would probably be used for the relevant steps in your example."


    It seem that that threshold for FFT method mentioned in the quote
    applies to number of limbs (i.e. 64-bit groups of bits in our specific
    case) in the product rather than in the operands. Also it is possible
    that in the newer versions of GMP the threshold is a lower than it
    was 7 years ago.

    I wonder, why python uses its own thing instead of GMP.
    Incompatible licenses or plain NIH ?


    Forgot to mention.
    As said in the same redit thread, one can use GMP from python. It's
    pretty easy, just not the default.

    import gmpy2

    def fib(n):
    if n < 1:
    return 0

    tmp = n
    msb = 1
    while tmp > 1:
    msb += msb
    tmp //= 2
    f0=gmpy2.mpz(0); f1=gmpy2.mpz(1);
    while msb > 1:
    ff0 = f0*f0 + f1*f1
    f1 = (f0+f0+f1)*f1
    f0 = ff0
    msb //= 2
    if msb & n:
    f0 = f1
    f1 += ff0
    return f1


    For big values of n gmpy2-based variant calculates Fibbonaci numbers at approximately the same speed as C variant. For smaller numbers - not
    quite the same as C and for n < ~10,000 it is slower than default
    python math.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Thu Dec 26 18:54:59 2024
    Michael S <[email protected]> writes:

    [GMP in python?]

    Forgot to mention.
    As said in the same redit thread, one can use GMP from python. It's
    pretty easy, just not the default.

    [example]

    Thank you, that is good to know.

    For big values of n gmpy2-based variant calculates Fibbonaci numbers at approximately the same speed as C variant. For smaller numbers - not
    quite the same as C and for n < ~10,000 it is slower than default
    python math.

    So now we need a hybrid method where "small" numbers use default
    python algorithms and advance to GMP for larger numbers, with each
    of those two already being hybrids? A hybrid**2 algorithm? I like
    it. ;)

    By the way, in the last week or so I sent you two emails. Could I
    ask you to send a short email in response (assuming they arrived,
    which I expect they did)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Rentsch@21:1/5 to Michael S on Thu Dec 26 18:42:35 2024
    Michael S <[email protected]> writes:

    On Tue, 24 Dec 2024 05:21:42 -0800
    Tim Rentsch <[email protected]> wrote:

    [python runs much slower than GMP]

    Yes, I have no explanation either. Your reasoning looks sound to
    me.

    I did few more time measurements both with GMP and with Python. It
    clearly showed that they use not just different implementations of
    bignum multiplication, but different algorithms.
    GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
    slower than that.
    Python multiplication scale much worse, approximately as (N**1.55).

    That's consistent with using Karatsuba: log2( 3 ) =~ 1.585.

    And indeed after binging more I had seen the following discussion on
    reddit: [link]

    A relevant quote:
    "Python seems to use two algorithms for bignum multiplication: "the
    O(N**2) school algorithm" and Karatsuba if both operands contain more
    than 70 decimal digits. (Your example reaches 70 decimal digits already after the 5th(?) squaring, and after the 22nd there are ~17million
    decimal digits(?). The wast majority of the time is spent on the last
    one or two multiplications i.e. Karatsuba.)

    Meanwhile GMP implements seven multiplication algorithms: The same(?) "school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
    select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of decimal digits, with e.g. FFT method being used after "somewhere
    between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
    it would probably be used for the relevant steps in your example."

    That ratio suggests the limbs are 32 bits (ie, roughly 10 digits each).

    It seem that that threshold for FFT method mentioned in the quote
    applies to number of limbs (i.e. 64-bit groups of bits in our specific
    case) in the product rather than in the operands. Also it is possible
    that in the newer versions of GMP the threshold is a lower than it
    was 7 years ago.

    I have no intuition regarding whether a threshold would be chosen
    based on the sizes of the operands or the sizes of the product. For
    this particular problem it's only one iteration difference anyway.

    I wonder, why python uses its own thing instead of GMP.
    Incompatible licenses or plain NIH ?

    Perhaps (1) python doesn't want to be bound by the rather severe
    terms of a GNU license, or (2) historical inertia.

    Let me add that I appreciate your excellent investigation results.
    They reinforce my view that it was wise for me to cease the deep
    dives into performance issues. :)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)