Forum: >>> Magnum BBS <<<

constexpr is really very smart!

From Student Project@21:1/5 to All on Sun Dec 15 20:20:42 2024

The constexpr is really very smart because it can speed up algorithms
1000 times according to Dave, Microsoft retired engineer. He has proved it
by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like in his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Student Project on Mon Dec 16 11:28:08 2024

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up algorithms
1000 times according to Dave, Microsoft retired engineer. He has
proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like in
his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270
microseconds sound astonishingly slow for fib(35).

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

int main(int argz, char** argv)
{
if (argz < 2) {
printf("Go away!\n");
return 1;
}

char* endp;
long n = strtol(argv[1], &endp, 0);
if (endp == argv[1]) {
printf("%s is not a number. Go away!\n", argv[1]);
return 1;
}

unsigned long long t0 = _rdtsc();
long long f = fib(n);
unsigned long long t1 = _rdtsc();
printf("fib(%ld)=%lld. %lld clock cycles.\n", n, f, t1 - t0);
return 0;
}

$ gcc -s -O2 -Wall fib.c -o fib
$ ./fib 35
fib(35)=9227465. 212 clock cycles.

212 cycles on this 11 y.o. CPU means 62 nanoseconds. Even that time, I
would guess, is mostly a measurement overhead.
$ ./fib 92
fib(92)=7540113804746346429. 281 clock cycles.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Mon Dec 16 11:18:46 2024

On 16/12/2024 10:28, Michael S wrote:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up algorithms
1000 times according to Dave, Microsoft retired engineer. He has
proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like in
his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

There are a dozen different ways to calculate Fibonacci numbers, ranging
from very fast to very slow - but demonstrating different things. Even
your program is going to be very slow compared to just using φⁿ / √5.

My guess is that the OP is referring to some kind of naïve recursive
Fibonacci implementation. If the function is declared "constexpr" and
the values used are compile-time constants, then maybe the use of
"constexpr" leads to more of it being pre-calculated at compile time.

The OP is right that "constexpr" can be very "smart" - but I'm not sure
this is the best calculation for demonstrating that. However, I have
not watched the video either, and maybe it's well presented.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to David Brown on Mon Dec 16 13:23:15 2024

On Mon, 16 Dec 2024 11:18:46 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 10:28, Michael S wrote:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up
algorithms 1000 times according to Dave, Microsoft retired
engineer. He has proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like
in his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

There are a dozen different ways to calculate Fibonacci numbers,
ranging from very fast to very slow - but demonstrating different
things. Even your program is going to be very slow compared to just
using φⁿ / √5.

My news reader is not sure about equation above.
You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
It works well with double precision math up to n=75. After that it
would need high precision fp library, so probably [much] slower than
simple loop above at least up to n=92 where we run out of range of int64 result.

My guess is that the OP is referring to some kind of naïve recursive Fibonacci implementation.

Something like that ?
static long long slow_fib(long n)
{
if (n < 1)
return 0;
if (n == 1)
return 1;
return slow_fib(n-2)+slow_fib(n-1);
}

Yes, 270 usec could be considered fast relatively to that. slow_fib(35)
takes 20 msec on my old PC.

If the function is declared "constexpr"
and the values used are compile-time constants, then maybe the use of "constexpr" leads to more of it being pre-calculated at compile time.

But then, how long would it take to compile?
I'd guess for n=35 it's not too bad, but what about n=92?
Slow algorithm is a slow algorithm. Doing it in compile time can only exaggerate the slowness.

The OP is right that "constexpr" can be very "smart" - but I'm not
sure this is the best calculation for demonstrating that. However, I
have not watched the video either, and maybe it's well presented.

If it is video and it is about programming then it can not be good.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Mon Dec 16 13:43:11 2024

On 16/12/2024 12:23, Michael S wrote:

On Mon, 16 Dec 2024 11:18:46 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 10:28, Michael S wrote:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up
algorithms 1000 times according to Dave, Microsoft retired
engineer. He has proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like
in his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270
microseconds sound astonishingly slow for fib(35).

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

There are a dozen different ways to calculate Fibonacci numbers,
ranging from very fast to very slow - but demonstrating different
things. Even your program is going to be very slow compared to just
using φⁿ / √5.

My news reader is not sure about equation above.
You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))

Yes.

It works well with double precision math up to n=75. After that it
would need high precision fp library, so probably [much] slower than
simple loop above at least up to n=92 where we run out of range of int64 result.

Sure, there comes a limit to what works here. Doubles don't have the
precision range of 64-bit integers. (You may also see inaccuracies
build up with floating point even before the range of the mantissa
bits.) Similarly, there is a limit to how high you can get with 64-bit integers. But certainly multi-precision integer arithmetic will be a
lot faster than multi-precision floating point.

There's also a big difference if you want to generate the sequence, or
just the n'th number, or if you are only interested in the size of the
n'th number, or some other property.

My guess is that the OP is referring to some kind of naïve recursive
Fibonacci implementation.

Something like that ?
static long long slow_fib(long n)
{
if (n < 1)
return 0;
if (n == 1)
return 1;
return slow_fib(n-2)+slow_fib(n-1);
}

Yes, 270 usec could be considered fast relatively to that. slow_fib(35)
takes 20 msec on my old PC.

That's what I was guessing, yes. Calculating Fibonacci numbers is often
used as an example for showing how to write recursive functions with a structure like the one above, then looking at different techniques for improving them - such as using a recursive function that takes (n, a, b)
and returns (n - 1, b, a + b), or perhaps memoization. (You could
probably teach a term's worth of a Haskell course based entirely on formulations for calculating Fibonacci functions!)

It can also be an interesting exercise to look at the speed of
calculating Fibonacci numbers for different sizes. I expect your linear integer function will be fastest for small n, then powers of phi will be
faster for a bit, then you'd switch back to integers of progressively
larger but fixed sizes using the recursive formula for double n :

fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

As you get even bigger, you would then want to use fancier ways to do
the multiplication - I don't know if those can be used in a way that
interacts well with formulas for the Fibonacci numbers.

I guess that can be left as an exercise for the OP !

If the function is declared "constexpr"
and the values used are compile-time constants, then maybe the use of
"constexpr" leads to more of it being pre-calculated at compile time.

But then, how long would it take to compile?
I'd guess for n=35 it's not too bad, but what about n=92?
Slow algorithm is a slow algorithm. Doing it in compile time can only exaggerate the slowness.

My experience is that gcc will pre-calculate for a certain depth of
recursion here, then give up and generate run-time code. But you would
not have to pre-calculate or expand very far before it made a fair
difference to the run-time of slow_fib(35).

The OP is right that "constexpr" can be very "smart" - but I'm not
sure this is the best calculation for demonstrating that. However, I
have not watched the video either, and maybe it's well presented.

If it is video and it is about programming then it can not be good.

:-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to David Brown on Tue Dec 17 11:08:02 2024

On Mon, 16 Dec 2024 13:43:11 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 12:23, Michael S wrote:

On Mon, 16 Dec 2024 11:18:46 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 10:28, Michael S wrote:

My guess is that the OP is referring to some kind of naïve
recursive Fibonacci implementation.

Something like that ?
static long long slow_fib(long n)
{
if (n < 1)
return 0;
if (n == 1)
return 1;
return slow_fib(n-2)+slow_fib(n-1);
}

Yes, 270 usec could be considered fast relatively to that.
slow_fib(35) takes 20 msec on my old PC.

That's what I was guessing, yes. Calculating Fibonacci numbers is
often used as an example for showing how to write recursive functions
with a structure like the one above, then looking at different
techniques for improving them - such as using a recursive function
that takes (n, a, b) and returns (n - 1, b, a + b), or perhaps
memoization. (You could probably teach a term's worth of a Haskell
course based entirely on formulations for calculating Fibonacci
functions!)

It can also be an interesting exercise to look at the speed of
calculating Fibonacci numbers for different sizes. I expect your
linear integer function will be fastest for small n, then powers of
phi will be faster for a bit, then you'd switch back to integers of progressively larger but fixed sizes using the recursive formula for
double n :

fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

As you get even bigger, you would then want to use fancier ways to do
the multiplication - I don't know if those can be used in a way that interacts well with formulas for the Fibonacci numbers.

There exist precise methods of integer multiplication with complexity
of O(N*Log(N)) where N is number of digits. Personally, I never had a
need for them, but I believe that they work.

I guess that can be left as an exercise for the OP !

Since OP's ideas of "fast" are very humble, he is likely to be
satisfied with the very first step above naive:

static long long less_slow_fib(long n)
{
if (n < 3)
return n < 1 ? 0 : 1;
return less_slow_fib(n-2)*2+less_slow_fib(n-3);
}

On my old PC is calculates fib(35) in 24 usec. That is more than
10 times faster than his idea of fast.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Michael S on Tue Dec 17 11:10:45 2024

On 17/12/2024 10:08, Michael S wrote:

On Mon, 16 Dec 2024 13:43:11 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 12:23, Michael S wrote:

On Mon, 16 Dec 2024 11:18:46 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 10:28, Michael S wrote:

My guess is that the OP is referring to some kind of naïve
recursive Fibonacci implementation.

Something like that ?
static long long slow_fib(long n)
{
if (n < 1)
return 0;
if (n == 1)
return 1;
return slow_fib(n-2)+slow_fib(n-1);
}

Yes, 270 usec could be considered fast relatively to that.
slow_fib(35) takes 20 msec on my old PC.

That's what I was guessing, yes. Calculating Fibonacci numbers is
often used as an example for showing how to write recursive functions
with a structure like the one above, then looking at different
techniques for improving them - such as using a recursive function
that takes (n, a, b) and returns (n - 1, b, a + b), or perhaps
memoization. (You could probably teach a term's worth of a Haskell
course based entirely on formulations for calculating Fibonacci
functions!)

It can also be an interesting exercise to look at the speed of
calculating Fibonacci numbers for different sizes. I expect your
linear integer function will be fastest for small n, then powers of
phi will be faster for a bit, then you'd switch back to integers of
progressively larger but fixed sizes using the recursive formula for
double n :

fib(2 * n - 1) = fib(n) ** 2 + fib(n - 1) ** 2
fib(2 * n) = (2 * fib(n - 1) + fib(n)) * fib(n)

As you get even bigger, you would then want to use fancier ways to do
the multiplication - I don't know if those can be used in a way that
interacts well with formulas for the Fibonacci numbers.

There exist precise methods of integer multiplication with complexity
of O(N*Log(N)) where N is number of digits. Personally, I never had a
need for them, but I believe that they work.

I've never needed them either. In fact, in the case of the O(n . log n) algorithm, /nobody/ has needed it - the algorithm is only an improvement
over higher big-O complexity algorithms for at least 2 ** 1729 ** 12
digits. That means you need 39 decimal digits just to express the
number of digits in the numbers you are going to multiply. Maybe the
factor of 1729 will be reduced somewhat in the future, but I really
don't see it being added to the GMP library any time soon :-)

<https://mattermodeling.stackexchange.com/questions/1355/did-the-2019-discovery-of-on-logn-multiplication-have-a-practical-outcome>

There are other algorithms that have practical uses when dealing with
/really/ big numbers. But I think it will be very rare to need anything
beyond Karatsuba, which is reasonably easy to understand - it splits the numbers into two halves, does a bit of re-arrangement, and ends up with
3 smaller multiplications and some additions and subtractions instead of
the traditional 4 smaller multiplications.

(You'll find Karatsuba in RSA algorithms, but no algorithms beyond that.)

I guess that can be left as an exercise for the OP !

Since OP's ideas of "fast" are very humble, he is likely to be
satisfied with the very first step above naive:

static long long less_slow_fib(long n)
{
if (n < 3)
return n < 1 ? 0 : 1;
return less_slow_fib(n-2)*2+less_slow_fib(n-3);
}

On my old PC is calculates fib(35) in 24 usec. That is more than
10 times faster than his idea of fast.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Tue Dec 17 12:54:19 2024

Michael S <[email protected]> writes:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up
algorithms 1000 times according to Dave, Microsoft retired
engineer. He has proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like
in his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

Slow for the problem, but not slow for the algorithm. The
point of the video was to compare relative speeds of an
algorithm under two different compilation schemes (with and
without constexpr), not to compare absolute speeds to solve
the problem of computing fibonacci numbers.

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

My fastest fibonacci code uses recursion. :)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 01:33:42 2024

On Tue, 17 Dec 2024 12:54:19 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up
algorithms 1000 times according to Dave, Microsoft retired
engineer. He has proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like
in his example. It was almost instant at the blink of the eyes.

D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 270
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 257
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 171
D:\CmdLine\C_Cpp\Chrono02>program
Fibonacci_c: 9227465
Time Taken: 176

Amazing.

I didn't see the video (I never see that type of videos), but 270 microseconds sound astonishingly slow for fib(35).

Slow for the problem, but not slow for the algorithm. The
point of the video was to compare relative speeds of an
algorithm under two different compilation schemes (with and
without constexpr), not to compare absolute speeds to solve
the problem of computing fibonacci numbers.

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

From BigO perspective this code looks like it's still O(n) so no better
than simple loop.
According to my measurement gear, in range 0 to 92 there are few points
where it is faster than simple loop, but in majority of cases it is
slower.

My fastest fibonacci code uses recursion. :)

There is 1st atempts at recursive code with better than exponential BigO

static long long fib(long n)
{
if (n < 3)
return n < 1 ? 0 : 1;
long a = n/2;
long b = n-a;
return fib(a)*fib(b-1)+fib(a+1)*fib(b);
}

O(n*n) - not exponential but still pretty bad.
A small obvious modification (specialization) turns it into O(n):

static long long fib(long n)
{
if (n < 3)
return n < 1 ? 0 : 1;
long a = n/2;
long b = n-a;
if (a == b) {
long long f0 = fib(a-1);
long long f1 = fib(a);
long long f2 = f0+f1;
return (f0+f2)*f1;
} else { // b == a+1
long long f0 = fib(a);
long long f1 = fib(a+1);
return f0*f0+f1*f1;
}
}

Still, despite being the same as yours and as a simple loop in BigO
sence, it is several times slower in absolute speed.

I am pretty sure that better variant exists, but for tonight it's
enough.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Student Project@21:1/5 to Keith Thompson on Wed Dec 18 04:45:20 2024

On 17/12/2024 20:20, Keith Thompson wrote:

Student Project <[email protected]d> writes:

The constexpr is really very smart because it can speed up algorithms
1000 times according to Dave, Microsoft retired engineer. He has proved it >> by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like in his
example. It was almost instant at the blink of the eyes.

Can you post the relevant code?

[...]

Code in the boxed area below: (The purpose is to demonstrate constexpr NOT the recursive algorithm. The video is very clear about this).

It is the same code as in the video. G++ gives you the best result after changing one line to:

/*constexpr*/ int result_c = fibonacci_c(num);

G++ result is (multiple runs) - All timings in milliseconds: D:\CmdLine\C_Cpp\Chrono05>program
Fibonacci_c 9227465
Time taken: 9
Fibonacci 9227465
Time taken: 289

D:\CmdLine\C_Cpp\Chrono05>program
Fibonacci_c 9227465
Time taken: 4
Fibonacci 9227465
Time taken: 276

D:\CmdLine\C_Cpp\Chrono05>program
Fibonacci_c 9227465
Time taken: 8
Fibonacci 9227465
Time taken: 284

D:\CmdLine\C_Cpp\Chrono05>program
Fibonacci_c 9227465
Time taken: 5
Fibonacci 9227465
Time taken: 276

clang++ and Visual studio are the slowest. I don't know why.

<+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>
#include <iostream>
#include <chrono>

int fibonacci(int n)
{
if (n <= 1)
return n;
return fibonacci(n - 2) + fibonacci(n - 1);
}

constexpr int fibonacci_c(int n)
{
if (n <= 1)
return n;
return fibonacci_c(n - 2) + fibonacci_c(n - 1);
}

int main(void)
{
// using namespace std::literals::chrono_literals;
auto start = std::chrono::high_resolution_clock::now();
constexpr int num = 35;
/*constexpr*/ int result_c = fibonacci_c(num);

std::cout << "Fibonacci_c " << result_c << "\n";
std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
- start).count() << "\n";

start = std::chrono::high_resolution_clock::now();
int result = fibonacci(num);
std::cout << "Fibonacci " << result << "\n";
std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
- start).count() << "\n";

}

<+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Michael S on Wed Dec 18 13:51:31 2024

On Wed, 18 Dec 2024 01:33:42 +0200
Michael S <[email protected]> wrote:

I am pretty sure that better variant exists, but for tonight it's
enough.

Morning fibs.

Recursive:

struct pair_t {
long long a,b;
};

// return .a=fib(n), .b=fib(n+1)
static struct pair_t fib2(long n)
{
if (n <= 0) {
struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
return ret;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long m = (n-1) >> 1;
struct pair_t ret = fib2(m);
long long a = ret.a, b = ret.b;
long long c = a + b;
if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
ret.a = (a + c)*b;
ret.b = b*b + c*c;
} else { // (m,m+1) => (m*2+1,(m+1)*2)
ret.a = a*a + b*b;
ret.b = (a + c)*b;
}
return ret;
}

static long long fib(long n)
{
struct pair_t x = fib2(n-1);
return x.b;
}

Iterative:

static long long fib(long n)
{
if (n <= 0)
return 0;
// find MS bit
unsigned long tmp = n;
unsigned long bit = 1;
while (tmp > 1) {
bit += bit;
tmp /= 2;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long long a = 0, b = 1;
while (bit > 1) {
long long c = a + b; // fib(n+1)
bit /= 2;
if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
c += a;
a = a*a + b*b;
b = c*b;
} else { // (n-1,n) => (n*2,n*2+1)
a = (a + c)*b;
b = b*b + c*c;
}
}
return b;
}

Both variants above are O(log(n)).
In practice for n in range 0 to 92 my measurement gear does not detect differences in speed vs simple loop. If anything, simple loop is a
little faster.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 03:26:15 2024

Michael S <[email protected]> writes:

On Mon, 16 Dec 2024 11:18:46 +0100
David Brown <[email protected]> wrote:

On 16/12/2024 10:28, Michael S wrote:

On Sun, 15 Dec 2024 20:20:42 +0000
Student Project <[email protected]d> wrote:

The constexpr is really very smart because it can speed up
algorithms 1000 times according to Dave, Microsoft retired
engineer. He has proved it by creating this video:

<https://youtu.be/8-VZoXn8f9U?si=iy1UimoWcaLG31Xi>

On my computer it took 270 microseconds to calculate fib(35) like
in his example. It was almost instant at the blink of the eyes.

[...]

Amazing.

I didn't see the video (I never see that type of videos), but 270
microseconds sound astonishingly slow for fib(35).

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

There are a dozen different ways to calculate Fibonacci numbers,
ranging from very fast to very slow - but demonstrating different
things. Even your program is going to be very slow compared to
just using
[garbled; presumably ceil(pow((1+sqrt(5))/2, n)/sqrt(5))]

No, it won't. The well-known formula for fibonacci numbers is
mathematically exact but numerically poor; it gets wrong answers
even for outputs well under 64 bits. It isn't all that fast either.

My news reader is not sure about equation above.
You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
It works well with double precision math up to nu. After that it
would need high precision fp library, so probably [much] slower
than simple loop above at least up to n' where we run out of range
of int64 result.

Right. The simple loop is faster.

My guess is that the OP is referring to some kind of naive
recursive Fibonacci implementation.

Something like that ?
static long long slow_fib(long n)
{
if (n < 1)
return 0;
if (n == 1)
return 1;
return slow_fib(n-2)+slow_fib(n-1);
}

Yes, 270 usec could be considered fast relatively to that.
slow_fib(35) takes 20 msec on my old PC.

If the function is declared "constexpr" and the values used are
compile-time constants, then maybe the use of "constexpr" leads
to more of it being pre-calculated at compile time.

But then, how long would it take to compile?
I'd guess for n5 it's not too bad, but what about n'?
Slow algorithm is a slow algorithm. Doing it in compile time can
only exaggerate the slowness.

That isn't right. A computation done at compile time isn't
obliged to use the same algorithm; it is obliged only to get the
same results that the code would. There is a simple way to take
advantage of a function like the one shown above, but constexpr,
to get a pre-compiled result much faster than simply plodding
through a simulation. (Hint: it's a pure function whose output
depends only on its input, and only a small number of inputs are
needed for the overall result.)

The OP is right that "constexpr" can be very "smart" - but I'm
not sure this is the best calculation for demonstrating that.
However, I have not watched the video either, and maybe it's well
presented.

If it is video and it is about programming then it can not be good.

I think the video is not bad. Its primary purpose is to talk about
uses of constexpr that are not completely simple. I've already
played around with such uses so the video didn't have much to say to
me, but for someone who hasn't done that there is a good chance it
will help them go up the learning curve. At this stage of my
programming education I don't get much benefit from tutorial-type
content, but I think there are plenty of people who do.

If I may express a personal view, while I don't begrudge the
existence of tutorial material, I do find it irksome that there
is very little material that addresses a more advanced audience,
people who have a lot of background and experience but simply
lack knowledge in a particular area. I'm tired of slogging
through mush that talks to its readers like they are all still
learning the basics. Being ignorant is not the same as being
uneducated (and really not the same as being stupid.. ick).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 14:39:22 2024

On Wed, 18 Dec 2024 03:26:15 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

My news reader is not sure about equation above.
You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
It works well with double precision math up to nu. After that it
would need high precision fp library, so probably [much] slower
than simple loop above at least up to n' where we run out of range
of int64 result.

It seems like your newsreader don't like combinations of characters that
I considered benign. Like n=75 and n=92
In case you want to see what was an original, it is here: https://www.novabbs.com/devel/article-flat.php?id=6042&group=comp.lang.c%2B%2B#6042

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Student Project on Wed Dec 18 14:17:40 2024

On Wed, 18 Dec 2024 04:45:20 +0000
Student Project <[email protected]d> wrote:

<+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++>
#include <iostream>
#include <chrono>

int fibonacci(int n)
{
if (n <= 1)
return n;
return fibonacci(n - 2) + fibonacci(n - 1);
}

constexpr int fibonacci_c(int n)
{
if (n <= 1)
return n;
return fibonacci_c(n - 2) + fibonacci_c(n - 1);
}

int main(void)
{
// using namespace std::literals::chrono_literals;
auto start = std::chrono::high_resolution_clock::now();
constexpr int num = 35;
/*constexpr*/ int result_c = fibonacci_c(num);

std::cout << "Fibonacci_c " << result_c << "\n";
std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
- start).count() << "\n";

start = std::chrono::high_resolution_clock::now();
int result = fibonacci(num);
std::cout << "Fibonacci " << result << "\n";
std::cout << "Time taken: " << std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::high_resolution_clock::now()
- start).count() << "\n";

}

<+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

It is likely that at least part of the time you are measuring cout print
time instead of calculations time you are interested in.
Also, if you are using gcc on Windows then there is a significant chance
that implementation of high_resolution_clock is broken. steady_clock is
more reliable.

Try following:
volatile dummy = 0;
auto t00 = std::chrono::steady_clock::now();// first call can be slow
dummy = 1;
auto t0 = std::chrono::steady_clock::now();
dummy = 2;
auto result_c = fibonacci_c(num);
dummy = 3;
auto t1 = std::chrono::steady_clock::now();
dummy = 4;
auto result = fibonacci(num);
dummy = 5;
auto t2 = std::chrono::steady_clock::now();
dummy = 3;

etc...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 07:42:39 2024

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 03:26:15 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

My news reader is not sure about equation above.
You probably meant ceil(pow((1+sqrt(5))/2, n)/sqrt(5))
It works well with double precision math up to nu. After that it
would need high precision fp library, so probably [much] slower
than simple loop above at least up to n' where we run out of range
of int64 result.

It seems like your newsreader don't like combinations of characters that
I considered benign. Like n=75 and n=92
In case you want to see what was an original, it is here:
[link]

Yes, that is peculiar. I should look into that. Thank you
for bringing it to my attention.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Wed Dec 18 09:45:38 2024

Michael S <[email protected]> writes:

On Tue, 17 Dec 2024 12:54:19 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

[...]

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

From BigO perspective this code looks like it's still O(n) so no
better than simple loop.

It is O(n) but it has a smaller constant.

According to my measurement gear, in range 0 to 92 there are few
points where it is faster than simple loop, but in majority of
cases it is slower.

I'm at a loss to understand how this could happen. In my own
measurements, the code shown above runs faster than a simple loop in
all cases above n > 12, more than twice as fast when n > 17, more
than three times as fast when n > 42, and going up from there. What
might account for these radically different results?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Wed Dec 18 22:00:06 2024

On Wed, 18 Dec 2024 09:45:38 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Tue, 17 Dec 2024 12:54:19 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

[...]

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

From BigO perspective this code looks like it's still O(n) so no
better than simple loop.

It is O(n) but it has a smaller constant.

According to my measurement gear, in range 0 to 92 there are few
points where it is faster than simple loop, but in majority of
cases it is slower.

I'm at a loss to understand how this could happen. In my own
measurements, the code shown above runs faster than a simple loop in
all cases above n > 12, more than twice as fast when n > 17, more
than three times as fast when n > 42, and going up from there. What
might account for these radically different results?

May be, number of repetitions?
I run the test only once. That gives relative advantage to smaller
code which is less sensitive to cold ICache and to cold branch
predictors.
Also, because I am running it only once, my test is not so good in
seeing differences between various "good" algorithms.
But I feel that running test once is more realistic.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Thu Dec 19 00:40:32 2024

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 01:33:42 +0200
Michael S <[email protected]> wrote:

I am pretty sure that better variant exists, but for tonight it's
enough.

Morning fibs.

Recursive:

struct pair_t {
long long a,b;
};

// return .a=fib(n), .b=fib(n+1)
static struct pair_t fib2(long n)
{
if (n <= 0) {
struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
return ret;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long m = (n-1) >> 1;
struct pair_t ret = fib2(m);
long long a = ret.a, b = ret.b;
long long c = a + b;
if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
ret.a = (a + c)*b;
ret.b = b*b + c*c;
} else { // (m,m+1) => (m*2+1,(m+1)*2)
ret.a = a*a + b*b;
ret.b = (a + c)*b;
}
return ret;
}

static long long fib(long n)
{
struct pair_t x = fib2(n-1);
return x.b;
}

I may have inadvertently misled you. Here is a simple linear
formulation that uses recursion:

typedef unsigned long long NN;

static NN fibonacci_3( NN, NN, unsigned );

NN
fibonacci( unsigned n ){
return fibonacci_3( 1, 0, n );
}

NN
fibonacci_3( NN a, NN b, unsigned n ){
return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
}

Can you apply this idea to your logarithmic scheme?

Iterative:

static long long fib(long n)
{
if (n <= 0)
return 0;
// find MS bit
unsigned long tmp = n;
unsigned long bit = 1;
while (tmp > 1) {
bit += bit;
tmp /= 2;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long long a = 0, b = 1;
while (bit > 1) {
long long c = a + b; // fib(n+1)
bit /= 2;
if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
c += a;
a = a*a + b*b;
b = c*b;
} else { // (n-1,n) => (n*2,n*2+1)
a = (a + c)*b;
b = b*b + c*c;
}
}
return b;
}

Both variants above are O(log(n)).
In practice for n in range 0 to 92 my measurement gear does not detect differences in speed vs simple loop. If anything, simple loop is a
little faster.

Yes, the log(n) can be beaten by simpler schemes for small n.

I wrote a (recursive) log(n) fibonacci in python. It was able
to compute fibonacci( 10000000 ) in just under 3 seconds.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Thu Dec 19 00:57:20 2024

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 09:45:38 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Tue, 17 Dec 2024 12:54:19 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

[...]

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

From BigO perspective this code looks like it's still O(n) so no
better than simple loop.

It is O(n) but it has a smaller constant.

According to my measurement gear, in range 0 to 92 there are few
points where it is faster than simple loop, but in majority of
cases it is slower.

I'm at a loss to understand how this could happen. In my own
measurements, the code shown above runs faster than a simple loop in
all cases above n > 12, more than twice as fast when n > 17, more
than three times as fast when n > 42, and going up from there. What
might account for these radically different results?

May be, number of repetitions?
I run the test only once. That gives relative advantage to smaller
code which is less sensitive to cold ICache and to cold branch
predictors.

That's an interesting idea. Can you also run a measurement where
the code is run inside loops? I think it would be instructive
to compare the results under the two approaches.

(The server I use seems not to have the necessary support to
measure very fast code segments that are run only once.)

Also, because I am running it only once, my test is not so good in
seeing differences between various "good" algorithms.
But I feel that running test once is more realistic.

I think an argument could be made that repeatedly running the code
in a loop gives a more relevant result. Any short code segment that
is run only a handful of times will never have a significant impact
on overall program performance. It's better to optimize under the
assumption that the code will be "hot": if it is hot then the
optimization will be worthwhile, and if it isn't hot then the loss
will be way down in the noise and not worth worrying about.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Student Project@21:1/5 to Michael S on Thu Dec 19 19:14:19 2024

On 18/12/2024 11:51, Michael S wrote:

Iterative:

static long long fib(long n)
{
if (n <= 0)
return 0;
// find MS bit
unsigned long tmp = n;
unsigned long bit = 1;
while (tmp > 1) {
bit += bit;
tmp /= 2;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long long a = 0, b = 1;
while (bit > 1) {
long long c = a + b; // fib(n+1)
bit /= 2;
if ((n & bit)==0) { // (n-1,n) => (n*2-1,n*2)
c += a;
a = a*a + b*b;
b = c*b;
} else { // (n-1,n) => (n*2,n*2+1)
a = (a + c)*b;
b = b*b + c*c;
}
}
return b;
}

Here is my test result of this iterative function:

D:\CmdLine\C_Cpp\Chrono06>program
Testing fibonacci function
fib(92) = 7540113804746346429
Time taken by fib: 0 microseconds

D:\CmdLine\C_Cpp\Chrono06>program
Testing fibonacci function
fib(92) = 7540113804746346429
Time taken by fib: 0 microseconds

D:\CmdLine\C_Cpp\Chrono06>program
Testing fibonacci function
fib(92) = 7540113804746346429
Time taken by fib: 0 microseconds

D:\CmdLine\C_Cpp\Chrono06>program
Testing fibonacci function
fib(92) = 7540113804746346429
Time taken by fib: 0 microseconds

The full program to run in G++ is:

<+++++++++++++++++++++++++++++++++++++++++++++>

#include <iostream>
#include <chrono>

using namespace std;
using namespace std::chrono;

long long fib(long n)
{
if (n <= 0)
return 0;
// find MS bit
unsigned long tmp = n;
unsigned long bit = 1;
while (tmp > 1)
{
bit += bit;
tmp /= 2;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long long a = 0, b = 1;
while (bit > 1)
{
long long c = a + b; // fib(n+1)
bit /= 2;
if ((n & bit) == 0)
{ // (n-1,n) => (n*2-1,n*2)
c += a;
a = a * a + b * b;
b = c * b;
}
else
{ // (n-1,n) => (n*2,n*2+1)
a = (a + c) * b;
b = b * b + c * c;
}
}
return b;
}

int main(void)
{
cout << "Testing fibonacci function\n";
auto start = high_resolution_clock::now();
constexpr int num = 92;
long long result = fib (num);
auto stop = high_resolution_clock::now();

auto duration = duration_cast<microseconds>(stop - start);
cout << "fib(" << num << ") = " << result << "\n";
cout << "Time taken by fib: " << duration.count() << " microseconds\n";

}

<+++++++++++++++++++++++++++++++++++++++++++++>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Thu Dec 19 23:45:49 2024

On Thu, 19 Dec 2024 00:57:20 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 09:45:38 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Tue, 17 Dec 2024 12:54:19 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

[...]

#include <stdio.h>
#include <stdlib.h>
#include <intrin.h>

static long long fib(long n)
{
if (fib <= 0)
return 0;
long long f0 = 0, f1 = 1;
for (long i = 1; i < n; ++i) {
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
return f1;
}

Here is my second fastest fibonacci calculation code (for
relatively small inputs):

typedef long unsigned long ULL;

ULL
fibonacci( unsigned n ){
ULL b = n&1, a = b^1;

if( n & 2 ) a += b, b += a;
if( n & 4 ) a += b, b += a, a += b, b += a;
if( n & 8 ){
ULL na = 13*a+21*b, nb = 21*a+34*b;
a = na, b = nb;
}

n >>= 4;
while( n-- ){
ULL na = 610*a + 987*b, nb = 987*a + 1597*b;
a = na, b = nb;
}

return b;
}

From BigO perspective this code looks like it's still O(n) so no
better than simple loop.

It is O(n) but it has a smaller constant.

According to my measurement gear, in range 0 to 92 there are few
points where it is faster than simple loop, but in majority of
cases it is slower.

I'm at a loss to understand how this could happen. In my own
measurements, the code shown above runs faster than a simple loop
in all cases above n > 12, more than twice as fast when n > 17,
more than three times as fast when n > 42, and going up from
there. What might account for these radically different results?

May be, number of repetitions?
I run the test only once. That gives relative advantage to smaller
code which is less sensitive to cold ICache and to cold branch
predictors.

That's an interesting idea. Can you also run a measurement where
the code is run inside loops? I think it would be instructive
to compare the results under the two approaches.

(The server I use seems not to have the necessary support to
measure very fast code segments that are run only once.)

Also, because I am running it only once, my test is not so good in
seeing differences between various "good" algorithms.
But I feel that running test once is more realistic.

I think an argument could be made that repeatedly running the code
in a loop gives a more relevant result. Any short code segment that
is run only a handful of times will never have a significant impact
on overall program performance. It's better to optimize under the
assumption that the code will be "hot": if it is hot then the
optimization will be worthwhile, and if it isn't hot then the loss
will be way down in the noise and not worth worrying about.

I feel that running fib(n) with the same n in loop too unrealistic.
So I decided to run, at least, with different values of n.
Ended up spending about a hour just to build a test bench.

The answer is - in a loop of more than dozen iterations your code is
indeed faster. Esp. so for hundred or more iterations.

Here is my test bench.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <intrin.h>

long long fib(long n);
volatile long long dummy;

int main(int argz, char** argv)
{
if (argz < 2) {
printf("Go away!\n");
return 1;
}

char* endp;
long n = strtol(argv[1], &endp, 0);
if (endp == argv[1]) {
printf("%s is not a number. Go away!\n", argv[1]);
return 1;
}

if (argz == 2) {
unsigned long long t0 = _rdtsc();
long long f = fib(n);
unsigned long long t1 = _rdtsc();
printf("fib(%ld)=%lld. %lld clock cycles.\n", n, f, t1 - t0);
return 0;
}

// argz > 2
if (argv[2][0]=='t') {
long long f0 = 0, f1 = 1;
for (long i = 0; i <= n; ++i) {
long long res = fib(i);
if (res != f0) {
printf("Mistake at n = %ld. %lld instead of %lld\n"
, i, res, f0);
return 1;
}
long long f2 = f0 + f1;
f0 = f1;
f1 = f2;
}
printf("o.k.\n");
return 0;
}

long m = strtol(argv[2], &endp, 0);
if (endp == argv[2]) {
printf("%s is not a number. Go away!\n", argv[2]);
return 1;
}

if (n > m) {
long tmp = n; n = m; m = tmp;
}

unsigned long dn = m - n + 1;
if (dn > 1000000) {
printf(
"Range [%ld..%ld] is wider than 1M. Unsupported. I am sorry.\n"
, n, m);
return 1;
}

size_t n_iter = 10;
if (argz > 3) { // get non-default number of iterations
unsigned long long val = strtoull(argv[3], &endp, 0);
if (endp == argv[3]) {
printf("%s is not a number. Go away!\n", argv[3]);
return 1;
}
if (val < 1 || val > 10000000) {
printf(
"number of iterations %s is out of range "
"[1..10,000,000]. Go away!\n", argv[3]);
return 1;
}
n_iter = val;
}

long* an = malloc(n_iter*sizeof(long));
if (an) {
// prepare array of random arguments
srand(1);
for (size_t i = 0; i < n_iter; ++i)
an[i] = (long)((((uint64_t)(rand() & 0x7fff) * dn) >> 15) + n);

dummy = 0;
unsigned long long t0 = _rdtsc();
for (size_t i = 0; i < n_iter; ++i)
dummy += fib(an[i]);
unsigned long long t1 = _rdtsc();
dummy = 0;

printf(
"n in range [%ld..%ld]. %zu iterations."
" %llu clocks. %llu clocks/iter\n"
,n, m, n_iter, t1-t0, (t1-t0)/n_iter);

free(an);
} else {
perror("malloc()");
return 1;
}

return 0;
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Fri Dec 20 02:17:03 2024

On Thu, 19 Dec 2024 00:40:32 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 01:33:42 +0200
Michael S <[email protected]> wrote:

I am pretty sure that better variant exists, but for tonight it's
enough.

Morning fibs.

Recursive:

struct pair_t {
long long a,b;
};

// return .a=fib(n), .b=fib(n+1)
static struct pair_t fib2(long n)
{
if (n <= 0) {
struct pair_t ret = { .a = 0, .b = n==0 ? 1 : 0 };
return ret;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
long m = (n-1) >> 1;
struct pair_t ret = fib2(m);
long long a = ret.a, b = ret.b;
long long c = a + b;
if ((n & 1)==0) { // (m,m+1) => ((m+1)*2,(m+1)*2*2+1)
ret.a = (a + c)*b;
ret.b = b*b + c*c;
} else { // (m,m+1) => (m*2+1,(m+1)*2)
ret.a = a*a + b*b;
ret.b = (a + c)*b;
}
return ret;
}

static long long fib(long n)
{
struct pair_t x = fib2(n-1);
return x.b;
}

I may have inadvertently misled you. Here is a simple linear
formulation that uses recursion:

typedef unsigned long long NN;

static NN fibonacci_3( NN, NN, unsigned );

NN
fibonacci( unsigned n ){
return fibonacci_3( 1, 0, n );
}

NN
fibonacci_3( NN a, NN b, unsigned n ){
return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
}

Can you apply this idea to your logarithmic scheme?

Not really. But I can cheat.
Below is slightly modified iterative algorithm coded as recursion.

static
long long fib_3(long long a, long long b, unsigned long rev_n) {
if (rev_n == 1)
return b;

long long c = a + b;
if ((rev_n & 1) == 0)
return fib_3(a*a+b*b, (a+c)*b, rev_n/2);
else
return fib_3((a+c)*b, b*b+c*c, rev_n/2);
}

static
long long fib_bitrev(unsigned long a, unsigned long b) {
return a==1 ? fib_3(0, 1, b) : fib_bitrev( a / 2, b * 2 + a % 2);
}

long long fib(long n) {
return n <= 0 ? 0 : fib_bitrev(n, 1);
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Fri Dec 20 14:31:54 2024

Michael S <[email protected]> writes:

On Thu, 19 Dec 2024 00:57:20 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 09:45:38 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

[...]

According to my measurement gear, in range 0 to 92 there are few
points where [the code I posted] is faster than simple loop, but
in majority of cases it is slower.

I'm at a loss to understand how this could happen. In my own
measurements, the code shown above runs faster than a simple loop
in all cases above n > 12, more than twice as fast when n > 17,
more than three times as fast when n > 42, and going up from
there. What might account for these radically different results?

May be, number of repetitions?
I run the test only once. That gives relative advantage to smaller
code which is less sensitive to cold ICache and to cold branch
predictors.

That's an interesting idea. Can you also run a measurement where
the code is run inside loops? I think it would be instructive
to compare the results under the two approaches.

[...]

I feel that running fib(n) with the same n in loop too unrealistic.
So I decided to run, at least, with different values of n.

I agree, that is a better choice for a measurement test load.

Ended up spending about a hour just to build a test bench.

The answer is - in a loop of more than dozen iterations your code is
indeed faster. Esp. so for hundred or more iterations.

I'm happy to learn that my instincts were validated here. And I
also learned something, about the effects of code warmup. Very
interesting.

Here is my test bench. [...]

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

By the way, I send you an email earlier today. Did that get to
you?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Fri Dec 20 14:59:07 2024

Michael S <[email protected]> writes:

On Thu, 19 Dec 2024 00:40:32 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Wed, 18 Dec 2024 01:33:42 +0200
Michael S <[email protected]> wrote:

[a recursive logarithmic formulation]

I may have inadvertently misled you. Here is a simple linear
formulation that uses recursion:

typedef unsigned long long NN;

static NN fibonacci_3( NN, NN, unsigned );

NN
fibonacci( unsigned n ){
return fibonacci_3( 1, 0, n );
}

NN
fibonacci_3( NN a, NN b, unsigned n ){
return n == 0 ? b : fibonacci_3( b, a+b, n-1 );
}

Can you apply this idea to your logarithmic scheme?

Not really. But I can cheat.
Below is slightly modified iterative algorithm coded as recursion.

I don't think of your code below as cheating at all. It resembles
an earlier recursive version that I did, and is pretty much the
sort of thing I had in mind.

static
long long fib_3(long long a, long long b, unsigned long rev_n) {
if (rev_n == 1)
return b;

long long c = a + b;
if ((rev_n & 1) == 0)
return fib_3(a*a+b*b, (a+c)*b, rev_n/2);
else
return fib_3((a+c)*b, b*b+c*c, rev_n/2);
}

The factoring with 'c' is a nice improvement. Thank you for
that.

static
long long fib_bitrev(unsigned long a, unsigned long b) {
return a==1 ? fib_3(0, 1, b) : fib_bitrev( a / 2, b * 2 + a % 2);
}

long long fib(long n) {
return n <= 0 ? 0 : fib_bitrev(n, 1);
}

The intermediate function fib_bitrev() reverses the bits so they
can be processed from high to low. I did that by passing two
arguments, a mask and the value of n, and shifting the mask.
Combining that with your 'c' expression factoring, I got this:

typedef unsigned long long NN;

static unsigned high_bit( unsigned );
static NN lfib4( NN, NN, unsigned, unsigned );

NN
logarithmic_fibonacci( unsigned n ){
return lfib4( 1, 0, high_bit( n ), n );
}

unsigned
high_bit( unsigned n ){
n |= n>>1;
n |= n>>2;
n |= n>>4;;
n |= n>>8;;
n |= n>>16;;
return n ^ n>>1;
}

NN
lfib4( NN a, NN b, unsigned m, unsigned n ){
NN c = a+b;
return
m & n ? lfib4( (a+c)*b, b*b+c*c, m>>1, n ) :
m ? lfib4( a*a+b*b, (a+c)*b, m>>1, n ) :
/*****/ b;
}

I ran a python version to compute fibonacci( 10000000 ). This code
ran 30% faster than the previous version, which I would be is
almost entirely due to the expression factoring with 'c'. Great!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Sun Dec 22 00:14:56 2024

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or built-in)
defined in x86intrin.h. No installation required.

By the way, I send you an email earlier today. Did that get to
you?

Yes, I got it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Sun Dec 22 00:07:08 2024

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This code
ran 30% faster than the previous version, which I would be is
almost entirely due to the expression factoring with 'c'. Great!

This one does fib(10M) in 89 msec on my very old home PC.

#include "gmp.h"

void fib(mpz_t rop, long n)
{
if (n <= 0) {
mpz_set_ui(rop, 0);
return;
}

// find MS bit
unsigned long tmp = n;
unsigned long bit = 1;
while (tmp > 1) {
bit += bit;
tmp /= 2;
}
// for n > 0
// fib(n*2) = (fib(n-1)+fib(n+1))*fib(n)
// fib(n*2+1) = fib(n)*fib(n)+fib(n+1)*fib(n+1)
mpz_t a, b, aa, bb;
mpz_init_set_si(a, 0);
mpz_init_set_si(b, 1);
mpz_init(aa);
mpz_init(bb);
while (bit > 1) {
mpz_mul(aa, a, a);
mpz_mul(bb, b, b);
mpz_add(aa, aa, bb); // aa = a*a+b*b

mpz_add(a, a, a);
mpz_add(a, a, b);
mpz_mul(bb, a, b); // bb = (a*2+b)*b

bit /= 2;
if (n & bit) {
mpz_add(aa, aa, bb);
mpz_swap(aa, bb);
}
mpz_swap(a, aa);
mpz_swap(b, bb);
}
mpz_swap(rop, b);
mpz_clear(a);
mpz_clear(b);
mpz_clear(aa);
mpz_clear(bb);
}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From James Kuyper@21:1/5 to Michael S on Sat Dec 21 17:55:20 2024

On 12/21/24 17:14, Michael S wrote:

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or built-in) defined in x86intrin.h. No installation required.

On my x86-64 Linux system, that header (indirectly) declares __rdtsc(),
but not _rdtsc.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to James Kuyper on Sun Dec 22 02:22:14 2024

On Sat, 21 Dec 2024 17:55:20 -0500
James Kuyper <[email protected]> wrote:

On 12/21/24 17:14, Michael S wrote:

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or
built-in) defined in x86intrin.h. No installation required.

On my x86-64 Linux system, that header (indirectly) declares
__rdtsc(), but not _rdtsc.

Thank you.
I'd guess I knew it once, but because I don't run microbenchmark on
Linux all that often, had forgotten.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 10:33:57 2024

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or built-in) defined in x86intrin.h. No installation required.

Okay. As it turns out knowing that doesn't help me, but thank
you for the info. I'm taking this as a sign from heaven that
I should be spending my energies somewhere other than measuring
performance.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 10:25:59 2024

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This code
ran 30% faster than the previous version, which I would be is
almost entirely due to the expression factoring with 'c'. Great!

This one does fib(10M) in 89 msec on my very old home PC.

[code]

You are the speed king!

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Mon Dec 23 00:05:03 2024

On Sun, 22 Dec 2024 10:33:57 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or
built-in) defined in x86intrin.h. No installation required.

Okay. As it turns out knowing that doesn't help me, but thank
you for the info. I'm taking this as a sign from heaven that
I should be spending my energies somewhere other than measuring
performance.

Did you see correction of James Kuyper?
Under Linux it is called __rdtsc().

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Sun Dec 22 17:17:41 2024

Michael S <[email protected]> writes:

On Sun, 22 Dec 2024 10:33:57 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:31:54 -0800
Tim Rentsch <[email protected]> wrote:

Thank you for that. Sadly I am not able to run it because my
test environment is lacking _rdtsc(), and I haven't been able
to find out how to install it.

Any x86-64 Linux should have _rdtsc() as inline function (or
built-in) defined in x86intrin.h. No installation required.

Okay. As it turns out knowing that doesn't help me, but thank
you for the info. I'm taking this as a sign from heaven that
I should be spending my energies somewhere other than measuring
performance.

Did you see correction of James Kuyper?
Under Linux it is called __rdtsc().

Yes I saw that.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Mon Dec 23 17:47:08 2024

On Sun, 22 Dec 2024 10:25:59 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This code
ran 30% faster than the previous version, which I would be is
almost entirely due to the expression factoring with 'c'. Great!

This one does fib(10M) in 89 msec on my very old home PC.

[code]

You are the speed king!

But approximately the same code in python is MUCH slower.
2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
2.0 sec on 5.5 y.o. mall server.

I don't really understand why.
I expected that nearly all time is spend in multiplication of few huge
numbers, probably over 75% of time in last couple of steps (6
multiplications). And I was under impression that internally Python
uses the same GMP library that I used in C. So I expected the same
speed, more or less. May be, 1.5x slowdown because of less efficient
memory handling in Python. 35x difference is a big surprise.

Here is my python routine.

def fib(n):
if n < 1:
return 0

tmp = n
msb = 1
while tmp > 1:
msb += msb
tmp //= 2

f0=0; f1=1;
while msb > 1:
ff0 = f0*f0 + f1*f1
f1 = (f0+f0+f1)*f1
f0 = ff0
msb //= 2
if msb & n:
f0 = f1
f1 += ff0
return f1

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Tue Dec 24 05:21:42 2024

Michael S <[email protected]> writes:

On Sun, 22 Dec 2024 10:25:59 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This code
ran 30% faster than the previous version, which I would be is
almost entirely due to the expression factoring with 'c'. Great!

This one does fib(10M) in 89 msec on my very old home PC.

[code]

You are the speed king!

But approximately the same code in python is MUCH slower.
2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
2.0 sec on 5.5 y.o. mall server.

That matches my experience with doing this in python.

I don't really understand why.
I expected that nearly all time is spend in multiplication of few huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally Python
uses the same GMP library that I used in C. So I expected the same
speed, more or less. May be, 1.5x slowdown because of less efficient
memory handling in Python. 35x difference is a big surprise.

Yes, I have no explanation either. Your reasoning looks sound to
me.

Here is my python routine.

def fib(n):
if n < 1:
return 0

tmp = n
msb = 1
while tmp > 1:
msb += msb
tmp //= 2

f0=0; f1=1;
while msb > 1:
ff0 = f0*f0 + f1*f1
f1 = (f0+f0+f1)*f1
f0 = ff0
msb //= 2
if msb & n:
f0 = f1
f1 += ff0
return f1

Here is my latest python code.

def fibonacci( n ) :
return ff2( 1, 0, high_mask( 1, n ), n )

def high_mask( m, n ) :
return m>>1 if m > n else high_mask( m<<1, n )

def ff2( a, b, m, n ) :
c = a+b
if m & n : return ff2( (a+c)*b, b*b+c*c, m>>1, n )
if m : return ff2( a*a+b*b, (a+c)*b, m>>1, n )
return b

Performance trials of these two definitions gave nearly identical
results, within 0.4 percent.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Tim Rentsch on Wed Dec 25 11:05:05 2024

On Tue, 24 Dec 2024 05:21:42 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 22 Dec 2024 10:25:59 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This
code ran 30% faster than the previous version, which I would be
is almost entirely due to the expression factoring with 'c'.
Great!

This one does fib(10M) in 89 msec on my very old home PC.

[code]

You are the speed king!

But approximately the same code in python is MUCH slower.
2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
2.0 sec on 5.5 y.o. mall server.

That matches my experience with doing this in python.

I don't really understand why.
I expected that nearly all time is spend in multiplication of few
huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally Python
uses the same GMP library that I used in C. So I expected the same
speed, more or less. May be, 1.5x slowdown because of less
efficient memory handling in Python. 35x difference is a big
surprise.

Yes, I have no explanation either. Your reasoning looks sound to
me.

I did few more time measurements both with GMP and with Python. It
clearly showed that they use not just different implementations of
bignum multiplication, but different algorithms.
GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
slower than that.
Python multiplication scale much worse, approximately as (N**1.55).

And indeed after binging more I had seen the following discussion on
reddit: https://www.reddit.com/r/Python/comments/7h62ul/python_largeinteger_computation_speed_a_comparison/

A relevant quote:
"Python seems to use two algorithms for bignum multiplication: "the
O(N**2) school algorithm" and Karatsuba if both operands contain more
than 70 decimal digits. (Your example reaches 70 decimal digits already
after the 5th(?) squaring, and after the 22nd there are ~17million
decimal digits(?). The wast majority of the time is spent on the last
one or two multiplications i.e. Karatsuba.)

Meanwhile GMP implements seven multiplication algorithms: The same(?)
"school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of
decimal digits, with e.g. FFT method being used after "somewhere
between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
it would probably be used for the relevant steps in your example."

It seem that that threshold for FFT method mentioned in the quote
applies to number of limbs (i.e. 64-bit groups of bits in our specific
case) in the product rather than in the operands. Also it is possible
that in the newer versions of GMP the threshold is a lower than it
was 7 years ago.

I wonder, why python uses its own thing instead of GMP.
Incompatible licenses or plain NIH ?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Michael S on Wed Dec 25 13:33:08 2024

On Wed, 25 Dec 2024 11:05:05 +0200
Michael S <[email protected]> wrote:

On Tue, 24 Dec 2024 05:21:42 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Sun, 22 Dec 2024 10:25:59 -0800
Tim Rentsch <[email protected]> wrote:

Michael S <[email protected]> writes:

On Fri, 20 Dec 2024 14:59:07 -0800
Tim Rentsch <[email protected]> wrote:

I ran a python version to compute fibonacci( 10000000 ). This
code ran 30% faster than the previous version, which I would be
is almost entirely due to the expression factoring with 'c'.
Great!

This one does fib(10M) in 89 msec on my very old home PC.

[code]

You are the speed king!

But approximately the same code in python is MUCH slower.
2.9 sec on 11 y.o. PC that still serves me as a desktop at work.
2.0 sec on 5.5 y.o. mall server.

That matches my experience with doing this in python.

I don't really understand why.
I expected that nearly all time is spend in multiplication of few
huge numbers, probably over 75% of time in last couple of steps (6 multiplications). And I was under impression that internally
Python uses the same GMP library that I used in C. So I expected
the same speed, more or less. May be, 1.5x slowdown because of
less efficient memory handling in Python. 35x difference is a big surprise.

Yes, I have no explanation either. Your reasoning looks sound to
me.

I did few more time measurements both with GMP and with Python. It
clearly showed that they use not just different implementations of
bignum multiplication, but different algorithms.
GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
slower than that.
Python multiplication scale much worse, approximately as (N**1.55).

And indeed after binging more I had seen the following discussion on
reddit: https://www.reddit.com/r/Python/comments/7h62ul/python_largeinteger_computation_speed_a_comparison/

A relevant quote:
"Python seems to use two algorithms for bignum multiplication: "the
O(N**2) school algorithm" and Karatsuba if both operands contain more
than 70 decimal digits. (Your example reaches 70 decimal digits
already after the 5th(?) squaring, and after the 22nd there are
~17million decimal digits(?). The wast majority of the time is spent
on the last one or two multiplications i.e. Karatsuba.)

Meanwhile GMP implements seven multiplication algorithms: The same(?) "school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of decimal digits, with e.g. FFT method being used after "somewhere
between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
it would probably be used for the relevant steps in your example."

It seem that that threshold for FFT method mentioned in the quote
applies to number of limbs (i.e. 64-bit groups of bits in our specific
case) in the product rather than in the operands. Also it is possible
that in the newer versions of GMP the threshold is a lower than it
was 7 years ago.

I wonder, why python uses its own thing instead of GMP.
Incompatible licenses or plain NIH ?

Forgot to mention.
As said in the same redit thread, one can use GMP from python. It's
pretty easy, just not the default.

import gmpy2

def fib(n):
if n < 1:
return 0

tmp = n
msb = 1
while tmp > 1:
msb += msb
tmp //= 2
f0=gmpy2.mpz(0); f1=gmpy2.mpz(1);
while msb > 1:
ff0 = f0*f0 + f1*f1
f1 = (f0+f0+f1)*f1
f0 = ff0
msb //= 2
if msb & n:
f0 = f1
f1 += ff0
return f1

For big values of n gmpy2-based variant calculates Fibbonaci numbers at approximately the same speed as C variant. For smaller numbers - not
quite the same as C and for n < ~10,000 it is slower than default
python math.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Thu Dec 26 18:54:59 2024

Michael S <[email protected]> writes:

[GMP in python?]

Forgot to mention.
As said in the same redit thread, one can use GMP from python. It's
pretty easy, just not the default.

[example]

Thank you, that is good to know.

For big values of n gmpy2-based variant calculates Fibbonaci numbers at approximately the same speed as C variant. For smaller numbers - not
quite the same as C and for n < ~10,000 it is slower than default
python math.

So now we need a hybrid method where "small" numbers use default
python algorithms and advance to GMP for larger numbers, with each
of those two already being hybrids? A hybrid**2 algorithm? I like
it. ;)

By the way, in the last week or so I sent you two emails. Could I
ask you to send a short email in response (assuming they arrived,
which I expect they did)?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to Michael S on Thu Dec 26 18:42:35 2024

Michael S <[email protected]> writes:

On Tue, 24 Dec 2024 05:21:42 -0800
Tim Rentsch <[email protected]> wrote:

[python runs much slower than GMP]

Yes, I have no explanation either. Your reasoning looks sound to
me.

I did few more time measurements both with GMP and with Python. It
clearly showed that they use not just different implementations of
bignum multiplication, but different algorithms.
GMP multiplication for operands in range of above ~200,000 bits is approximately O(N*log(N)) with number of bits, or just a little bit
slower than that.
Python multiplication scale much worse, approximately as (N**1.55).

That's consistent with using Karatsuba: log2( 3 ) =~ 1.585.

And indeed after binging more I had seen the following discussion on
reddit: [link]

A relevant quote:
"Python seems to use two algorithms for bignum multiplication: "the
O(N**2) school algorithm" and Karatsuba if both operands contain more
than 70 decimal digits. (Your example reaches 70 decimal digits already after the 5th(?) squaring, and after the 22nd there are ~17million
decimal digits(?). The wast majority of the time is spent on the last
one or two multiplications i.e. Karatsuba.)

Meanwhile GMP implements seven multiplication algorithms: The same(?) "school" and Karatsuba algorithms, plus various Toom and an FFT method (Strassen), but I'm unclear on what the exact thresholds are that
select these. It seems these thresholds are in "limbs" ("the part of a multi-precision number that fits in a single machine word") instead of decimal digits, with e.g. FFT method being used after "somewhere
between 3000 and 10000 limbs", i.e. >~30k - 100k(?) decimal digits, so
it would probably be used for the relevant steps in your example."

That ratio suggests the limbs are 32 bits (ie, roughly 10 digits each).

It seem that that threshold for FFT method mentioned in the quote
applies to number of limbs (i.e. 64-bit groups of bits in our specific
case) in the product rather than in the operands. Also it is possible
that in the newer versions of GMP the threshold is a lower than it
was 7 years ago.

I have no intuition regarding whether a threshold would be chosen
based on the sizes of the operands or the sizes of the product. For
this particular problem it's only one iteration difference anyway.

I wonder, why python uses its own thing instead of GMP.
Incompatible licenses or plain NIH ?

Perhaps (1) python doesn't want to be bound by the rather severe
terms of a GNU license, or (2) historical inertia.

Let me add that I appreciate your excellent investigation results.
They reinforce my view that it was wise for me to cease the deep
dives into performance issues. :)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Thu Jul 30 02:32:09 2026
  from Madison, Nc via Telnet
- Bob Worm
  Wed Jul 29 22:26:45 2026
  from Wales, Uk via Telnet
- Zenobyte
  Wed Jul 29 21:08:05 2026
  from San Juan, Pr via Telnet
- Guest
  Wed Jul 29 14:26:54 2026
  from Balkans via Telnet
- Rixter
  Wed Jul 29 14:18:17 2026
  from Madison, Nc via Telnet
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	83:34:14
Calls:	12,451
Calls today:	1
Files:	15,194
Messages:	6,537,782

constexpr is really very smart!

Who's Online

Recent Visitors

System Info