Forum: >>> Magnum BBS <<<

Re: Most performant way to have "r % 2 ? -1.0 : 1.0"

From Marcel Mueller@21:1/5 to All on Wed May 1 16:11:14 2024

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

Marcel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to Bonita Montero on Wed May 1 15:39:49 2024

Bonita Montero <[email protected]> writes:

Am 01.05.2024 um 16:11 schrieb Marcel Mueller:

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

Marcel

Looks simpler, but your code is one instruction more:

my code:
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0

your code:
andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0

the pxor is "executed" in the fetch stage, so adds no latency
(it is effectively a NOP).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Wed May 1 18:46:31 2024

On 01/05/2024 17:14, Bonita Montero wrote:

Am 01.05.2024 um 17:05 schrieb Bonita Montero:

Am 01.05.2024 um 16:11 schrieb Marcel Mueller:

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

Marcel

Looks simpler, but your code is one instruction more:

my code:
         movabsq $4607182418800017408, %rax
         salq    $63, %rdi
         orq     %rax, %rdi
         movq    %rdi, %xmm0

your code:
         andl    $1, %edi
         pxor    %xmm0, %xmm0
         negl    %edi
         orl     $1, %edi
         cvtsi2sdl       %edi, %xmm0

And I've seen that cvtsi2sdl has a six cylcle latency on my
Zen4-CPU whereas the movq rdi, xmm0 only takes one clock cycle.

That's the kind of thing that is at least vaguely relevant here. The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.

But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
on some particular machine. No one cares if a program runs a few
nanoseconds faster - certainly not enough to accept a re-write like
yours (or Marcel's).

What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
Perhaps use it in a big vector calculation or DSP algorithm. Write the
full code out. Time it on a real machine, first with the original
expression (from the subject line), then with your version, then with
Marcel's. Compile with a good optimising compiler (such gcc or clang,
not MSVC) with vectorisation appropriate for the processor you are using
(use "-march=native"). Time the differences.

My guess is that in real code, the difference will be negligible, but
that the original simple and clear expression has the edge because it
lets the compiler do more optimisations.

Feel free to prove my guess wrong - it is, after all, just a guess.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From red floyd@21:1/5 to David Brown on Wed May 1 18:23:15 2024

On 5/1/2024 9:46 AM, David Brown wrote:

On 01/05/2024 17:14, Bonita Montero wrote:

Am 01.05.2024 um 17:05 schrieb Bonita Montero:

Am 01.05.2024 um 16:11 schrieb Marcel Mueller:

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

[redacted]

What the heck, I'll throw this in:

-1 + ((r & 1) << 1)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From red floyd@21:1/5 to red floyd on Wed May 1 18:25:19 2024

On 5/1/2024 6:23 PM, red floyd wrote:

On 5/1/2024 9:46 AM, David Brown wrote:

On 01/05/2024 17:14, Bonita Montero wrote:

Am 01.05.2024 um 17:05 schrieb Bonita Montero:

Am 01.05.2024 um 16:11 schrieb Marcel Mueller:

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

[redacted]

What the heck, I'll throw this in:

-1 + ((r & 1) << 1)

Oops. Sign reversal.

1 - ((r & 1) << 1)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Thu May 2 16:04:37 2024

On 02/05/2024 05:49, Bonita Montero wrote:

Am 02.05.2024 um 03:23 schrieb red floyd:

On 5/1/2024 9:46 AM, David Brown wrote:

On 01/05/2024 17:14, Bonita Montero wrote:

Am 01.05.2024 um 17:05 schrieb Bonita Montero:

Am 01.05.2024 um 16:11 schrieb Marcel Mueller:

Am 30.04.24 um 11:34 schrieb Bonita Montero:

bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

There is no need for conditional:

-((int)r&1) | 1

[redacted]

What the heck, I'll throw this in:

-1 + ((r & 1) << 1)

The problem with that is that the result is converted to a floating
point value which is a rather slow operation.

That's only a problem if it is a problem in real code. You haven't
shown any situation where it would be an issue.

In common real-world situations where you have a variable that is set to
-1 or 1 (integer or floating point), the next thing you will do is use
that to multiply other data which you will then accumulate. If the
-1/+1 expression is written simply, the compiler may optimise it away
entirely - turning things into a conditional code that either adds or
subtracts the later data. With your mess, it won't do that - it will
have to use real multiplication and miss out on optimisation opportunities.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Thu May 2 16:00:36 2024

On 01/05/2024 19:12, Bonita Montero wrote:

Am 01.05.2024 um 18:46 schrieb David Brown:

That's the kind of thing that is at least vaguely relevant here. The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.

Every instruction for my code has a single cycle latency.

Surely you know that is not sufficient to justify claims of performance?
Latency is certainly a factor, but so is scheduling, parallel
execution, bandwidth for instruction caches and queues, pipeline hazards
and result forwarding, and a dozen other factors.

Marcel's code
has single cycle latencies except for the last instruction, which takes
seven cyles on my Zen4-CPU. Take the numbers from Agner org, they're
similar for all modern x86-incarnations.

But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,

Optimized code is often less readable.

That makes it bad code. Sometimes it is acceptable because you need the
very fastest results for a piece of code, but that is very rare. This
is why I am asking you to justify the performance of your code in
practice - justify that it actually /is/ faster, and justify that it is /usefully/ faster. Without that, it is smart-arse code, not smart code.

(If you are just playing around for fun and the challenge of making such
code sequences, that's fine.)

What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.

My code takes three clock cycles on all modern x86-CPUs.

That is almost as far from answering my question as it is possible to get.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Marcel Mueller@21:1/5 to All on Fri May 3 07:32:20 2024

Am 01.05.24 um 17:14 schrieb Bonita Montero:

And I've seen that cvtsi2sdl has a six cylcle latency on my
Zen4-CPU whereas the movq rdi, xmm0 only takes one clock cycle.

You are right, int to float conversions are not that optimized. And they require a dynamic shift operation for the integral logarithm of base 2.

So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.

Marcel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Fri May 3 11:16:49 2024

On 02/05/2024 16:25, Bonita Montero wrote:

Am 02.05.2024 um 16:00 schrieb David Brown:

Surely you know that is not sufficient to justify claims of
performance? Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.

Latency is the time until there's a result. Less instructions let other instructions more likely occupy free execution units. So this can be considered in the reduced way I did.

I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.

That makes it bad code. ...

Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.

Absolutely not. A Bubblesort is easier to read than a qucksort but no
one woul chose Bubblesort for readability. And for me such tricks are readable but they overburden you.

/You/ were the one who said your code was less readable. (No one else
had to say it because it is so obvious.)

You posted an expression that is an incomprehensible alternative to a comprehensible but apparently useless expression. You are unable to demonstrate that your re-write is actually faster in practice - you can
merely show that compilers generate fewer instructions for it than a
different variation of the original expression. You can't give any uses
of the expression, nor any measurements showing it is better in
practice. Without seeing the effect in context, your code is utterly
pointless and worse than useless.

It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.

If you want to show anyone that your expression is actually a good idea,
you know what you have to do to demonstrate it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Fri May 3 14:38:58 2024

On 03/05/2024 13:35, Bonita Montero wrote:

Am 03.05.2024 um 11:16 schrieb David Brown:

On 02/05/2024 16:25, Bonita Montero wrote:

Am 02.05.2024 um 16:00 schrieb David Brown:

Surely you know that is not sufficient to justify claims of
performance? Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.

Latency is the time until there's a result. Less instructions let other
instructions more likely occupy free execution units. So this can be
considered in the reduced way I did.

I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.

Latency is the time until there's a result, and this matters.

We agree it matters - your mistake is thinking it is the only thing that matters.

That makes it bad code. ...

Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.

It's currently the fastest solution on x86 for this purpose.

The only purpose you've demonstrated so far is for your bragging rights
as the biggest smart-arse in the group. (I know English is not your
first language, but to be clear, this is not a compliment.)

So this isn't bad code.

Of course it is bad code - it is terrible, and would be rejected by any
code review for normal use. Obscure micro-optimisations like that are
worth considering in two situations - in compilers as code they might
generate for optimisations, and buried deep in very high performance
libraries as part of critical code. You are doing neither.

Absolutely not. A Bubblesort is easier to read than a qucksort but no
one woul chose Bubblesort for readability. And for me such tricks are
readable but they overburden you.

/You/ were the one who said your code was less readable. (No one else
had to say it because it is so obvious.)

You posted an expression that is an incomprehensible alternative to a
comprehensible but apparently useless expression. You are unable to
demonstrate that your re-write is actually faster in practice - you
can merely show that compilers generate fewer instructions for it than
a different variation of the original expression. ...

Compilers don't generate faster code for that. Try it on godbolt.

Godbolt does not show how fast code is any more than you do.

Do you understand the importance of context in code performance? Do you understand about the interaction between different parts of code? Or
that compilers are not dumb translators handling single expressions in isolation? Have you any idea at all about what is /relevant/ in real programming?

I realise that it is cool and clever to have figured out a trick for
this expression. I've nothing against that - it's fun, and fun is good.
If that's all you are trying to show, then that's fine. But if you
want to claim it is "good" code, or useful, or makes a difference, then
you have to show that. At the moment, what you have is of no use in programming.

You can't give any uses of the expression, ...

I've one use for that and I extracted the code.

Then show the rest of the code, along with the timing comparisons I
suggested.

It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.

You're overburdened with that style of code and you want to discuss
it away.

If you want to show anyone that your expression is actually a good
idea, you know what you have to do to demonstrate it.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Marcel Mueller@21:1/5 to All on Fri May 3 16:46:28 2024

Am 03.05.24 um 08:37 schrieb Bonita Montero:

Am 03.05.2024 um 07:32 schrieb Marcel Mueller:

So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.

There are currently no conditional moves for floating point values
with x86.

Interesting. The GPU of a Raspberry Pi Model 1 can do Job. :-)

Marcel

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Brown@21:1/5 to Bonita Montero on Fri May 3 17:16:59 2024

On 03/05/2024 14:51, Bonita Montero wrote:

Am 03.05.2024 um 14:38 schrieb David Brown:

We agree it matters - your mistake is thinking it is the only thing
that matters.

Of course it is bad code - it is terrible, and would be rejected by
any code review for normal use. ...

Look at the glibc math-Code, it's doing such tricks all over the lib.

There is not much point discussing anything with you until you learn the
basics of communication.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	63:24:23
Calls:	12,446
Calls today:	1
Files:	15,192
Messages:	6,537,529

Re: Most performant way to have "r % 2 ? -1.0 : 1.0"

Who's Online

Recent Visitors

System Info