bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
On 01.05.2024 at 16:11, Marcel Mueller wrote:
On 30.04.24 at 11:34, Bonita Montero wrote:
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
There is no need for conditional:
-((int)r&1) | 1
Marcel
Looks simpler, but your code needs one more instruction:
my code:
movabsq $4607182418800017408, %rax
salq $63, %rdi
orq %rax, %rdi
movq %rdi, %xmm0
your code:
andl $1, %edi
pxor %xmm0, %xmm0
negl %edi
orl $1, %edi
cvtsi2sdl %edi, %xmm0
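For readers comparing the two expressions, a minimal sketch follows. It assumes `r` is an unsigned 64-bit integer whose lowest bit selects the sign (the thread never states the type), and it uses `std::memcpy` as a stand-in for C++20 `std::bit_cast` so that it also builds with older standards. The function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "64-bit double assumed");

// 0x3FF in the exponent field with a zero mantissa is the bit pattern of 1.0;
// it is the constant loaded by movabsq $4607182418800017408 above.
static_assert((0x3FFull << 52) == 4607182418800017408ull, "bit pattern of 1.0");

// Assumption: the lowest bit of r selects the sign (odd -> -1.0, even -> +1.0).
inline double sign_from_bits(std::uint64_t r) {
    // Shifting by 63 discards every bit of r except bit 0,
    // which ends up in the sign bit of the double.
    const std::uint64_t bits = (r << 63) | (0x3FFull << 52);
    double d;
    std::memcpy(&d, &bits, sizeof d);  // same effect as std::bit_cast<double>(bits)
    return d;
}

// Marcel's integer expression, converted to double afterwards
// (the conversion is the cvtsi2sd step in the listing above).
inline double sign_from_int(std::uint64_t r) {
    return static_cast<double>(-static_cast<int>(r & 1u) | 1);
}
```

Both functions return the same value for every `r`; they differ only in whether the result is built as a bit pattern or computed as an integer and then converted.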
On 01.05.2024 at 17:05, Bonita Montero wrote:
And I've seen that cvtsi2sdl has a six-cycle latency on my
Zen4 CPU, whereas movq %rdi, %xmm0 takes only one clock cycle.
On 5/1/2024 9:46 AM, David Brown wrote:
On 01/05/2024 17:14, Bonita Montero wrote:[redacted]
On 01.05.2024 at 17:05, Bonita Montero wrote:
On 01.05.2024 at 16:11, Marcel Mueller wrote:
On 30.04.24 at 11:34, Bonita Montero wrote:
bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )
There is no need for conditional:
-((int)r&1) | 1
What the heck, I'll throw this in:
-1 + ((r & 1) << 1)
On 02.05.2024 at 03:23, red floyd wrote:
What the heck, I'll throw this in:
-1 + ((r & 1) << 1)
The problem with that is that the result is converted to a floating
point value which is a rather slow operation.
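A quick check of the integer expressions, as a sketch: assuming `r` is an unsigned integer, the shift-based form as posted maps odd `r` to +1 and even `r` to -1, which is the opposite of what `-((int)r&1) | 1` produces. The names below, including the flipped variant, are hypothetical and not from the thread.

```cpp
// Expression as posted: -1 + ((r & 1) << 1)
inline int sign_shift(unsigned r) {
    return -1 + static_cast<int>((r & 1u) << 1);  // r odd -> +1, r even -> -1
}

// Marcel's expression: -((int)r&1) | 1
inline int sign_or(unsigned r) {
    return -static_cast<int>(r & 1u) | 1;         // r odd -> -1, r even -> +1
}

// Hypothetical variant that follows the same sign convention as sign_or.
inline int sign_shift_flipped(unsigned r) {
    return 1 - static_cast<int>((r & 1u) << 1);
}
```

In every case the result is an `int`, so using it as a `double` still requires an integer-to-floating-point conversion, which is the cost raised in the reply above.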
On 01.05.2024 at 18:46, David Brown wrote:
That's the kind of thing that is at least vaguely relevant here. The
number of instructions means little (especially when it is not clear
that your code involves fewer bytes) - the time taken for these
instructions is the important thing for claims of "most performant" code.
Every instruction in my code has a single-cycle latency.
Marcel's code
has single-cycle latencies except for the last instruction, which takes
seven cycles on my Zen4 CPU. Take the numbers from agner.org; they're
similar for all modern x86 incarnations.
But even with that, all you've got is a claim that one obscure
expression might be slightly faster than some other obscure expression,
Optimized code is often less readable.
What you need to do is find some reason for wanting the original
expression, where you need to evaluate it vast numbers of times.
My code takes three clock cycles on all modern x86-CPUs.
On 02.05.2024 at 16:00, David Brown wrote:
Surely you know that is not sufficient to justify claims of
performance? Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.
Latency is the time until a result is available. Fewer instructions make it more likely that other instructions can occupy free execution units, so the comparison can be reduced the way I did.
That makes it bad code. ...
Absolutely not. A bubble sort is easier to read than a quicksort, but no
one would choose bubble sort for readability. And for me such tricks are readable, but they overburden you.
On 03.05.2024 at 11:16, David Brown wrote:
On 02/05/2024 16:25, Bonita Montero wrote:
On 02.05.2024 at 16:00, David Brown wrote:
Surely you know that is not sufficient to justify claims of
performance? Latency is certainly a factor, but so is scheduling,
parallel execution, bandwidth for instruction caches and queues,
pipeline hazards and result forwarding, and a dozen other factors.
Latency is the time until a result is available. Fewer instructions make
it more likely that other instructions can occupy free execution units,
so the comparison can be reduced the way I did.
I know what "latency" means, and why it is only one of many factors to
consider when looking at performance.
Latency is the time until there's a result, and this matters.
That makes it bad code. ...
Please stop snipping relevant parts of posts - such as the reason why
your expression is bad code.
It's currently the fastest solution on x86 for this purpose.
So this isn't bad code.
Absolutely not. A bubble sort is easier to read than a quicksort, but no
one would choose bubble sort for readability. And for me such tricks are
readable, but they overburden you.
/You/ were the one who said your code was less readable. (No one else
had to say it because it is so obvious.)
You posted an expression that is an incomprehensible alternative to a
comprehensible but apparently useless expression. You are unable to
demonstrate that your re-write is actually faster in practice - you
can merely show that compilers generate fewer instructions for it than
a different variation of the original expression. ...
Compilers don't generate faster code for that. Try it on godbolt.
You can't give any uses of the expression, ...
I've one use for that and I extracted the code.
It's just smart-arse coding - the kind that impresses new programmers
that haven't learned how to write code properly.
You're overburdened with that style of code and you want to discuss
it away.
If you want to show anyone that your expression is actually a good
idea, you know what you have to do to demonstrate it.
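As a rough illustration of what such a demonstration might start from, here is a micro-benchmark sketch. It is not from the thread: the harness, the function names, and the choice of workload are all assumptions. It also carries a floating-point add chain through the loop, which may mask latency differences between the two variants, so any timings from it say little about real code.

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>

// Bit-pattern variant (memcpy stands in for C++20 std::bit_cast).
inline double bench_sign_bits(std::uint64_t r) {
    const std::uint64_t bits = (r << 63) | (0x3FFull << 52);
    double d;
    std::memcpy(&d, &bits, sizeof d);
    return d;
}

// Integer variant followed by an int-to-double conversion.
inline double bench_sign_int(std::uint64_t r) {
    return static_cast<double>(-static_cast<int>(r & 1u) | 1);
}

struct BenchResult {
    double checksum;                   // keeps the loop from being optimized away
    std::chrono::nanoseconds elapsed;  // wall-clock time for the whole loop
};

// Hypothetical harness: applies f to 0..iterations-1 and accumulates f(i) * i.
template <typename F>
BenchResult run_bench(F f, std::uint64_t iterations) {
    const auto start = std::chrono::steady_clock::now();
    double checksum = 0.0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        checksum += f(i) * static_cast<double>(i);
    }
    const auto stop = std::chrono::steady_clock::now();
    return {checksum,
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)};
}
```

Calling `run_bench(bench_sign_bits, n)` and `run_bench(bench_sign_int, n)` with the same `n` should give identical checksums; the elapsed times would need many repetitions and a realistic surrounding workload before they support any claim.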
On 03.05.2024 at 07:32, Marcel Mueller wrote:
So the only optimization you can do is r&1 ? -1. : 1. to avoid the
division. The compiler may optimize this to a conditional store.
There are currently no conditional moves for floating point values
with x86.
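To make the comparison concrete, here is a sketch of Marcel's ternary next to a variant that selects between the two bit patterns in the integer domain, where x86 does offer conditional moves. Whether a given compiler actually emits `cmov` for it is not guaranteed. `r` is assumed to be an unsigned 64-bit integer, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Marcel's suggestion: choose between two floating-point constants.
inline double sign_ternary(std::uint64_t r) {
    return (r & 1) ? -1.0 : 1.0;
}

// Hypothetical variant: choose between the bit patterns of -1.0 and +1.0
// as integers, then reinterpret the chosen pattern as a double.
inline double sign_select_bits(std::uint64_t r) {
    const std::uint64_t plus_one  = 0x3FF0000000000000ull;  // bit pattern of +1.0
    const std::uint64_t minus_one = 0xBFF0000000000000ull;  // bit pattern of -1.0
    const std::uint64_t bits = (r & 1) ? minus_one : plus_one;
    double d;
    std::memcpy(&d, &bits, sizeof d);  // same effect as C++20 std::bit_cast<double>(bits)
    return d;
}
```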
On 03.05.2024 at 14:38, David Brown wrote:
We agree it matters - your mistake is thinking it is the only thing
that matters.
Of course it is bad code - it is terrible, and would be rejected by
any code review for normal use. ...
Look at the glibc math code; it uses such tricks all over the library.