• Re: Most performant way to have "r % 2 ? -1.0 : 1.0"

    From Marcel Mueller@21:1/5 to All on Wed May 1 16:11:14 2024
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to Bonita Montero on Wed May 1 15:39:49 2024
    Bonita Montero <[email protected]> writes:
    Am 01.05.2024 um 16:11 schrieb Marcel Mueller:
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    Marcel

    Looks simpler, but your code is one instruction more:

    my code:
    movabsq $4607182418800017408, %rax
    salq $63, %rdi
    orq %rax, %rdi
    movq %rdi, %xmm0

    your code:
    andl $1, %edi
    pxor %xmm0, %xmm0
    negl %edi
    orl $1, %edi
    cvtsi2sdl %edi, %xmm0

    the pxor is "executed" in the fetch stage, so adds no latency
    (it is effectively a NOP).

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Wed May 1 18:46:31 2024
    On 01/05/2024 17:14, Bonita Montero wrote:
    Am 01.05.2024 um 17:05 schrieb Bonita Montero:
    Am 01.05.2024 um 16:11 schrieb Marcel Mueller:
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    Marcel

    Looks simpler, but your code is one instruction more:

    my code:
             movabsq $4607182418800017408, %rax
             salq    $63, %rdi
             orq     %rax, %rdi
             movq    %rdi, %xmm0

    your code:
             andl    $1, %edi
             pxor    %xmm0, %xmm0
             negl    %edi
             orl     $1, %edi
             cvtsi2sdl       %edi, %xmm0

    And I've seen that cvtsi2sdl has a six cylcle latency on my
    Zen4-CPU whereas the movq rdi, xmm0 only takes one clock cycle.


    That's the kind of thing that is at least vaguely relevant here. The
    number of instructions means little (especially when it is not clear
    that your code involves fewer bytes) - the time taken for these
    instructions is the important thing for claims of "most performant" code.

    But even with that, all you've got is a claim that one obscure
    expression might be slightly faster than some other obscure expression,
    on some particular machine. No one cares if a program runs a few
    nanoseconds faster - certainly not enough to accept a re-write like
    yours (or Marcel's).

    What you need to do is find some reason for wanting the original
    expression, where you need to evaluate it vast numbers of times.
    Perhaps use it in a big vector calculation or DSP algorithm. Write the
    full code out. Time it on a real machine, first with the original
    expression (from the subject line), then with your version, then with
    Marcel's. Compile with a good optimising compiler (such gcc or clang,
    not MSVC) with vectorisation appropriate for the processor you are using
    (use "-march=native"). Time the differences.

    My guess is that in real code, the difference will be negligible, but
    that the original simple and clear expression has the edge because it
    lets the compiler do more optimisations.

    Feel free to prove my guess wrong - it is, after all, just a guess.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to David Brown on Wed May 1 18:23:15 2024
    On 5/1/2024 9:46 AM, David Brown wrote:
    On 01/05/2024 17:14, Bonita Montero wrote:
    Am 01.05.2024 um 17:05 schrieb Bonita Montero:
    Am 01.05.2024 um 16:11 schrieb Marcel Mueller:
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    [redacted]


    What the heck, I'll throw this in:

    -1 + ((r & 1) << 1)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From red floyd@21:1/5 to red floyd on Wed May 1 18:25:19 2024
    On 5/1/2024 6:23 PM, red floyd wrote:
    On 5/1/2024 9:46 AM, David Brown wrote:
    On 01/05/2024 17:14, Bonita Montero wrote:
    Am 01.05.2024 um 17:05 schrieb Bonita Montero:
    Am 01.05.2024 um 16:11 schrieb Marcel Mueller:
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    [redacted]


    What the heck, I'll throw this in:

    -1 + ((r & 1) << 1)

    Oops. Sign reversal.

    1 - ((r & 1) << 1)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Thu May 2 16:04:37 2024
    On 02/05/2024 05:49, Bonita Montero wrote:
    Am 02.05.2024 um 03:23 schrieb red floyd:
    On 5/1/2024 9:46 AM, David Brown wrote:
    On 01/05/2024 17:14, Bonita Montero wrote:
    Am 01.05.2024 um 17:05 schrieb Bonita Montero:
    Am 01.05.2024 um 16:11 schrieb Marcel Mueller:
    Am 30.04.24 um 11:34 schrieb Bonita Montero:
    bit_cast<double>( (uint64_t)r << 63 | 0x3FFull << 52 )

    There is no need for conditional:

    -((int)r&1) | 1


    [redacted]


    What the heck, I'll throw this in:

    -1 + ((r & 1) << 1)


    The problem with that is that the result is converted to a floating
    point value which is a rather slow operation.

    That's only a problem if it is a problem in real code. You haven't
    shown any situation where it would be an issue.

    In common real-world situations where you have a variable that is set to
    -1 or 1 (integer or floating point), the next thing you will do is use
    that to multiply other data which you will then accumulate. If the
    -1/+1 expression is written simply, the compiler may optimise it away
    entirely - turning things into a conditional code that either adds or
    subtracts the later data. With your mess, it won't do that - it will
    have to use real multiplication and miss out on optimisation opportunities.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Thu May 2 16:00:36 2024
    On 01/05/2024 19:12, Bonita Montero wrote:
    Am 01.05.2024 um 18:46 schrieb David Brown:

    That's the kind of thing that is at least vaguely relevant here.  The
    number of instructions means little (especially when it is not clear
    that your code involves fewer bytes) - the time taken for these
    instructions is the important thing for claims of "most performant" code.

    Every instruction for my code has a single cycle latency.

    Surely you know that is not sufficient to justify claims of performance?
    Latency is certainly a factor, but so is scheduling, parallel
    execution, bandwidth for instruction caches and queues, pipeline hazards
    and result forwarding, and a dozen other factors.

    Marcel's code
    has single cycle latencies except for the last instruction, which takes
    seven cyles on my Zen4-CPU. Take the numbers from Agner org, they're
    similar for all modern x86-incarnations.

    But even with that, all you've got is a claim that one obscure
    expression might be slightly faster than some other obscure expression,

    Optimized code is often less readable.

    That makes it bad code. Sometimes it is acceptable because you need the
    very fastest results for a piece of code, but that is very rare. This
    is why I am asking you to justify the performance of your code in
    practice - justify that it actually /is/ faster, and justify that it is /usefully/ faster. Without that, it is smart-arse code, not smart code.

    (If you are just playing around for fun and the challenge of making such
    code sequences, that's fine.)


    What you need to do is find some reason for wanting the original
    expression, where you need to evaluate it vast numbers of times.

    My code takes three clock cycles on all modern x86-CPUs.


    That is almost as far from answering my question as it is possible to get.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Fri May 3 07:32:20 2024
    Am 01.05.24 um 17:14 schrieb Bonita Montero:
    And I've seen that cvtsi2sdl has a six cylcle latency on my
    Zen4-CPU whereas the movq rdi, xmm0 only takes one clock cycle.

    You are right, int to float conversions are not that optimized. And they require a dynamic shift operation for the integral logarithm of base 2.

    So the only optimization you can do is r&1 ? -1. : 1. to avoid the
    division. The compiler may optimize this to a conditional store.


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Fri May 3 11:16:49 2024
    On 02/05/2024 16:25, Bonita Montero wrote:
    Am 02.05.2024 um 16:00 schrieb David Brown:

    Surely you know that is not sufficient to justify claims of
    performance?   Latency is certainly a factor, but so is scheduling,
    parallel execution, bandwidth for instruction caches and queues,
    pipeline hazards and result forwarding, and a dozen other factors.

    Latency is the time until there's a result. Less instructions let other instructions more likely occupy free execution units. So this can be considered in the reduced way I did.


    I know what "latency" means, and why it is only one of many factors to
    consider when looking at performance.


    That makes it bad code. ...


    Please stop snipping relevant parts of posts - such as the reason why
    your expression is bad code.

    Absolutely not. A Bubblesort is easier to read than a qucksort but no
    one woul chose Bubblesort for readability. And for me such tricks are readable but they overburden you.


    /You/ were the one who said your code was less readable. (No one else
    had to say it because it is so obvious.)

    You posted an expression that is an incomprehensible alternative to a comprehensible but apparently useless expression. You are unable to demonstrate that your re-write is actually faster in practice - you can
    merely show that compilers generate fewer instructions for it than a
    different variation of the original expression. You can't give any uses
    of the expression, nor any measurements showing it is better in
    practice. Without seeing the effect in context, your code is utterly
    pointless and worse than useless.

    It's just smart-arse coding - the kind that impresses new programmers
    that haven't learned how to write code properly.

    If you want to show anyone that your expression is actually a good idea,
    you know what you have to do to demonstrate it.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Fri May 3 14:38:58 2024
    On 03/05/2024 13:35, Bonita Montero wrote:
    Am 03.05.2024 um 11:16 schrieb David Brown:
    On 02/05/2024 16:25, Bonita Montero wrote:
    Am 02.05.2024 um 16:00 schrieb David Brown:

    Surely you know that is not sufficient to justify claims of
    performance?   Latency is certainly a factor, but so is scheduling,
    parallel execution, bandwidth for instruction caches and queues,
    pipeline hazards and result forwarding, and a dozen other factors.

    Latency is the time until there's a result. Less instructions let other
    instructions more likely occupy free execution units. So this can be
    considered in the reduced way I did.


    I know what "latency" means, and why it is only one of many factors to
    consider when looking at performance.

    Latency is the time until there's a result, and this matters.


    We agree it matters - your mistake is thinking it is the only thing that matters.



    That makes it bad code. ...


    Please stop snipping relevant parts of posts - such as the reason why
    your expression is bad code.

    It's currently the fastest solution on x86 for this purpose.

    The only purpose you've demonstrated so far is for your bragging rights
    as the biggest smart-arse in the group. (I know English is not your
    first language, but to be clear, this is not a compliment.)

    So this isn't bad code.


    Of course it is bad code - it is terrible, and would be rejected by any
    code review for normal use. Obscure micro-optimisations like that are
    worth considering in two situations - in compilers as code they might
    generate for optimisations, and buried deep in very high performance
    libraries as part of critical code. You are doing neither.



    Absolutely not. A Bubblesort is easier to read than a qucksort but no
    one woul chose Bubblesort for readability. And for me such tricks are
    readable but they overburden you.


    /You/ were the one who said your code was less readable.  (No one else
    had to say it because it is so obvious.)

    You posted an expression that is an incomprehensible alternative to a
    comprehensible but apparently useless expression.  You are unable to
    demonstrate that your re-write is actually faster in practice - you
    can merely show that compilers generate fewer instructions for it than
    a different variation of the original expression. ...

    Compilers don't generate faster code for that. Try it on godbolt.

    Godbolt does not show how fast code is any more than you do.

    Do you understand the importance of context in code performance? Do you understand about the interaction between different parts of code? Or
    that compilers are not dumb translators handling single expressions in isolation? Have you any idea at all about what is /relevant/ in real programming?

    I realise that it is cool and clever to have figured out a trick for
    this expression. I've nothing against that - it's fun, and fun is good.
    If that's all you are trying to show, then that's fine. But if you
    want to claim it is "good" code, or useful, or makes a difference, then
    you have to show that. At the moment, what you have is of no use in programming.



    You can't give any uses of the expression, ...

    I've one use for that and I extracted the code.

    Then show the rest of the code, along with the timing comparisons I
    suggested.


    It's just smart-arse coding - the kind that impresses new programmers
    that haven't learned how to write code properly.

    You're overburdened with that style of code and you want to discuss
    it away.

    If you want to show anyone that your expression is actually a good
    idea, you know what you have to do to demonstrate it.




    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Marcel Mueller@21:1/5 to All on Fri May 3 16:46:28 2024
    Am 03.05.24 um 08:37 schrieb Bonita Montero:
    Am 03.05.2024 um 07:32 schrieb Marcel Mueller:

    So the only optimization you can do is r&1 ? -1. : 1. to avoid the
    division. The compiler may optimize this to a conditional store.

    There are currently no conditional moves for floating point values
    with x86.

    Interesting. The GPU of a Raspberry Pi Model 1 can do Job. :-)


    Marcel

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Brown@21:1/5 to Bonita Montero on Fri May 3 17:16:59 2024
    On 03/05/2024 14:51, Bonita Montero wrote:
    Am 03.05.2024 um 14:38 schrieb David Brown:

    We agree it matters - your mistake is thinking it is the only thing
    that matters.

    Of course it is bad code - it is terrible, and would be rejected by
    any code review for normal use. ...

    Look at the glibc math-Code, it's doing such tricks all over the lib.

    There is not much point discussing anything with you until you learn the
    basics of communication.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)