• eventcount

    From jseigh@21:1/5 to All on Fri Jan 31 08:48:44 2025
    Eventcounts are nifty. I can get a 25x performance improvement using eventcounts over the standard mutex/cvar based solution.

    They're pretty trivial to implement. But since they were only invented
    in 1979, probably too soon to hope they would be in the standard lib in
    the foreseeable future.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Paavo Helde@21:1/5 to jseigh on Fri Jan 31 20:49:40 2025
    On 31.01.2025 15:48, jseigh wrote:
    Eventcounts are nifty.  I can get a 25x performance improvement using eventcounts over the standard mutex/cvar based solution.

    They're pretty trivial to implement.  But since they were only invented
    in 1979, probably too soon to hope they would be in the standard lib in
    the foreseeable future.

    I'm not an expert in this field, but at first glance the eventcounts
    seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
    the standard since C++20).

    So, would it be possible to compare eventcounts with c++20 std::atomic (functionality and performance wise)?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Paavo Helde on Fri Jan 31 15:47:57 2025
    On 1/31/25 13:49, Paavo Helde wrote:
    On 31.01.2025 15:48, jseigh wrote:
    Eventcounts are nifty.  I can get a 25x performance improvement using
    eventcounts over the standard mutex/cvar based solution.

    They're pretty trivial to implement.  But since they were only invented
    in 1979, probably too soon to hope they would be in the standard lib in
    the foreseeable future.

    I'm not an expert in this field, but at first glance the eventcounts
    seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
    the standard since C++20).

    So, would it be possible to compare eventcounts with c++20 std::atomic (functionality and performance wise)?

    You can compare it but it wouldn't be a fair comparison. The eventcount
    api let's you do optimizations that std::atomic can't even if it has
    platform support like linux futex, windows WaitOnAddress, etc..

    E.g. folly's eventcount api
    void notify() noexcept;
    void notifyAll() noexcept;
    Key prepareWait() noexcept;
    void cancelWait() noexcept;
    void wait(Key key) noexcept;

    let's you optimize notify to basically do nothing if there
    are no waiters.

    For std::atomic, you'd have to do an atomic increment and
    a system call on every single notify.

    And if there's no platform support for <T> then the implementation
    is going to use mutex and condvars.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Sun Feb 2 10:07:45 2025
    On 1/31/25 15:47, jseigh wrote:
    On 1/31/25 13:49, Paavo Helde wrote:
    On 31.01.2025 15:48, jseigh wrote:
    Eventcounts are nifty.  I can get a 25x performance improvement using
    eventcounts over the standard mutex/cvar based solution.

    They're pretty trivial to implement.  But since they were only invented >>> in 1979, probably too soon to hope they would be in the standard lib in
    the foreseeable future.

    I'm not an expert in this field, but at first glance the eventcounts
    seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
    the standard since C++20).

    So, would it be possible to compare eventcounts with c++20 std::atomic
    (functionality and performance wise)?

    You can compare it but it wouldn't be a fair comparison.  The eventcount
    api let's you do optimizations that std::atomic can't even if it has
    platform support like linux futex, windows WaitOnAddress, etc..

    E.g. folly's eventcount api
      void notify() noexcept;
      void notifyAll() noexcept;
      Key prepareWait() noexcept;
      void cancelWait() noexcept;
      void wait(Key key) noexcept;

    let's you optimize notify to basically do nothing if there
    are no waiters.

    For std::atomic, you'd have to do an atomic increment and
    a system call on every single notify.

    And if there's no platform support for <T> then the implementation
    is going to use mutex and condvars.


    I did a comparison to std::atomic<uint32_t> on linux which is probably
    using a futex. It's about 2x to 3x slower than using eventcounts.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to jseigh on Sun Feb 2 17:58:52 2025
    jseigh <[email protected]> writes:
    On 1/31/25 15:47, jseigh wrote:
    On 1/31/25 13:49, Paavo Helde wrote:
    On 31.01.2025 15:48, jseigh wrote:
    Eventcounts are nifty.  I can get a 25x performance improvement using >>>> eventcounts over the standard mutex/cvar based solution.

    They're pretty trivial to implement.  But since they were only invented >>>> in 1979, probably too soon to hope they would be in the standard lib in >>>> the foreseeable future.

    I'm not an expert in this field, but at first glance the eventcounts
    seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
    the standard since C++20).

    So, would it be possible to compare eventcounts with c++20 std::atomic
    (functionality and performance wise)?

    You can compare it but it wouldn't be a fair comparison.  The eventcount
    api let's you do optimizations that std::atomic can't even if it has
    platform support like linux futex, windows WaitOnAddress, etc..

    E.g. folly's eventcount api
      void notify() noexcept;
      void notifyAll() noexcept;
      Key prepareWait() noexcept;
      void cancelWait() noexcept;
      void wait(Key key) noexcept;

    let's you optimize notify to basically do nothing if there
    are no waiters.

    For std::atomic, you'd have to do an atomic increment and
    a system call on every single notify.

    And if there's no platform support for <T> then the implementation
    is going to use mutex and condvars.


    I did a comparison to std::atomic<uint32_t> on linux which is probably
    using a futex. It's about 2x to 3x slower than using eventcounts.

    Try it with -O2 or -O3.

    #include <atomic>

    int
    main(int argc, const char **argv, const char **envp, const char **auxv)
    {

    std::atomic<uint32_t> variable;

    variable = 0x13515;

    return variable;
    }

    Disassembly of section .text:

    0000000000401020 <main>:
    401020: b8 15 35 01 00 mov $0x13515,%eax
    401025: 87 44 24 fc xchg %eax,-0x4(%rsp)
    401029: 8b 44 24 fc mov -0x4(%rsp),%eax
    40102d: c3 ret
    40102e: 66 90 xchg %ax,%ax

    Without optimization, g++ generates library calls
    for the atomic accesses.

    0000000000401106 <main>:
    401106: 55 push %rbp
    401107: 48 89 e5 mov %rsp,%rbp
    40110a: 48 83 ec 30 sub $0x30,%rsp
    40110e: 89 7d ec mov %edi,-0x14(%rbp)
    401111: 48 89 75 e0 mov %rsi,-0x20(%rbp)
    401115: 48 89 55 d8 mov %rdx,-0x28(%rbp)
    401119: 48 89 4d d0 mov %rcx,-0x30(%rbp)
    40111d: 48 8d 45 fc lea -0x4(%rbp),%rax
    401121: be 15 35 01 00 mov $0x13515,%esi
    401126: 48 89 c7 mov %rax,%rdi
    401129: e8 22 00 00 00 call 401150 <std::__atomic_base<unsigned int>::operator=(unsigned int)>
    40112e: 48 8d 45 fc lea -0x4(%rbp),%rax
    401132: 48 89 c7 mov %rax,%rdi
    401135: e8 9e 00 00 00 call 4011d8 <std::__atomic_base<unsigned int>::operator unsigned int() const>
    40113a: c9 leave
    40113b: c3 ret

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Tue Feb 4 19:59:46 2025
    On 2/4/25 12:33, Bonita Montero wrote:
    Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:
    On 2/3/2025 8:25 PM, Bonita Montero wrote:
    Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

    Sigh.

    These poll-only lock-free algorithms don't make sense.

    Why do you assume poll-only?

    Lock-free don't has kernel-waits.


    The underlying queue is lock-free. Adding the eventcounts means
    that it isn't technically lock-free any more. It's just a concurrent
    queue. But one that's 25x faster than a concurrent queue
    using mutexes and condvars. You are free to use the slower
    implementations. Who's to know the difference if you don't tell
    them.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From [email protected]@21:1/5 to All on Wed Feb 5 08:38:36 2025
    On Tue, 4 Feb 2025 19:59:46 -0500
    jseigh <[email protected]> wibbled:
    On 2/4/25 12:33, Bonita Montero wrote:
    Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:
    On 2/3/2025 8:25 PM, Bonita Montero wrote:
    Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

    Sigh.

    These poll-only lock-free algorithms don't make sense.

    Why do you assume poll-only?

    Lock-free don't has kernel-waits.


    The underlying queue is lock-free. Adding the eventcounts means
    that it isn't technically lock-free any more. It's just a concurrent
    queue. But one that's 25x faster than a concurrent queue
    using mutexes and condvars. You are free to use the slower
    implementations. Who's to know the difference if you don't tell
    them.

    The kernel could implement these in any way it sees fit, so long as it stops and restarts the user space program at the correct time in, correct state with the correct data. Underlying implementations should always be black box to
    user space.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to [email protected] on Wed Feb 5 06:59:41 2025
    On 2/5/25 03:38, [email protected] wrote:
    On Tue, 4 Feb 2025 19:59:46 -0500
    jseigh <[email protected]> wibbled:
    On 2/4/25 12:33, Bonita Montero wrote:
    Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:
    On 2/3/2025 8:25 PM, Bonita Montero wrote:
    Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

    Sigh.

    These poll-only lock-free algorithms don't make sense.

    Why do you assume poll-only?

    Lock-free don't has kernel-waits.


    The underlying queue is lock-free. Adding the eventcounts means
    that it isn't technically lock-free any more. It's just a concurrent
    queue. But one that's 25x faster than a concurrent queue
    using mutexes and condvars. You are free to use the slower
    implementations. Who's to know the difference if you don't tell
    them.

    The kernel could implement these in any way it sees fit, so long as it stops and restarts the user space program at the correct time in, correct state with
    the correct data. Underlying implementations should always be black box to user space.


    No argument there. Not entirely sure how this is germane to user
    space code. Please clarify a bit more.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From [email protected]@21:1/5 to All on Wed Feb 5 13:30:27 2025
    On Wed, 5 Feb 2025 06:59:41 -0500
    jseigh <[email protected]> wibbled:
    On 2/5/25 03:38, [email protected] wrote:
    On Tue, 4 Feb 2025 19:59:46 -0500
    jseigh <[email protected]> wibbled:
    On 2/4/25 12:33, Bonita Montero wrote:
    Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:
    On 2/3/2025 8:25 PM, Bonita Montero wrote:
    Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

    Sigh.

    These poll-only lock-free algorithms don't make sense.

    Why do you assume poll-only?

    Lock-free don't has kernel-waits.


    The underlying queue is lock-free. Adding the eventcounts means
    that it isn't technically lock-free any more. It's just a concurrent
    queue. But one that's 25x faster than a concurrent queue
    using mutexes and condvars. You are free to use the slower
    implementations. Who's to know the difference if you don't tell
    them.

    The kernel could implement these in any way it sees fit, so long as it stops >> and restarts the user space program at the correct time in, correct state >with
    the correct data. Underlying implementations should always be black box to >> user space.


    No argument there. Not entirely sure how this is germane to user
    space code. Please clarify a bit more.

    Did you miss the bits above mentioning kernel-waits and underlying implementations?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to [email protected] on Wed Feb 5 12:15:18 2025
    On 2/5/25 08:30, [email protected] wrote:
    On Wed, 5 Feb 2025 06:59:41 -0500
    jseigh <[email protected]> wibbled:
    On 2/5/25 03:38, [email protected] wrote:
    On Tue, 4 Feb 2025 19:59:46 -0500
    jseigh <[email protected]> wibbled:
    On 2/4/25 12:33, Bonita Montero wrote:
    Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:
    On 2/3/2025 8:25 PM, Bonita Montero wrote:
    Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

    Sigh.

    These poll-only lock-free algorithms don't make sense.

    Why do you assume poll-only?

    Lock-free don't has kernel-waits.


    The underlying queue is lock-free. Adding the eventcounts means
    that it isn't technically lock-free any more. It's just a concurrent
    queue. But one that's 25x faster than a concurrent queue
    using mutexes and condvars. You are free to use the slower
    implementations. Who's to know the difference if you don't tell
    them.

    The kernel could implement these in any way it sees fit, so long as it stops
    and restarts the user space program at the correct time in, correct state >> with
    the correct data. Underlying implementations should always be black box to >>> user space.


    No argument there. Not entirely sure how this is germane to user
    space code. Please clarify a bit more.

    Did you miss the bits above mentioning kernel-waits and underlying implementations?


    Ok, I see the source of your confusion. Eventcounts and mutex/condvars
    are 2 different synchronization mechanisms. On linux they both use
    futexes. The mutex/condvar is slower not because of its implementation
    but because there is a lock involved. If a thread time slice ends
    with it holding the lock, other threads get blocked. With eventcounts
    it doesn't matter if and where its time slice end happens because it
    won't affect the progress of other threads. That's basically where
    the 25x difference in performance is.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From [email protected]@21:1/5 to All on Thu Feb 6 08:19:29 2025
    On Wed, 5 Feb 2025 12:15:18 -0500
    jseigh <[email protected]> wibbled:
    On 2/5/25 08:30, [email protected] wrote:
    Did you miss the bits above mentioning kernel-waits and underlying
    implementations?


    Ok, I see the source of your confusion. Eventcounts and mutex/condvars
    are 2 different synchronization mechanisms. On linux they both use
    futexes. The mutex/condvar is slower not because of its implementation
    but because there is a lock involved. If a thread time slice ends
    with it holding the lock, other threads get blocked. With eventcounts
    it doesn't matter if and where its time slice end happens because it
    won't affect the progress of other threads. That's basically where
    the 25x difference in performance is.

    Presumably however a lock must be involved somewhere even if in kernel space when waiting on a futex/eventcount otherwise you risk race conditions. Also
    I don't understand people talking about polling - if polling were a solution
    to thread synchronisation then mutexes wouldn't exist.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to [email protected] on Thu Feb 6 11:53:44 2025
    On 2/6/25 03:19, [email protected] wrote:
    On Wed, 5 Feb 2025 12:15:18 -0500
    jseigh <[email protected]> wibbled:
    On 2/5/25 08:30, [email protected] wrote:
    Did you miss the bits above mentioning kernel-waits and underlying
    implementations?


    Ok, I see the source of your confusion. Eventcounts and mutex/condvars
    are 2 different synchronization mechanisms. On linux they both use
    futexes. The mutex/condvar is slower not because of its implementation
    but because there is a lock involved. If a thread time slice ends
    with it holding the lock, other threads get blocked. With eventcounts
    it doesn't matter if and where its time slice end happens because it
    won't affect the progress of other threads. That's basically where
    the 25x difference in performance is.

    Presumably however a lock must be involved somewhere even if in kernel space when waiting on a futex/eventcount otherwise you risk race conditions. Also

    The kernel is non-preemptive. It typically uses short term locks, spin
    locks mostly, to protect kernel data structures. futex wait is only
    going to hold a lock long enough to put the waiting user thread on a
    wait queue. A thread in futex wait won't block any other user threads.
    If another thread does a wake on the futex, the kernal will get its
    internal lock, select the appropriate number of threads to wake,
    remove them from the futex wait queue and put them in the scheduler
    run queue.


    I don't understand people talking about polling - if polling were a solution to thread synchronisation then mutexes wouldn't exist.

    Somebody thinks there is polling/busy waiting involved. There isn't.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Sun Feb 9 20:57:11 2025
    On 2/2/25 10:07, jseigh wrote:
    On 1/31/25 15:47, jseigh wrote:



    I did a comparison to std::atomic<uint32_t> on linux which is probably
    using a futex.  It's about 2x to 3x slower than using eventcounts.


    I also tried with counting semaphores. Also about 3x slower than
    eventcounts but slightly faster than std::atomic.

    A little bit hacky getting semaphores to work with closeable queues
    but all good.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)