Forum: >>> Magnum BBS <<<

eventcount

From jseigh@21:1/5 to All on Fri Jan 31 08:48:44 2025

Eventcounts are nifty. I can get a 25x performance improvement using eventcounts over the standard mutex/cvar based solution.

They're pretty trivial to implement. But since they were only invented
in 1979, probably too soon to hope they would be in the standard lib in
the foreseeable future.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Paavo Helde@21:1/5 to jseigh on Fri Jan 31 20:49:40 2025

On 31.01.2025 15:48, jseigh wrote:

Eventcounts are nifty. I can get a 25x performance improvement using eventcounts over the standard mutex/cvar based solution.

They're pretty trivial to implement. But since they were only invented
in 1979, probably too soon to hope they would be in the standard lib in
the foreseeable future.

I'm not an expert in this field, but at first glance the eventcounts
seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
the standard since C++20).

So, would it be possible to compare eventcounts with c++20 std::atomic (functionality and performance wise)?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Paavo Helde on Fri Jan 31 15:47:57 2025

On 1/31/25 13:49, Paavo Helde wrote:

On 31.01.2025 15:48, jseigh wrote:

Eventcounts are nifty. I can get a 25x performance improvement using
eventcounts over the standard mutex/cvar based solution.

They're pretty trivial to implement. But since they were only invented
in 1979, probably too soon to hope they would be in the standard lib in
the foreseeable future.

I'm not an expert in this field, but at first glance the eventcounts
seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
the standard since C++20).

So, would it be possible to compare eventcounts with c++20 std::atomic (functionality and performance wise)?

You can compare it but it wouldn't be a fair comparison. The eventcount
api let's you do optimizations that std::atomic can't even if it has
platform support like linux futex, windows WaitOnAddress, etc..

E.g. folly's eventcount api
void notify() noexcept;
void notifyAll() noexcept;
Key prepareWait() noexcept;
void cancelWait() noexcept;
void wait(Key key) noexcept;

let's you optimize notify to basically do nothing if there
are no waiters.

For std::atomic, you'd have to do an atomic increment and
a system call on every single notify.

And if there's no platform support for <T> then the implementation
is going to use mutex and condvars.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Sun Feb 2 10:07:45 2025

On 1/31/25 15:47, jseigh wrote:

On 1/31/25 13:49, Paavo Helde wrote:

On 31.01.2025 15:48, jseigh wrote:

Eventcounts are nifty. I can get a 25x performance improvement using
eventcounts over the standard mutex/cvar based solution.

They're pretty trivial to implement. But since they were only invented >>> in 1979, probably too soon to hope they would be in the standard lib in
the foreseeable future.

I'm not an expert in this field, but at first glance the eventcounts
seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
the standard since C++20).

So, would it be possible to compare eventcounts with c++20 std::atomic
(functionality and performance wise)?

You can compare it but it wouldn't be a fair comparison. The eventcount
api let's you do optimizations that std::atomic can't even if it has
platform support like linux futex, windows WaitOnAddress, etc..

E.g. folly's eventcount api
void notify() noexcept;
void notifyAll() noexcept;
Key prepareWait() noexcept;
void cancelWait() noexcept;
void wait(Key key) noexcept;

let's you optimize notify to basically do nothing if there
are no waiters.

For std::atomic, you'd have to do an atomic increment and
a system call on every single notify.

And if there's no platform support for <T> then the implementation
is going to use mutex and condvars.

I did a comparison to std::atomic<uint32_t> on linux which is probably
using a futex. It's about 2x to 3x slower than using eventcounts.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to jseigh on Sun Feb 2 17:58:52 2025

jseigh <[email protected]> writes:

On 1/31/25 15:47, jseigh wrote:

On 1/31/25 13:49, Paavo Helde wrote:

On 31.01.2025 15:48, jseigh wrote:

Eventcounts are nifty. I can get a 25x performance improvement using >>>> eventcounts over the standard mutex/cvar based solution.

They're pretty trivial to implement. But since they were only invented >>>> in 1979, probably too soon to hope they would be in the standard lib in >>>> the foreseeable future.

I'm not an expert in this field, but at first glance the eventcounts
seem similar to std::atomic<T>::wait(), std::atomic<T>::notify_*() (in
the standard since C++20).

So, would it be possible to compare eventcounts with c++20 std::atomic
(functionality and performance wise)?

You can compare it but it wouldn't be a fair comparison. The eventcount
api let's you do optimizations that std::atomic can't even if it has
platform support like linux futex, windows WaitOnAddress, etc..

E.g. folly's eventcount api
void notify() noexcept;
void notifyAll() noexcept;
Key prepareWait() noexcept;
void cancelWait() noexcept;
void wait(Key key) noexcept;

let's you optimize notify to basically do nothing if there
are no waiters.

For std::atomic, you'd have to do an atomic increment and
a system call on every single notify.

And if there's no platform support for <T> then the implementation
is going to use mutex and condvars.

I did a comparison to std::atomic<uint32_t> on linux which is probably
using a futex. It's about 2x to 3x slower than using eventcounts.

Try it with -O2 or -O3.

#include <atomic>

int
main(int argc, const char **argv, const char **envp, const char **auxv)
{

std::atomic<uint32_t> variable;

variable = 0x13515;

return variable;
}

Disassembly of section .text:

0000000000401020 <main>:
401020: b8 15 35 01 00 mov $0x13515,%eax
401025: 87 44 24 fc xchg %eax,-0x4(%rsp)
401029: 8b 44 24 fc mov -0x4(%rsp),%eax
40102d: c3 ret
40102e: 66 90 xchg %ax,%ax

Without optimization, g++ generates library calls
for the atomic accesses.

0000000000401106 <main>:
401106: 55 push %rbp
401107: 48 89 e5 mov %rsp,%rbp
40110a: 48 83 ec 30 sub $0x30,%rsp
40110e: 89 7d ec mov %edi,-0x14(%rbp)
401111: 48 89 75 e0 mov %rsi,-0x20(%rbp)
401115: 48 89 55 d8 mov %rdx,-0x28(%rbp)
401119: 48 89 4d d0 mov %rcx,-0x30(%rbp)
40111d: 48 8d 45 fc lea -0x4(%rbp),%rax
401121: be 15 35 01 00 mov $0x13515,%esi
401126: 48 89 c7 mov %rax,%rdi
401129: e8 22 00 00 00 call 401150 <std::__atomic_base<unsigned int>::operator=(unsigned int)>
40112e: 48 8d 45 fc lea -0x4(%rbp),%rax
401132: 48 89 c7 mov %rax,%rdi
401135: e8 9e 00 00 00 call 4011d8 <std::__atomic_base<unsigned int>::operator unsigned int() const>
40113a: c9 leave
40113b: c3 ret

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Tue Feb 4 19:59:46 2025

On 2/4/25 12:33, Bonita Montero wrote:

Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:

On 2/3/2025 8:25 PM, Bonita Montero wrote:

Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

Sigh.

These poll-only lock-free algorithms don't make sense.

Why do you assume poll-only?

Lock-free don't has kernel-waits.

The underlying queue is lock-free. Adding the eventcounts means
that it isn't technically lock-free any more. It's just a concurrent
queue. But one that's 25x faster than a concurrent queue
using mutexes and condvars. You are free to use the slower
implementations. Who's to know the difference if you don't tell
them.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to All on Wed Feb 5 08:38:36 2025

On Tue, 4 Feb 2025 19:59:46 -0500
jseigh <[email protected]> wibbled:

On 2/4/25 12:33, Bonita Montero wrote:

Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:

On 2/3/2025 8:25 PM, Bonita Montero wrote:

Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

Sigh.

These poll-only lock-free algorithms don't make sense.

Why do you assume poll-only?

Lock-free don't has kernel-waits.

The underlying queue is lock-free. Adding the eventcounts means
that it isn't technically lock-free any more. It's just a concurrent
queue. But one that's 25x faster than a concurrent queue
using mutexes and condvars. You are free to use the slower
implementations. Who's to know the difference if you don't tell
them.

The kernel could implement these in any way it sees fit, so long as it stops and restarts the user space program at the correct time in, correct state with the correct data. Underlying implementations should always be black box to
user space.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to [email protected] on Wed Feb 5 06:59:41 2025

On 2/5/25 03:38, [email protected] wrote:

On Tue, 4 Feb 2025 19:59:46 -0500
jseigh <[email protected]> wibbled:

On 2/4/25 12:33, Bonita Montero wrote:

Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:

On 2/3/2025 8:25 PM, Bonita Montero wrote:

Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

Sigh.

These poll-only lock-free algorithms don't make sense.

Why do you assume poll-only?

Lock-free don't has kernel-waits.

The underlying queue is lock-free. Adding the eventcounts means
that it isn't technically lock-free any more. It's just a concurrent
queue. But one that's 25x faster than a concurrent queue
using mutexes and condvars. You are free to use the slower
implementations. Who's to know the difference if you don't tell
them.

The kernel could implement these in any way it sees fit, so long as it stops and restarts the user space program at the correct time in, correct state with
the correct data. Underlying implementations should always be black box to user space.

No argument there. Not entirely sure how this is germane to user
space code. Please clarify a bit more.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to All on Wed Feb 5 13:30:27 2025

On Wed, 5 Feb 2025 06:59:41 -0500
jseigh <[email protected]> wibbled:

On 2/5/25 03:38, [email protected] wrote:

On Tue, 4 Feb 2025 19:59:46 -0500
jseigh <[email protected]> wibbled:

On 2/4/25 12:33, Bonita Montero wrote:

Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:

On 2/3/2025 8:25 PM, Bonita Montero wrote:

Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

Sigh.

These poll-only lock-free algorithms don't make sense.

Why do you assume poll-only?

Lock-free don't has kernel-waits.

The underlying queue is lock-free. Adding the eventcounts means
that it isn't technically lock-free any more. It's just a concurrent
queue. But one that's 25x faster than a concurrent queue
using mutexes and condvars. You are free to use the slower
implementations. Who's to know the difference if you don't tell
them.

The kernel could implement these in any way it sees fit, so long as it stops >> and restarts the user space program at the correct time in, correct state >with
the correct data. Underlying implementations should always be black box to >> user space.

No argument there. Not entirely sure how this is germane to user
space code. Please clarify a bit more.

Did you miss the bits above mentioning kernel-waits and underlying implementations?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to [email protected] on Wed Feb 5 12:15:18 2025

On 2/5/25 08:30, [email protected] wrote:

On Wed, 5 Feb 2025 06:59:41 -0500
jseigh <[email protected]> wibbled:

On 2/5/25 03:38, [email protected] wrote:

On Tue, 4 Feb 2025 19:59:46 -0500
jseigh <[email protected]> wibbled:

On 2/4/25 12:33, Bonita Montero wrote:

Am 04.02.2025 um 09:10 schrieb Chris M. Thomasson:

On 2/3/2025 8:25 PM, Bonita Montero wrote:

Am 04.02.2025 um 01:17 schrieb Chris M. Thomasson:

Sigh.

These poll-only lock-free algorithms don't make sense.

Why do you assume poll-only?

Lock-free don't has kernel-waits.

The underlying queue is lock-free. Adding the eventcounts means
that it isn't technically lock-free any more. It's just a concurrent
queue. But one that's 25x faster than a concurrent queue
using mutexes and condvars. You are free to use the slower
implementations. Who's to know the difference if you don't tell
them.

The kernel could implement these in any way it sees fit, so long as it stops
and restarts the user space program at the correct time in, correct state >> with
the correct data. Underlying implementations should always be black box to >>> user space.

No argument there. Not entirely sure how this is germane to user
space code. Please clarify a bit more.

Did you miss the bits above mentioning kernel-waits and underlying implementations?

Ok, I see the source of your confusion. Eventcounts and mutex/condvars
are 2 different synchronization mechanisms. On linux they both use
futexes. The mutex/condvar is slower not because of its implementation
but because there is a lock involved. If a thread time slice ends
with it holding the lock, other threads get blocked. With eventcounts
it doesn't matter if and where its time slice end happens because it
won't affect the progress of other threads. That's basically where
the 25x difference in performance is.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to All on Thu Feb 6 08:19:29 2025

On Wed, 5 Feb 2025 12:15:18 -0500
jseigh <[email protected]> wibbled:

On 2/5/25 08:30, [email protected] wrote:

Did you miss the bits above mentioning kernel-waits and underlying
implementations?

Ok, I see the source of your confusion. Eventcounts and mutex/condvars
are 2 different synchronization mechanisms. On linux they both use
futexes. The mutex/condvar is slower not because of its implementation
but because there is a lock involved. If a thread time slice ends
with it holding the lock, other threads get blocked. With eventcounts
it doesn't matter if and where its time slice end happens because it
won't affect the progress of other threads. That's basically where
the 25x difference in performance is.

Presumably however a lock must be involved somewhere even if in kernel space when waiting on a futex/eventcount otherwise you risk race conditions. Also
I don't understand people talking about polling - if polling were a solution
to thread synchronisation then mutexes wouldn't exist.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to [email protected] on Thu Feb 6 11:53:44 2025

On 2/6/25 03:19, [email protected] wrote:

On Wed, 5 Feb 2025 12:15:18 -0500
jseigh <[email protected]> wibbled:

On 2/5/25 08:30, [email protected] wrote:

Did you miss the bits above mentioning kernel-waits and underlying
implementations?

Ok, I see the source of your confusion. Eventcounts and mutex/condvars
are 2 different synchronization mechanisms. On linux they both use
futexes. The mutex/condvar is slower not because of its implementation
but because there is a lock involved. If a thread time slice ends
with it holding the lock, other threads get blocked. With eventcounts
it doesn't matter if and where its time slice end happens because it
won't affect the progress of other threads. That's basically where
the 25x difference in performance is.

Presumably however a lock must be involved somewhere even if in kernel space when waiting on a futex/eventcount otherwise you risk race conditions. Also

The kernel is non-preemptive. It typically uses short term locks, spin
locks mostly, to protect kernel data structures. futex wait is only
going to hold a lock long enough to put the waiting user thread on a
wait queue. A thread in futex wait won't block any other user threads.
If another thread does a wake on the futex, the kernal will get its
internal lock, select the appropriate number of threads to wake,
remove them from the futex wait queue and put them in the scheduler
run queue.

I don't understand people talking about polling - if polling were a solution to thread synchronisation then mutexes wouldn't exist.

Somebody thinks there is polling/busy waiting involved. There isn't.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Sun Feb 9 20:57:11 2025

On 2/2/25 10:07, jseigh wrote:

On 1/31/25 15:47, jseigh wrote:

I did a comparison to std::atomic<uint32_t> on linux which is probably
using a futex. It's about 2x to 3x slower than using eventcounts.

I also tried with counting semaphores. Also about 3x slower than
eventcounts but slightly faster than std::atomic.

A little bit hacky getting semaphores to work with closeable queues
but all good.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Thu Jul 30 14:17:17 2026
  from Madison, Nc via Telnet
- Krenn
  Thu Jul 30 13:16:49 2026
  from Sydney, Nsw via Telnet
- Bob Worm
  Thu Jul 30 09:03:28 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:47:34 2026
  from Wales, Uk via Telnet
- Bob Worm
  Thu Jul 30 08:36:06 2026
  from Wales, Uk via Telnet
- Rixter
  Thu Jul 30 02:32:09 2026
  from Madison, Nc via Telnet
- Bob Worm
  Wed Jul 29 22:26:45 2026
  from Wales, Uk via Telnet
- Zenobyte
  Wed Jul 29 21:08:05 2026
  from San Juan, Pr via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	94:13:56
Calls:	12,456
Calls today:	6
Files:	15,197
Messages:	6,537,915

eventcount

Who's Online

Recent Visitors

System Info