Forum: >>> Magnum BBS <<<

smrproxy v2

From jseigh@21:1/5 to All on Thu Oct 17 08:10:20 2024

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Thu Oct 17 17:08:04 2024

On 10/17/24 16:10, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I have to take a look at it! Been really busy lately. Shit happens.

There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway if it's new, another algorithm to use without
attribution.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Fri Oct 18 08:07:11 2024

On 10/17/24 19:40, Chris M. Thomasson wrote:

On 10/17/2024 2:08 PM, jseigh wrote:

On 10/17/24 16:10, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I have to take a look at it! Been really busy lately. Shit happens.

There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway if it's new, another algorithm to use without
attribution.

Interesting. From a quick view, it kind of reminds me of a distributed seqlock for some reason. Are you using an asymmetric membar in here? in smr_poll ?

Yes, linux membarrier() in smr_poll.

Not seqlock, not least for the reason that exiting the critical region
is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.

A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.

Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible. The qsbr logic was mostly ripped out but there were still
some pieces there.

Anyway I'm working a c++ version which involves a lot of extra work
besides just rewriting smrproxy. There coming up with an api for
proxies and testcases which tend to be more work than the code that
they are testing.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Fri Oct 25 18:56:16 2024

On 10/25/24 18:00, Chris M. Thomasson wrote:

On 10/18/2024 5:07 AM, jseigh wrote:

On 10/17/24 19:40, Chris M. Thomasson wrote:

On 10/17/2024 2:08 PM, jseigh wrote:

On 10/17/24 16:10, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free >>>>>> instead of mostly wait-free. The reader lock logic after loading >>>>>> the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting >>>>>> it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I have to take a look at it! Been really busy lately. Shit happens.

There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway if it's new, another algorithm to use without
attribution.

Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric
membar in here? in smr_poll ?

Yes, linux membarrier() in smr_poll.

Not seqlock, not least for the reason that exiting the critical region
is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.

A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.

Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible. The qsbr logic was mostly ripped out but there were still
some pieces there.

Anyway I'm working a c++ version which involves a lot of extra work
besides just rewriting smrproxy. There coming up with an api for
proxies and testcases which tend to be more work than the code that
they are testing.

Damn! I almost missed this post! Fucking Thunderbird... Will get back to
you. Working on something else right now Joe, thanks.

https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/

No problem. The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on on rust because I though it was
more complicated than it needed to be.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Sun Oct 27 18:29:43 2024

On 10/27/24 15:33, Chris M. Thomasson wrote:

On 10/25/2024 3:56 PM, jseigh wrote:

On 10/25/24 18:00, Chris M. Thomasson wrote:

On 10/18/2024 5:07 AM, jseigh wrote:

On 10/17/24 19:40, Chris M. Thomasson wrote:

On 10/17/2024 2:08 PM, jseigh wrote:

On 10/17/24 16:10, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait- >>>>>>>> free
instead of mostly wait-free. The reader lock logic after loading >>>>>>>> the address of the reader lock object into a register is now 2 >>>>>>>> instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting >>>>>>>> it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent. >>>>>>>> Though I suppose you could argue it's qsbr if I point out what >>>>>>>> the quiescent states are.

I have to take a look at it! Been really busy lately. Shit happens. >>>>>>>

There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway if it's new, another algorithm to use without
attribution.

Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric
membar in here? in smr_poll ?

Yes, linux membarrier() in smr_poll.

Not seqlock, not least for the reason that exiting the critical region >>>> is 3 instructions unless you use atomics which are expensive and have
memory barriers usually.

A lot of the qsbr and ebr reader lock/unlock code is going to look
somewhat similar so you have to know how the reclaim logic uses it.
In this case I am slingshotting off of the asymmetric memory barrier.

Earlier at one point I was going to have smrproxy use hazard pointer
logic or qsbr logic as a config option, but the extra code complexity
and the fact that qsbr required 2 grace periods kind of made that
unfeasible. The qsbr logic was mostly ripped out but there were still >>>> some pieces there.

Anyway I'm working a c++ version which involves a lot of extra work
besides just rewriting smrproxy. There coming up with an api for
proxies and testcases which tend to be more work than the code that
they are testing.

Damn! I almost missed this post! Fucking Thunderbird... Will get back
to you. Working on something else right now Joe, thanks.

https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/

No problem. The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on on rust because I though it was
more complicated than it needed to be.

Never even tried Rust, shit, I am behind the times. ;^)

Humm... I don't think we can get 100% C++ because of the damn asymmetric membar for these rather "specialized" algorithms?

Is C++ thinking about creating a standard way to gain an asymmetric membar?

I don't think so. It's platform dependent. Apart from linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers) which involves IPI calls
to all the processors and those have memory barriers. This is
old, 1973, patent 3,947,823 cited by the patent I did.

Anyway, I version the code so there's a asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Sun Oct 27 20:35:54 2024

On 10/27/24 18:32, Chris M. Thomasson wrote:

On 10/27/2024 3:29 PM, jseigh wrote:

On 10/27/24 15:33, Chris M. Thomasson wrote:

On 10/25/2024 3:56 PM, jseigh wrote:

On 10/25/24 18:00, Chris M. Thomasson wrote:

On 10/18/2024 5:07 AM, jseigh wrote:

On 10/17/24 19:40, Chris M. Thomasson wrote:

On 10/17/2024 2:08 PM, jseigh wrote:

On 10/17/24 16:10, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now >>>>>>>>>> wait- free
instead of mostly wait-free. The reader lock logic after loading >>>>>>>>>> the address of the reader lock object into a register is now 2 >>>>>>>>>> instructions a load followed by a store. The unlock is same >>>>>>>>>> as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting >>>>>>>>>> it to c++ and don't want to waste any more time on c version. >>>>>>>>>>
No idea of it's a new algorithm. I suspect that since I use >>>>>>>>>> the term epoch that it will be claimed that it's ebr, epoch >>>>>>>>>> based reclamation, and that all ebr algorithms are equivalent. >>>>>>>>>> Though I suppose you could argue it's qsbr if I point out what >>>>>>>>>> the quiescent states are.

I have to take a look at it! Been really busy lately. Shit
happens.

There's a quick and dirty explanation at
http://threadnought.wordpress.com/

repo at https://github.com/jseigh/smrproxy

I'll need to create some memory access diagrams that
visualize how it works at some point.

Anyway if it's new, another algorithm to use without
attribution.

Interesting. From a quick view, it kind of reminds me of a
distributed seqlock for some reason. Are you using an asymmetric >>>>>>> membar in here? in smr_poll ?

Yes, linux membarrier() in smr_poll.

Not seqlock, not least for the reason that exiting the critical
region
is 3 instructions unless you use atomics which are expensive and have >>>>>> memory barriers usually.

A lot of the qsbr and ebr reader lock/unlock code is going to look >>>>>> somewhat similar so you have to know how the reclaim logic uses it. >>>>>> In this case I am slingshotting off of the asymmetric memory barrier. >>>>>>
Earlier at one point I was going to have smrproxy use hazard pointer >>>>>> logic or qsbr logic as a config option, but the extra code complexity >>>>>> and the fact that qsbr required 2 grace periods kind of made that
unfeasible. The qsbr logic was mostly ripped out but there were
still
some pieces there.

Anyway I'm working a c++ version which involves a lot of extra work >>>>>> besides just rewriting smrproxy. There coming up with an api for >>>>>> proxies and testcases which tend to be more work than the code that >>>>>> they are testing.

Damn! I almost missed this post! Fucking Thunderbird... Will get
back to you. Working on something else right now Joe, thanks.

https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/

No problem. The c++ work is progressing pretty slowly, not least in
part because the documentation is not always clear as to what
something does or even what problem it is supposed to solve.
To think I took a pass on on rust because I though it was
more complicated than it needed to be.

Never even tried Rust, shit, I am behind the times. ;^)

Humm... I don't think we can get 100% C++ because of the damn
asymmetric membar for these rather "specialized" algorithms?

Is C++ thinking about creating a standard way to gain an asymmetric
membar?

I don't think so. It's platform dependent. Apart from linux, mostly
it's done with a call to some virtual memory function that flushes
the TLBs (translation lookaside buffers) which involves IPI calls
to all the processors and those have memory barriers. This is
old, 1973, patent 3,947,823 cited by the patent I did.

Anyway, I version the code so there's a asymmetric memory barrier
version and an explicit memory barrier version, the latter
being much slower.

Ahh, nice! acquire/release, no seq_cst, right? ;^)

The membar version? That's a store/load membar so it is expensive.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Mon Oct 28 07:45:15 2024

On 10/28/24 00:02, Chris M. Thomasson wrote:

On 10/27/2024 5:35 PM, jseigh wrote:

On 10/27/24 18:32, Chris M. Thomasson wrote:

The membar version? That's a store/load membar so it is expensive.

I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I say
C++, I mean pure C++, no calls to FlushProcessWriteBuffers and things
like that.

I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?

While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.

For cmpxchg, it has full cst_seq. For other rmw atomics I don't
know. I have to ask on c.a. I think some data dependency and/or
control dependency might factor in.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Mon Oct 28 21:17:55 2024

On 10/28/24 17:57, Chris M. Thomasson wrote:

On 10/28/2024 4:45 AM, jseigh wrote:

On 10/28/24 00:02, Chris M. Thomasson wrote:

On 10/27/2024 5:35 PM, jseigh wrote:

On 10/27/24 18:32, Chris M. Thomasson wrote:

The membar version? That's a store/load membar so it is expensive.

I was wondering in your c++ version if you had to use any seq_cst
barriers. I think acquire/release should be good enough. Now, when I
say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
things like that.

I take it that your pure C++ version has no atomic RMW, right? Just
loads and stores?

While a lock action has acquire memory order semantics, if the
implementation has internal stores, you have to those stores
are complete before any access from the critical section.
So you may need a store/load memory barrier.

Wrt acquiring a lock the only class of mutex logic that comes to mind
that requires an explicit storeload style membar is Petersons, and some others along those lines, so to speak. This is for the store and load version. Now, RMW on x86 basically implies a StoreLoad wrt the LOCK
prefix, XCHG aside for it has an implied LOCK prefix. For instance the original SMR algo requires a storeload as is on x86/x64. MFENCE or LOCK prefix.

Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in RMO
mode. On x86, the LOCK prefix handles that wrt the RMW's themselves.
This is a lot different than using stores and loads. The original SMR
and Peterson's algo needs that "store followed by a load to a different location" action to hold true, aka, storeload...

Now, I don't think that a data-dependant load can act like a storeload.
I thought that they act sort of like an acquire, aka #LoadStore |
#LoadLoad wrt SPARC. SPARC in RMO mode honors data-dependencies. Now,
the DEC Alpha is a different story... ;^)

fwiw, here's the lock and unlock logic from smrproxy rewrite

inline void lock()
{
epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
_ref_epoch.store(_epoch, std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acquire);
}

inline void unlock()
{
_ref_epoch.store(0, std::memory_order_release);
}

epoch_t is interesting. It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n

x1 < (x1 + n)

for any value of x1 and any value of n from 0 to 2**63;
eg.
0xfffffffffffffff0 < 0x0000000000000001

The rewrite is almost complete except for some thread_local
stuff. I think I might break off there. Most of the
additional work is writing the test code. I'm considering
rewriting it in Rust.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Tue Oct 29 07:27:40 2024

On 10/29/24 00:38, Chris M. Thomasson wrote:

On 10/28/2024 6:17 PM, jseigh wrote:

On 10/28/24 17:57, Chris M. Thomasson wrote:

On 10/28/2024 4:45 AM, jseigh wrote:

fwiw, here's the lock and unlock logic from smrproxy rewrite

     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);
         std::atomic_signal_fence(std::memory_order_acquire);
     }

     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }

epoch_t is interesting. It's uint64_t but handles wrapped
compares, ie. for an epoch_t x1 and uint64_t n

Only your single polling thread can mutate the shadow_epoch, right?

Yes. It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load. So one dependent load and
same cache line.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Tue Oct 29 07:23:56 2024

On 10/29/24 00:35, Chris M. Thomasson wrote:

On 10/28/2024 6:17 PM, jseigh wrote:

On 10/28/24 17:57, Chris M. Thomasson wrote:

On 10/28/2024 4:45 AM, jseigh wrote:

fwiw, here's the lock and unlock logic from smrproxy rewrite

     inline void lock()
     {
         epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
         _ref_epoch.store(_epoch, std::memory_order_relaxed);

         std::atomic_signal_fence(std::memory_order_acquire);

^^^^^^^^^^^^^^^^^^^^^^

     }

Still don't know how your pure C++ write up can handle this without an std::atomic_thread_fence(std::memory_order_acquire).

No thread fence is necessary. The loads can move before
the store. They just can't move before the async
membar. After that membar any previously retired
objects are no longer reachable.

     inline void unlock()
     {
         _ref_epoch.store(0, std::memory_order_release);
     }

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:39:38 2024

On 10/29/24 18:05, Chris M. Thomasson wrote:

On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:

Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will work. I don't see why it would not work.

For some reason, I thought you were going to not use an async membar in
your C++ version. Sorry. However, it still would be fun to test
against... ;^)

The C version has both versions. The C++ version does only the
async member version. But I'm not publishing that code so it's
a moot point.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:36:45 2024

On 10/29/24 17:56, Chris M. Thomasson wrote:

On 10/29/2024 4:27 AM, jseigh wrote:

Yes. It's just an optimization. The reader threads could read
from the global epoch but it would be in a separate cache line
and be an extra dependent load. So one dependent load and
same cache line.

Are you taking advantage of the fancy alignment capabilities of C++?

https://en.cppreference.com/w/cpp/language/alignas

and friends? They seem to work fine wrt the last time I checked them.

It's nice to have a standard way to pad and align on cache line
boundaries. :^)

It's target processor dependent. You need to query the cache line size
and pass to compile as a define variable.

There's supposed to be built in define that would be target system
dependent but it's not implemented.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:43:09 2024

On 10/30/24 02:54, Chris M. Thomasson wrote:

On 10/17/2024 5:10 AM, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

Joe, can you call dtors for nodes after a single epoch?

Yes since I can check that epochs are being referenced
directly instead of indirectly via quiescent states.
It's the inferring from events vs. conditions thing.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Mon Nov 4 07:46:37 2024

On 11/4/24 00:14, Chris M. Thomasson wrote:

On 10/30/2024 9:39 AM, jseigh wrote:

On 10/29/24 18:05, Chris M. Thomasson wrote:

On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:

Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.

For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)

The C version has both versions. The C++ version does only the
async member version. But I'm not publishing that code so it's
a moot point.

I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)

That's never going to happen. DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that. So async memory barriers won't
happen any time soon either.

Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From [email protected]@21:1/5 to All on Mon Nov 4 14:09:14 2024

On Mon, 4 Nov 2024 07:46:37 -0500
jseigh <[email protected]> boring babbled:

On 11/4/24 00:14, Chris M. Thomasson wrote:

On 10/30/2024 9:39 AM, jseigh wrote:

On 10/29/24 18:05, Chris M. Thomasson wrote:

On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:

Ahhh, if you are using an async membar in your upcoming C++ version,
then it would be fine. No problem. A compiler fence ala
atomic_signal_fence, and the the explicit release, well, it will
work. I don't see why it would not work.

For some reason, I thought you were going to not use an async membar
in your C++ version. Sorry. However, it still would be fun to test
against... ;^)

The C version has both versions. The C++ version does only the
async member version. But I'm not publishing that code so it's
a moot point.

I got side tracked with more heavy math. The problem with C++ code that
uses an async memory barrier is that its automatically rendered into a
non-portable state... Yikes! Imvvvvvho, C/C++ should think about
including them in some future standard. It would be nice. Well, for us
at least! ;^)

That's never going to happen. DWCAS has been around for more than
50 years and c++ doesn't support that and probably never will.
You can't write lock-free queues that are ABA free and
are performant without that. So async memory barriers won't
happen any time soon either.

Long term I think c++ will fade into irrelevance along with
all the other programming languages based on an imperfect
knowledge of concurrency, which is basically all of them
right now.

Given most concurrent operating systems are written in these "imperfect" languages how does that square with your definition? And how would your
perfect language run on them?

Anyway, concurrency is the job of the OS, not the language. C++ threading is just a wrapper around pthreads on *nix and windows threads on Windows. The language just needs an interface to the underlying OS functionality, it should not try and implement the functionality itself as it'll always be a hack.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Thu Nov 21 15:17:32 2024

On 10/17/24 08:10, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type. You can do apple to apple comparisons. I
realize that's the complete antithesis of current
programming practices but there you have it. :)

A bit of clean up and performance tuning now.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Sat Nov 23 11:10:46 2024

On 11/21/24 15:17, jseigh wrote:

On 10/17/24 08:10, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type. You can do apple to apple comparisons. I
realize that's the complete antithesis of current
programming practices but there you have it. :)

A bit of clean up and performance tuning now.

Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Sat Nov 23 17:12:56 2024

On 11/23/24 16:31, Chris M. Thomasson wrote:

On 11/23/2024 8:10 AM, jseigh wrote:

On 11/21/24 15:17, jseigh wrote:

On 10/17/24 08:10, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type. You can do apple to apple comparisons. I
realize that's the complete antithesis of current
programming practices but there you have it. :)

A bit of clean up and performance tuning now.

Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

Nice! Are you using pthread_getspecific or tss_get in you C version?

Just thread_local.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Sun Nov 24 15:14:58 2024

On 11/23/24 11:10, jseigh wrote:

On 11/21/24 15:17, jseigh wrote:

On 10/17/24 08:10, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free
instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type. You can do apple to apple comparisons. I
realize that's the complete antithesis of current
programming practices but there you have it. :)

A bit of clean up and performance tuning now.

Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking. For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were. You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Sun Nov 24 19:09:05 2024

On 11/24/24 18:19, Chris M. Thomasson wrote:

On 11/24/2024 12:14 PM, jseigh wrote:

On 11/23/24 11:10, jseigh wrote:

On 11/21/24 15:17, jseigh wrote:

On 10/17/24 08:10, jseigh wrote:

I replaced the hazard pointer logic in smrproxy. It's now wait-free >>>>> instead of mostly wait-free. The reader lock logic after loading
the address of the reader lock object into a register is now 2
instructions a load followed by a store. The unlock is same
as before, just a store.

It's way faster now.

It's on the feature/003 branch as a POC. I'm working on porting
it to c++ and don't want to waste any more time on c version.

No idea of it's a new algorithm. I suspect that since I use
the term epoch that it will be claimed that it's ebr, epoch
based reclamation, and that all ebr algorithms are equivalent.
Though I suppose you could argue it's qsbr if I point out what
the quiescent states are.

I got a port to c++ working now. There are 5 proxy implementations
1) smrproxy v2
2) arcproxy - reference counted proxy
3) rwlock based proxy
4) mutex based proxy
5) an unsafe proxy with no locking

The testcase is templated so you can use any of the
5 proxy implementations without rewriting for each proxy
type. You can do apple to apple comparisons. I
realize that's the complete antithesis of current
programming practices but there you have it. :)

A bit of clean up and performance tuning now.

Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking. For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were. You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.

I remember back in the day when I was comparing and contrasting various lock/wait-free algorithms with their 100% lock-based counter parts. Some
of the lock-based tests too so long that I just terminated the damn
program. Iirc, a lock-free test would take around 5 minutes. The lock-
based test would be around 30+ minutes. This was way back on c.p.t.

I set the iteration count as a parameter. Mutex can be particularly
slow with a lot of reader threads. I usually see about 1000 - 10000
times slower than smrproxy. rwlocks aren't as bad, about 200 x
slower.

Mutex, rwlock, and arcproxy use interlocked instructions so you
can get a really wide performance range based on cache geometry
and processor sets you run on.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Mon Nov 25 06:31:32 2024

On 11/24/24 19:47, Chris M. Thomasson wrote:

On 11/24/2024 4:09 PM, jseigh wrote:

...

Actually, I remember way back where a scenario that had a lot of writes
would start to mess with a deferred reclamation wrt a polling thread
type of “scheme”. Too many deferred nodes would start to "pile up". Basically, the single polling thread was having trouble keeping up with
all of them. The interlocked versions seemed to perform sort of "better"
in a sense during periods that had a lot of frequent “writes”. Of course clever use of node caches helps heavy write periods. Anyway, some of the tests just used a mutex for writes, others used lock-free and would
generate high loads of them that would push and pop nodes and defer them
to the poll thread to test how much load it (poll thread) could take.

Using deferred reclamation to implement a lock-free queue will slow
things down. Also cache line thrashing will limit how fast queue
operations will go no matter what.

Better to parallelize. You can take a really bad lock-free queue implementation, parallelize it, and then tell everyone how fast
your queue is.

Single queue at 1 op/sec.
Parallel queue on 1000000 processors, 1000000 op/sec.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Wed Nov 27 10:29:05 2024

On 11/24/24 15:14, jseigh wrote:

On 11/23/24 11:10, jseigh wrote:

On 11/21/24 15:17, jseigh wrote:

Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
about what the C version was.

I've been using cpu time to measure performance. That's ok
for lock-free/wait-free locking. For normal mutexes and
shared locks, it doesn't measure wait time so those didn't
look as bad as they really were. You can add logic
to measure how long it takes to acquire a lock but that
adds significant overhead.

Some timings with 128 reader threads

unsafe 52.983 nsecs ( 0.000) 860.576 nsecs ( 0.000)
smr 54.714 nsecs ( 1.732) 882.356 nsecs ( 21.780) smrlite 53.149 nsecs ( 0.166) 870.066 nsecs ( 9.490)
arc 739.833 nsecs ( 686.850) 11,988.289 nsecs ( 11,127.713)
rwlock 1,078.306 nsecs ( 1,025.323) 17,309.882 nsecs ( 16,449.306)
mutex 3,203.034 nsecs ( 3,150.052) 51,479.407 nsecs ( 50,618.831)

The first column is cpu time, third column is elapsed time.
unsafe is without any synchronized reader access. The
value in parentheses is the unsafe access time subtracted
out to separate out the synchronization overheads. smrlite is
smr proxy with thread_local overhead. So smrproxy lock/unlock
by itself is about 0.1 - 0.2 nanoseconds.

I'm going to drop working on the whole proxy interface thing. The
application can decide if it wants to hardcode a dependency on a
particular 3rd party libarary implementation or abstract it out
to a more portable api.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Mon Dec 9 13:34:11 2024

On 11/27/24 10:29, jseigh wrote:

Some timings with 128 reader threads

unsafe        52.983 nsecs (      0.000)     860.576 nsecs (      0.000)
smr           54.714 nsecs (      1.732)     882.356 nsecs (     21.780)
smrlite       53.149 nsecs (      0.166)     870.066 nsecs (      9.490)
arc          739.833 nsecs (    686.850) 11,988.289 nsecs ( 11,127.713)
rwlock     1,078.306 nsecs ( 1,025.323) 17,309.882 nsecs ( 16,449.306)
mutex      3,203.034 nsecs ( 3,150.052) 51,479.407 nsecs ( 50,618.831)

The first column is cpu time, third column is elapsed time.
unsafe is without any synchronized reader access.   The
value in parentheses is the unsafe access time subtracted
out to separate out the synchronization overheads. smrlite is
smr proxy with thread_local overhead. So smrproxy lock/unlock
by itself is about 0.1 - 0.2 nanoseconds.

I'm going to drop working on the whole proxy interface thing. The application can decide if it wants to hardcode a dependency on a
particular 3rd party libarary implementation or abstract it out
to a more portable api.

I figured out where the smr vs smrlite overhead is likely coming from.

1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
2) overhead from lazy initialization, about .6 nsecs.

smrlite most of the time doesn't show any measurable overhead,
0 nsecs.

Theoretically, you could do do lazy initialization with zero
runtime overhead, but for most c++ apps, 1 millisecond is
considered fast, so I don't think there would be much interest
in it.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to jseigh on Wed Dec 11 07:13:30 2024

On 12/9/24 13:34, jseigh wrote:

On 11/27/24 10:29, jseigh wrote:

I'm going to drop working on the whole proxy interface thing. The
application can decide if it wants to hardcode a dependency on a
particular 3rd party libarary implementation or abstract it out
to a more portable api.

I figured out where the smr vs smrlite overhead is likely coming from.

1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
2) overhead from lazy initialization, about .6 nsecs.

smrlite most of the time doesn't show any measurable overhead,
0 nsecs.

Fixed the whole thread_local issue. Just don't use it. I
went back to the api I was using in C and leave it up to
the app to keep track of thread local objects. Fixes the
problem of multiple proxies needing multiple per thread
objects. This is a language issue, not an api issue.

Theoretically, you could do do lazy initialization with zero
runtime overhead, but for most c++ apps, 1 millisecond is
considered fast, so I don't think there would be much interest
in it.

And apparantly I am right. :)

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 12 07:29:11 2024

On 12/12/24 03:43, Chris M. Thomasson wrote:

On 11/4/2024 4:46 AM, jseigh wrote:

Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?

That kind of stuff pre-dates c.p.t. even.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 12 19:48:12 2024

On 12/12/24 16:55, Chris M. Thomasson wrote:

On 12/12/2024 4:29 AM, jseigh wrote:

On 12/12/24 03:43, Chris M. Thomasson wrote:

On 11/4/2024 4:46 AM, jseigh wrote:

Hummm... If I remember correctly, you said something about using a
simple atomic exchange to pop a whole list (lock-free stack), then
simple reversing the list to get a fifo order? Do you remember any of
that way back on c.p.t?

That kind of stuff pre-dates c.p.t. even.

Has to. Although, it was not in the IBM principles of operation where
the describe their lock-free stack in an appendix, iirc... I cannot
remember the exact one right now. Iirc, it was under free pool
manipulation?

If you pop off a LIFO stack onto another LIFO, everything becomes FIFO.

Also, there's a doubly linked stack where the back links are lazily initialized. That seems to get reinvented every so often.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	76:48:20
Calls:	12,450
Files:	15,194
Messages:	6,537,676

smrproxy v2

Who's Online

System Info