• smrproxy v2

    From jseigh@21:1/5 to All on Thu Oct 17 08:10:20 2024
    I replaced the hazard pointer logic in smrproxy. It's now wait-free
    instead of mostly wait-free. The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store. The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC. I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm. I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Oct 17 17:08:04 2024
    On 10/17/24 16:10, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.

    I have to take a look at it! Been really busy lately. Shit happens.


    There's a quick and dirty explanation at
    http://threadnought.wordpress.com/

    repo at https://github.com/jseigh/smrproxy

    I'll need to create some memory access diagrams that
    visualize how it works at some point.

    Anyway if it's new, another algorithm to use without
    attribution.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Fri Oct 18 08:07:11 2024
    On 10/17/24 19:40, Chris M. Thomasson wrote:
    On 10/17/2024 2:08 PM, jseigh wrote:
    On 10/17/24 16:10, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.

    I have to take a look at it! Been really busy lately. Shit happens.


    There's a quick and dirty explanation at
    http://threadnought.wordpress.com/

    repo at https://github.com/jseigh/smrproxy

    I'll need to create some memory access diagrams that
    visualize how it works at some point.

    Anyway if it's new, another algorithm to use without
    attribution.

    Interesting. From a quick view, it kind of reminds me of a distributed seqlock for some reason. Are you using an asymmetric membar in here? in smr_poll ?

    Yes, linux membarrier() in smr_poll.

    Not seqlock, not least for the reason that exiting the critical region
    is 3 instructions unless you use atomics which are expensive and have
    memory barriers usually.

    A lot of the qsbr and ebr reader lock/unlock code is going to look
    somewhat similar so you have to know how the reclaim logic uses it.
    In this case I am slingshotting off of the asymmetric memory barrier.

    Earlier at one point I was going to have smrproxy use hazard pointer
    logic or qsbr logic as a config option, but the extra code complexity
    and the fact that qsbr required 2 grace periods kind of made that
    unfeasible. The qsbr logic was mostly ripped out but there were still
    some pieces there.

    Anyway I'm working a c++ version which involves a lot of extra work
    besides just rewriting smrproxy. There coming up with an api for
    proxies and testcases which tend to be more work than the code that
    they are testing.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Fri Oct 25 18:56:16 2024
    On 10/25/24 18:00, Chris M. Thomasson wrote:
    On 10/18/2024 5:07 AM, jseigh wrote:
    On 10/17/24 19:40, Chris M. Thomasson wrote:
    On 10/17/2024 2:08 PM, jseigh wrote:
    On 10/17/24 16:10, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free >>>>>> instead of mostly wait-free.  The reader lock logic after loading >>>>>> the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting >>>>>> it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.

    I have to take a look at it! Been really busy lately. Shit happens.


    There's a quick and dirty explanation at
    http://threadnought.wordpress.com/

    repo at https://github.com/jseigh/smrproxy

    I'll need to create some memory access diagrams that
    visualize how it works at some point.

    Anyway if it's new, another algorithm to use without
    attribution.

    Interesting. From a quick view, it kind of reminds me of a
    distributed seqlock for some reason. Are you using an asymmetric
    membar in here? in smr_poll ?

    Yes, linux membarrier() in smr_poll.

    Not seqlock, not least for the reason that exiting the critical region
    is 3 instructions unless you use atomics which are expensive and have
    memory barriers usually.

    A lot of the qsbr and ebr reader lock/unlock code is going to look
    somewhat similar so you have to know how the reclaim logic uses it.
    In this case I am slingshotting off of the asymmetric memory barrier.

    Earlier at one point I was going to have smrproxy use hazard pointer
    logic or qsbr logic as a config option, but the extra code complexity
    and the fact that qsbr required 2 grace periods kind of made that
    unfeasible.  The qsbr logic was mostly ripped out but there were still
    some pieces there.

    Anyway I'm working a c++ version which involves a lot of extra work
    besides just rewriting smrproxy.  There coming up with an api for
    proxies and testcases which tend to be more work than the code that
    they are testing.

    Damn! I almost missed this post! Fucking Thunderbird... Will get back to
    you. Working on something else right now Joe, thanks.

    https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/


    No problem. The c++ work is progressing pretty slowly, not least in
    part because the documentation is not always clear as to what
    something does or even what problem it is supposed to solve.
    To think I took a pass on on rust because I though it was
    more complicated than it needed to be.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Oct 27 18:29:43 2024
    On 10/27/24 15:33, Chris M. Thomasson wrote:
    On 10/25/2024 3:56 PM, jseigh wrote:
    On 10/25/24 18:00, Chris M. Thomasson wrote:
    On 10/18/2024 5:07 AM, jseigh wrote:
    On 10/17/24 19:40, Chris M. Thomasson wrote:
    On 10/17/2024 2:08 PM, jseigh wrote:
    On 10/17/24 16:10, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait- >>>>>>>> free
    instead of mostly wait-free.  The reader lock logic after loading >>>>>>>> the address of the reader lock object into a register is now 2 >>>>>>>> instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting >>>>>>>> it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent. >>>>>>>> Though I suppose you could argue it's qsbr if I point out what >>>>>>>> the quiescent states are.

    I have to take a look at it! Been really busy lately. Shit happens. >>>>>>>

    There's a quick and dirty explanation at
    http://threadnought.wordpress.com/

    repo at https://github.com/jseigh/smrproxy

    I'll need to create some memory access diagrams that
    visualize how it works at some point.

    Anyway if it's new, another algorithm to use without
    attribution.

    Interesting. From a quick view, it kind of reminds me of a
    distributed seqlock for some reason. Are you using an asymmetric
    membar in here? in smr_poll ?

    Yes, linux membarrier() in smr_poll.

    Not seqlock, not least for the reason that exiting the critical region >>>> is 3 instructions unless you use atomics which are expensive and have
    memory barriers usually.

    A lot of the qsbr and ebr reader lock/unlock code is going to look
    somewhat similar so you have to know how the reclaim logic uses it.
    In this case I am slingshotting off of the asymmetric memory barrier.

    Earlier at one point I was going to have smrproxy use hazard pointer
    logic or qsbr logic as a config option, but the extra code complexity
    and the fact that qsbr required 2 grace periods kind of made that
    unfeasible.  The qsbr logic was mostly ripped out but there were still >>>> some pieces there.

    Anyway I'm working a c++ version which involves a lot of extra work
    besides just rewriting smrproxy.  There coming up with an api for
    proxies and testcases which tend to be more work than the code that
    they are testing.

    Damn! I almost missed this post! Fucking Thunderbird... Will get back
    to you. Working on something else right now Joe, thanks.

    https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/


    No problem.  The c++ work is progressing pretty slowly, not least in
    part because the documentation is not always clear as to what
    something does or even what problem it is supposed to solve.
    To think I took a pass on on rust because I though it was
    more complicated than it needed to be.

    Never even tried Rust, shit, I am behind the times. ;^)

    Humm... I don't think we can get 100% C++ because of the damn asymmetric membar for these rather "specialized" algorithms?

    Is C++ thinking about creating a standard way to gain an asymmetric membar?

    I don't think so. It's platform dependent. Apart from linux, mostly
    it's done with a call to some virtual memory function that flushes
    the TLBs (translation lookaside buffers) which involves IPI calls
    to all the processors and those have memory barriers. This is
    old, 1973, patent 3,947,823 cited by the patent I did.

    Anyway, I version the code so there's a asymmetric memory barrier
    version and an explicit memory barrier version, the latter
    being much slower.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Oct 27 20:35:54 2024
    On 10/27/24 18:32, Chris M. Thomasson wrote:
    On 10/27/2024 3:29 PM, jseigh wrote:
    On 10/27/24 15:33, Chris M. Thomasson wrote:
    On 10/25/2024 3:56 PM, jseigh wrote:
    On 10/25/24 18:00, Chris M. Thomasson wrote:
    On 10/18/2024 5:07 AM, jseigh wrote:
    On 10/17/24 19:40, Chris M. Thomasson wrote:
    On 10/17/2024 2:08 PM, jseigh wrote:
    On 10/17/24 16:10, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now >>>>>>>>>> wait- free
    instead of mostly wait-free.  The reader lock logic after loading >>>>>>>>>> the address of the reader lock object into a register is now 2 >>>>>>>>>> instructions a load followed by a store.  The unlock is same >>>>>>>>>> as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting >>>>>>>>>> it to c++ and don't want to waste any more time on c version. >>>>>>>>>>
    No idea of it's a new algorithm.  I suspect that since I use >>>>>>>>>> the term epoch that it will be claimed that it's ebr, epoch >>>>>>>>>> based reclamation, and that all ebr algorithms are equivalent. >>>>>>>>>> Though I suppose you could argue it's qsbr if I point out what >>>>>>>>>> the quiescent states are.

    I have to take a look at it! Been really busy lately. Shit
    happens.


    There's a quick and dirty explanation at
    http://threadnought.wordpress.com/

    repo at https://github.com/jseigh/smrproxy

    I'll need to create some memory access diagrams that
    visualize how it works at some point.

    Anyway if it's new, another algorithm to use without
    attribution.

    Interesting. From a quick view, it kind of reminds me of a
    distributed seqlock for some reason. Are you using an asymmetric >>>>>>> membar in here? in smr_poll ?

    Yes, linux membarrier() in smr_poll.

    Not seqlock, not least for the reason that exiting the critical
    region
    is 3 instructions unless you use atomics which are expensive and have >>>>>> memory barriers usually.

    A lot of the qsbr and ebr reader lock/unlock code is going to look >>>>>> somewhat similar so you have to know how the reclaim logic uses it. >>>>>> In this case I am slingshotting off of the asymmetric memory barrier. >>>>>>
    Earlier at one point I was going to have smrproxy use hazard pointer >>>>>> logic or qsbr logic as a config option, but the extra code complexity >>>>>> and the fact that qsbr required 2 grace periods kind of made that
    unfeasible.  The qsbr logic was mostly ripped out but there were
    still
    some pieces there.

    Anyway I'm working a c++ version which involves a lot of extra work >>>>>> besides just rewriting smrproxy.  There coming up with an api for >>>>>> proxies and testcases which tend to be more work than the code that >>>>>> they are testing.

    Damn! I almost missed this post! Fucking Thunderbird... Will get
    back to you. Working on something else right now Joe, thanks.

    https://www.facebook.com/share/p/ydGSuPLDxjkY9TAQ/


    No problem.  The c++ work is progressing pretty slowly, not least in
    part because the documentation is not always clear as to what
    something does or even what problem it is supposed to solve.
    To think I took a pass on on rust because I though it was
    more complicated than it needed to be.

    Never even tried Rust, shit, I am behind the times. ;^)

    Humm... I don't think we can get 100% C++ because of the damn
    asymmetric membar for these rather "specialized" algorithms?

    Is C++ thinking about creating a standard way to gain an asymmetric
    membar?

    I don't think so.  It's platform dependent.  Apart from linux, mostly
    it's done with a call to some virtual memory function that flushes
    the TLBs (translation lookaside buffers) which involves IPI calls
    to all the processors and those have memory barriers.  This is
    old, 1973, patent 3,947,823 cited by the patent I did.

    Anyway, I version the code so there's a asymmetric memory barrier
    version and an explicit memory barrier version, the latter
    being much slower.

    Ahh, nice! acquire/release, no seq_cst, right? ;^)


    The membar version? That's a store/load membar so it is expensive.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Oct 28 07:45:15 2024
    On 10/28/24 00:02, Chris M. Thomasson wrote:
    On 10/27/2024 5:35 PM, jseigh wrote:
    On 10/27/24 18:32, Chris M. Thomasson wrote:


    The membar version?  That's a store/load membar so it is expensive.

    I was wondering in your c++ version if you had to use any seq_cst
    barriers. I think acquire/release should be good enough. Now, when I say
    C++, I mean pure C++, no calls to FlushProcessWriteBuffers and things
    like that.

    I take it that your pure C++ version has no atomic RMW, right? Just
    loads and stores?

    While a lock action has acquire memory order semantics, if the
    implementation has internal stores, you have to those stores
    are complete before any access from the critical section.
    So you may need a store/load memory barrier.

    For cmpxchg, it has full cst_seq. For other rmw atomics I don't
    know. I have to ask on c.a. I think some data dependency and/or
    control dependency might factor in.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Oct 28 21:17:55 2024
    On 10/28/24 17:57, Chris M. Thomasson wrote:
    On 10/28/2024 4:45 AM, jseigh wrote:
    On 10/28/24 00:02, Chris M. Thomasson wrote:
    On 10/27/2024 5:35 PM, jseigh wrote:
    On 10/27/24 18:32, Chris M. Thomasson wrote:


    The membar version?  That's a store/load membar so it is expensive.

    I was wondering in your c++ version if you had to use any seq_cst
    barriers. I think acquire/release should be good enough. Now, when I
    say C++, I mean pure C++, no calls to FlushProcessWriteBuffers and
    things like that.

    I take it that your pure C++ version has no atomic RMW, right? Just
    loads and stores?

    While a lock action has acquire memory order semantics, if the
    implementation has internal stores, you have to those stores
    are complete before any access from the critical section.
    So you may need a store/load memory barrier.

    Wrt acquiring a lock the only class of mutex logic that comes to mind
    that requires an explicit storeload style membar is Petersons, and some others along those lines, so to speak. This is for the store and load version. Now, RMW on x86 basically implies a StoreLoad wrt the LOCK
    prefix, XCHG aside for it has an implied LOCK prefix. For instance the original SMR algo requires a storeload as is on x86/x64. MFENCE or LOCK prefix.

    Fwiw, my experimental pure C++ proxy works fine with XADD, or atomic fetch-add. It needs an explicit membars (no #StoreLoad) on SPARC in RMO
    mode. On x86, the LOCK prefix handles that wrt the RMW's themselves.
    This is a lot different than using stores and loads. The original SMR
    and Peterson's algo needs that "store followed by a load to a different location" action to hold true, aka, storeload...

    Now, I don't think that a data-dependant load can act like a storeload.
    I thought that they act sort of like an acquire, aka #LoadStore |
    #LoadLoad wrt SPARC. SPARC in RMO mode honors data-dependencies. Now,
    the DEC Alpha is a different story... ;^)


    fwiw, here's the lock and unlock logic from smrproxy rewrite

    inline void lock()
    {
    epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
    _ref_epoch.store(_epoch, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acquire);
    }

    inline void unlock()
    {
    _ref_epoch.store(0, std::memory_order_release);
    }

    epoch_t is interesting. It's uint64_t but handles wrapped
    compares, ie. for an epoch_t x1 and uint64_t n

    x1 < (x1 + n)

    for any value of x1 and any value of n from 0 to 2**63;
    eg.
    0xfffffffffffffff0 < 0x0000000000000001


    The rewrite is almost complete except for some thread_local
    stuff. I think I might break off there. Most of the
    additional work is writing the test code. I'm considering
    rewriting it in Rust.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Oct 29 07:27:40 2024
    On 10/29/24 00:38, Chris M. Thomasson wrote:
    On 10/28/2024 6:17 PM, jseigh wrote:
    On 10/28/24 17:57, Chris M. Thomasson wrote:
    On 10/28/2024 4:45 AM, jseigh wrote:


    fwiw, here's the lock and unlock logic from smrproxy rewrite

         inline void lock()
         {
             epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
             _ref_epoch.store(_epoch, std::memory_order_relaxed);
             std::atomic_signal_fence(std::memory_order_acquire);
         }

         inline void unlock()
         {
             _ref_epoch.store(0, std::memory_order_release);
         }

    epoch_t is interesting.  It's uint64_t but handles wrapped
    compares, ie. for an epoch_t x1 and uint64_t n

    Only your single polling thread can mutate the shadow_epoch, right?


    Yes. It's just an optimization. The reader threads could read
    from the global epoch but it would be in a separate cache line
    and be an extra dependent load. So one dependent load and
    same cache line.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Tue Oct 29 07:23:56 2024
    On 10/29/24 00:35, Chris M. Thomasson wrote:
    On 10/28/2024 6:17 PM, jseigh wrote:
    On 10/28/24 17:57, Chris M. Thomasson wrote:
    On 10/28/2024 4:45 AM, jseigh wrote:


    fwiw, here's the lock and unlock logic from smrproxy rewrite

         inline void lock()
         {
             epoch_t _epoch = shadow_epoch.load(std::memory_order_relaxed);
             _ref_epoch.store(_epoch, std::memory_order_relaxed);


             std::atomic_signal_fence(std::memory_order_acquire);
    ^^^^^^^^^^^^^^^^^^^^^^

         }

    Still don't know how your pure C++ write up can handle this without an std::atomic_thread_fence(std::memory_order_acquire).

    No thread fence is necessary. The loads can move before
    the store. They just can't move before the async
    membar. After that membar any previously retired
    objects are no longer reachable.



         inline void unlock()
         {
             _ref_epoch.store(0, std::memory_order_release);
         }


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:39:38 2024
    On 10/29/24 18:05, Chris M. Thomasson wrote:
    On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:


    Ahhh, if you are using an async membar in your upcoming C++ version,
    then it would be fine. No problem. A compiler fence ala
    atomic_signal_fence, and the the explicit release, well, it will work. I don't see why it would not work.

    For some reason, I thought you were going to not use an async membar in
    your C++ version. Sorry. However, it still would be fun to test
    against... ;^)

    The C version has both versions. The C++ version does only the
    async member version. But I'm not publishing that code so it's
    a moot point.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:36:45 2024
    On 10/29/24 17:56, Chris M. Thomasson wrote:
    On 10/29/2024 4:27 AM, jseigh wrote:


    Yes.  It's just an optimization. The reader threads could read
    from the global epoch but it would be in a separate cache line
    and be an extra dependent load.  So one dependent load and
    same cache line.

    Are you taking advantage of the fancy alignment capabilities of C++?

    https://en.cppreference.com/w/cpp/language/alignas

    and friends? They seem to work fine wrt the last time I checked them.

    It's nice to have a standard way to pad and align on cache line
    boundaries. :^)

    It's target processor dependent. You need to query the cache line size
    and pass to compile as a define variable.

    There's supposed to be built in define that would be target system
    dependent but it's not implemented.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Oct 30 12:43:09 2024
    On 10/30/24 02:54, Chris M. Thomasson wrote:
    On 10/17/2024 5:10 AM, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.

    Joe, can you call dtors for nodes after a single epoch?

    Yes since I can check that epochs are being referenced
    directly instead of indirectly via quiescent states.
    It's the inferring from events vs. conditions thing.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Nov 4 07:46:37 2024
    On 11/4/24 00:14, Chris M. Thomasson wrote:
    On 10/30/2024 9:39 AM, jseigh wrote:
    On 10/29/24 18:05, Chris M. Thomasson wrote:
    On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:


    Ahhh, if you are using an async membar in your upcoming C++ version,
    then it would be fine. No problem. A compiler fence ala
    atomic_signal_fence, and the the explicit release, well, it will
    work. I don't see why it would not work.

    For some reason, I thought you were going to not use an async membar
    in your C++ version. Sorry. However, it still would be fun to test
    against... ;^)

    The C version has both versions.  The C++ version does only the
    async member version.  But I'm not publishing that code so it's
    a moot point.

    I got side tracked with more heavy math. The problem with C++ code that
    uses an async memory barrier is that its automatically rendered into a non-portable state... Yikes! Imvvvvvho, C/C++ should think about
    including them in some future standard. It would be nice. Well, for us
    at least! ;^)

    That's never going to happen. DWCAS has been around for more than
    50 years and c++ doesn't support that and probably never will.
    You can't write lock-free queues that are ABA free and
    are performant without that. So async memory barriers won't
    happen any time soon either.

    Long term I think c++ will fade into irrelevance along with
    all the other programming languages based on an imperfect
    knowledge of concurrency, which is basically all of them
    right now.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From [email protected]@21:1/5 to All on Mon Nov 4 14:09:14 2024
    On Mon, 4 Nov 2024 07:46:37 -0500
    jseigh <[email protected]> boring babbled:
    On 11/4/24 00:14, Chris M. Thomasson wrote:
    On 10/30/2024 9:39 AM, jseigh wrote:
    On 10/29/24 18:05, Chris M. Thomasson wrote:
    On 10/28/2024 9:41 PM, Chris M. Thomasson wrote:


    Ahhh, if you are using an async membar in your upcoming C++ version,
    then it would be fine. No problem. A compiler fence ala
    atomic_signal_fence, and the the explicit release, well, it will
    work. I don't see why it would not work.

    For some reason, I thought you were going to not use an async membar
    in your C++ version. Sorry. However, it still would be fun to test
    against... ;^)

    The C version has both versions.  The C++ version does only the
    async member version.  But I'm not publishing that code so it's
    a moot point.

    I got side tracked with more heavy math. The problem with C++ code that
    uses an async memory barrier is that its automatically rendered into a
    non-portable state... Yikes! Imvvvvvho, C/C++ should think about
    including them in some future standard. It would be nice. Well, for us
    at least! ;^)

    That's never going to happen. DWCAS has been around for more than
    50 years and c++ doesn't support that and probably never will.
    You can't write lock-free queues that are ABA free and
    are performant without that. So async memory barriers won't
    happen any time soon either.

    Long term I think c++ will fade into irrelevance along with
    all the other programming languages based on an imperfect
    knowledge of concurrency, which is basically all of them
    right now.

    Given most concurrent operating systems are written in these "imperfect" languages how does that square with your definition? And how would your
    perfect language run on them?

    Anyway, concurrency is the job of the OS, not the language. C++ threading is just a wrapper around pthreads on *nix and windows threads on Windows. The language just needs an interface to the underlying OS functionality, it should not try and implement the functionality itself as it'll always be a hack.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Thu Nov 21 15:17:32 2024
    On 10/17/24 08:10, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.


    I got a port to c++ working now. There are 5 proxy implementations
    1) smrproxy v2
    2) arcproxy - reference counted proxy
    3) rwlock based proxy
    4) mutex based proxy
    5) an unsafe proxy with no locking

    The testcase is templated so you can use any of the
    5 proxy implementations without rewriting for each proxy
    type. You can do apple to apple comparisons. I
    realize that's the complete antithesis of current
    programming practices but there you have it. :)

    A bit of clean up and performance tuning now.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Sat Nov 23 11:10:46 2024
    On 11/21/24 15:17, jseigh wrote:
    On 10/17/24 08:10, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.


    I got a port to c++ working now. There are 5 proxy implementations
    1) smrproxy v2
    2) arcproxy - reference counted proxy
    3) rwlock based proxy
    4) mutex based proxy
    5) an unsafe proxy with no locking

    The testcase is templated so you can use any of the
    5 proxy implementations without rewriting for each proxy
    type.  You can do apple to apple comparisons.  I
    realize that's the complete antithesis of current
    programming practices but there you have it.  :)

    A bit of clean up and performance tuning now.


    Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
    about what the C version was.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sat Nov 23 17:12:56 2024
    On 11/23/24 16:31, Chris M. Thomasson wrote:
    On 11/23/2024 8:10 AM, jseigh wrote:
    On 11/21/24 15:17, jseigh wrote:
    On 10/17/24 08:10, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.


    I got a port to c++ working now. There are 5 proxy implementations
    1) smrproxy v2
    2) arcproxy - reference counted proxy
    3) rwlock based proxy
    4) mutex based proxy
    5) an unsafe proxy with no locking

    The testcase is templated so you can use any of the
    5 proxy implementations without rewriting for each proxy
    type.  You can do apple to apple comparisons.  I
    realize that's the complete antithesis of current
    programming practices but there you have it.  :)

    A bit of clean up and performance tuning now.


    Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
    about what the C version was.

    Nice! Are you using pthread_getspecific or tss_get in you C version?


    Just thread_local.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Sun Nov 24 15:14:58 2024
    On 11/23/24 11:10, jseigh wrote:
    On 11/21/24 15:17, jseigh wrote:
    On 10/17/24 08:10, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free
    instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.


    I got a port to c++ working now. There are 5 proxy implementations
    1) smrproxy v2
    2) arcproxy - reference counted proxy
    3) rwlock based proxy
    4) mutex based proxy
    5) an unsafe proxy with no locking

    The testcase is templated so you can use any of the
    5 proxy implementations without rewriting for each proxy
    type.  You can do apple to apple comparisons.  I
    realize that's the complete antithesis of current
    programming practices but there you have it.  :)

    A bit of clean up and performance tuning now.


    Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
    about what the C version was.


    I've been using cpu time to measure performance. That's ok
    for lock-free/wait-free locking. For normal mutexes and
    shared locks, it doesn't measure wait time so those didn't
    look as bad as they really were. You can add logic
    to measure how long it takes to acquire a lock but that
    adds significant overhead.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sun Nov 24 19:09:05 2024
    On 11/24/24 18:19, Chris M. Thomasson wrote:
    On 11/24/2024 12:14 PM, jseigh wrote:
    On 11/23/24 11:10, jseigh wrote:
    On 11/21/24 15:17, jseigh wrote:
    On 10/17/24 08:10, jseigh wrote:
    I replaced the hazard pointer logic in smrproxy.  It's now wait-free >>>>> instead of mostly wait-free.  The reader lock logic after loading
    the address of the reader lock object into a register is now 2
    instructions a load followed by a store.  The unlock is same
    as before, just a store.

    It's way faster now.

    It's on the feature/003 branch as a POC.   I'm working on porting
    it to c++ and don't want to waste any more time on c version.

    No idea of it's a new algorithm.  I suspect that since I use
    the term epoch that it will be claimed that it's ebr, epoch
    based reclamation, and that all ebr algorithms are equivalent.
    Though I suppose you could argue it's qsbr if I point out what
    the quiescent states are.


    I got a port to c++ working now. There are 5 proxy implementations
    1) smrproxy v2
    2) arcproxy - reference counted proxy
    3) rwlock based proxy
    4) mutex based proxy
    5) an unsafe proxy with no locking

    The testcase is templated so you can use any of the
    5 proxy implementations without rewriting for each proxy
    type.  You can do apple to apple comparisons.  I
    realize that's the complete antithesis of current
    programming practices but there you have it.  :)

    A bit of clean up and performance tuning now.


    Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
    about what the C version was.


    I've been using cpu time to measure performance. That's ok
    for lock-free/wait-free locking.  For normal mutexes and
    shared locks, it doesn't measure wait time so those didn't
    look as bad as they really were.  You can add logic
    to measure how long it takes to acquire a lock but that
    adds significant overhead.

    I remember back in the day when I was comparing and contrasting various lock/wait-free algorithms with their 100% lock-based counter parts. Some
    of the lock-based tests too so long that I just terminated the damn
    program. Iirc, a lock-free test would take around 5 minutes. The lock-
    based test would be around 30+ minutes. This was way back on c.p.t.

    I set the iteration count as a parameter. Mutex can be particularly
    slow with a lot of reader threads. I usually see about 1000 - 10000
    times slower than smrproxy. rwlocks aren't as bad, about 200 x
    slower.

    Mutex, rwlock, and arcproxy use interlocked instructions so you
    can get a really wide performance range based on cache geometry
    and processor sets you run on.

    Joe Seigh


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Mon Nov 25 06:31:32 2024
    On 11/24/24 19:47, Chris M. Thomasson wrote:
    On 11/24/2024 4:09 PM, jseigh wrote:

    ...


    Actually, I remember way back where a scenario that had a lot of writes
    would start to mess with a deferred reclamation wrt a polling thread
    type of “scheme”. Too many deferred nodes would start to "pile up". Basically, the single polling thread was having trouble keeping up with
    all of them. The interlocked versions seemed to perform sort of "better"
    in a sense during periods that had a lot of frequent “writes”. Of course clever use of node caches helps heavy write periods. Anyway, some of the tests just used a mutex for writes, others used lock-free and would
    generate high loads of them that would push and pop nodes and defer them
    to the poll thread to test how much load it (poll thread) could take.


    Using deferred reclamation to implement a lock-free queue will slow
    things down. Also cache line thrashing will limit how fast queue
    operations will go no matter what.

    Better to parallelize. You can take a really bad lock-free queue implementation, parallelize it, and then tell everyone how fast
    your queue is.

    Single queue at 1 op/sec.
    Parallel queue on 1000000 processors, 1000000 op/sec.


    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Wed Nov 27 10:29:05 2024
    On 11/24/24 15:14, jseigh wrote:
    On 11/23/24 11:10, jseigh wrote:
    On 11/21/24 15:17, jseigh wrote:


    Ok, smrproxy lock/unlock is down to 0.6 nanoseconds now,
    about what the C version was.


    I've been using cpu time to measure performance. That's ok
    for lock-free/wait-free locking.  For normal mutexes and
    shared locks, it doesn't measure wait time so those didn't
    look as bad as they really were.  You can add logic
    to measure how long it takes to acquire a lock but that
    adds significant overhead.


    Some timings with 128 reader threads

    unsafe 52.983 nsecs ( 0.000) 860.576 nsecs ( 0.000)
    smr 54.714 nsecs ( 1.732) 882.356 nsecs ( 21.780) smrlite 53.149 nsecs ( 0.166) 870.066 nsecs ( 9.490)
    arc 739.833 nsecs ( 686.850) 11,988.289 nsecs ( 11,127.713)
    rwlock 1,078.306 nsecs ( 1,025.323) 17,309.882 nsecs ( 16,449.306)
    mutex 3,203.034 nsecs ( 3,150.052) 51,479.407 nsecs ( 50,618.831)

    The first column is cpu time, third column is elapsed time.
    unsafe is without any synchronized reader access. The
    value in parentheses is the unsafe access time subtracted
    out to separate out the synchronization overheads. smrlite is
    smr proxy with thread_local overhead. So smrproxy lock/unlock
    by itself is about 0.1 - 0.2 nanoseconds.

    I'm going to drop working on the whole proxy interface thing. The
    application can decide if it wants to hardcode a dependency on a
    particular 3rd party libarary implementation or abstract it out
    to a more portable api.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Mon Dec 9 13:34:11 2024
    On 11/27/24 10:29, jseigh wrote:

    Some timings with 128 reader threads

    unsafe        52.983 nsecs (      0.000)     860.576 nsecs (      0.000)
    smr           54.714 nsecs (      1.732)     882.356 nsecs (     21.780)
    smrlite       53.149 nsecs (      0.166)     870.066 nsecs (      9.490)
    arc          739.833 nsecs (    686.850)  11,988.289 nsecs ( 11,127.713)
    rwlock     1,078.306 nsecs (  1,025.323)  17,309.882 nsecs ( 16,449.306)
    mutex      3,203.034 nsecs (  3,150.052)  51,479.407 nsecs ( 50,618.831)

    The first column is cpu time, third column is elapsed time.
    unsafe is without any synchronized reader access.   The
    value in parentheses is the unsafe access time subtracted
    out to separate out the synchronization overheads.  smrlite is
    smr proxy with thread_local overhead.  So smrproxy lock/unlock
    by itself is about 0.1 - 0.2 nanoseconds.

    I'm going to drop working on the whole proxy interface thing.  The application can decide if it wants to hardcode a dependency on a
    particular 3rd party libarary implementation or abstract it out
    to a more portable api.


    I figured out where the smr vs smrlite overhead is likely coming from.

    1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
    2) overhead from lazy initialization, about .6 nsecs.

    smrlite most of the time doesn't show any measurable overhead,
    0 nsecs.

    Theoretically, you could do do lazy initialization with zero
    runtime overhead, but for most c++ apps, 1 millisecond is
    considered fast, so I don't think there would be much interest
    in it.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to jseigh on Wed Dec 11 07:13:30 2024
    On 12/9/24 13:34, jseigh wrote:
    On 11/27/24 10:29, jseigh wrote:

    I'm going to drop working on the whole proxy interface thing.  The
    application can decide if it wants to hardcode a dependency on a
    particular 3rd party libarary implementation or abstract it out
    to a more portable api.


    I figured out where the smr vs smrlite overhead is likely coming from.

    1) thread_local load about .3 nsecs, 2 for lock/unlock so .6 nsecs.
    2) overhead from lazy initialization, about .6 nsecs.

    smrlite most of the time doesn't show any measurable overhead,
    0 nsecs.

    Fixed the whole thread_local issue. Just don't use it. I
    went back to the api I was using in C and leave it up to
    the app to keep track of thread local objects. Fixes the
    problem of multiple proxies needing multiple per thread
    objects. This is a language issue, not an api issue.


    Theoretically, you could do do lazy initialization with zero
    runtime overhead, but for most c++ apps, 1 millisecond is
    considered fast, so I don't think there would be much interest
    in it.

    And apparantly I am right. :)

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 12 07:29:11 2024
    On 12/12/24 03:43, Chris M. Thomasson wrote:
    On 11/4/2024 4:46 AM, jseigh wrote:


    Hummm... If I remember correctly, you said something about using a
    simple atomic exchange to pop a whole list (lock-free stack), then
    simple reversing the list to get a fifo order? Do you remember any of
    that way back on c.p.t?

    That kind of stuff pre-dates c.p.t. even.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Dec 12 19:48:12 2024
    On 12/12/24 16:55, Chris M. Thomasson wrote:
    On 12/12/2024 4:29 AM, jseigh wrote:
    On 12/12/24 03:43, Chris M. Thomasson wrote:
    On 11/4/2024 4:46 AM, jseigh wrote:


    Hummm... If I remember correctly, you said something about using a
    simple atomic exchange to pop a whole list (lock-free stack), then
    simple reversing the list to get a fifo order? Do you remember any of
    that way back on c.p.t?

    That kind of stuff pre-dates c.p.t. even.

    Has to. Although, it was not in the IBM principles of operation where
    the describe their lock-free stack in an appendix, iirc... I cannot
    remember the exact one right now. Iirc, it was under free pool
    manipulation?

    If you pop off a LIFO stack onto another LIFO, everything becomes FIFO.

    Also, there's a doubly linked stack where the back links are lazily initialized. That seems to get reinvented every so often.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)