• Re: lock-free alternative to atomic>

    From jseigh@21:1/5 to Chris M. Thomasson on Tue Aug 20 09:11:18 2024
    On 8/19/2024 4:32 PM, Chris M. Thomasson wrote:
    On 8/19/2024 7:22 AM, Bonita Montero wrote:
    I've developed a lock-free alternative to atomic<shared_ptr<>>. The idea is that the pointer inside this atomic has two counters which come along
    in a size_t'd word. The first counter has a width of 24 bit and counts
    the number of control block increments which are in progress, so I'm
    also using double reference counting. The remaing 40 bits of the second
    word contain a counter which constantly increments when the pointer is
    being overwritten. This prevents the lowerr 24 bit to be decremented
    if the pointer is already overwritten.
    The code is somewhat faster than the futex'd solution of MSVC and
    mangitudes faster than the clang's libc++ atomic<shared_ptr<>>.

    Here's the code:
    [...]

    Need to read it, but are you familiar with Joe Seighs atomic_ptr? Its been out for a long time.


    I need to update it with actual C++ atomics. I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits
    with 16 bits left over for a reference count since it's not likely C++ will support DWCAS. There's no actual technical
    reason for that. They need to ask the linux kernel folk and everyone implementing lock-free code with DWCAS how they
    manage to do that without atomic 128 bit loads. :)

    Anyway, that isn't very high on my list since reference counting isn't very performant. I'm working on more interesting
    stuff for now.

    As far as split reference counting goes, I think they are doing something like atomic_ptr without local_ptr. So yes,
    atomic_ptr to local atomic_ptr copies are expensive and they're doing some kind of convoluted batching to ameliorate
    that. Maybe.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Tue Aug 20 15:08:49 2024
    On 8/20/2024 9:21 AM, Bonita Montero wrote:
    Am 20.08.2024 um 15:11 schrieb jseigh:

    I need to update it with actual C++ atomics.  I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits with 16 bits left over for
    a reference count since it's not likely C++ will support DWCAS.

    Only having 16 bits is dangerous and doesn't work with newer x86-CPUs
    which use 55 bit for the userspace.


    That's seriously big. There's not enough memory to require something like that. Memory mapped files or i/o space.
    You'd have to be careful with your memory access patterns or paging overhead would kill your performance. I don't
    even want to know what the page table layout looks like. Anyway 9 bits would limit you to just 512 concurrent
    references, probably not enough.

    I have an cmpxchg16b based macro. I used to have powerpc and sparc but no longer. Hopefully arm based pc's that
    can run linux show up some time. Arm has some interesting atomics.

    It would be nice if there was an atomic library that just had the primitives and not some half baked implemenation of
    stdatomic with we don't want or need.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Scott Lurndal@21:1/5 to jseigh on Tue Aug 20 20:51:58 2024
    jseigh <[email protected]> writes:
    On 8/20/2024 9:21 AM, Bonita Montero wrote:
    Am 20.08.2024 um 15:11 schrieb jseigh:

    I need to update it with actual C++ atomics.  I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits with 16 bits left over for
    a reference count since it's not likely C++ will support DWCAS.

    Only having 16 bits is dangerous and doesn't work with newer x86-CPUs
    which use 55 bit for the userspace.


    That's seriously big. There's not enough memory to require something like that.

    There sure is. More then enough. DAGS "CXL-memory".

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Wed Aug 21 08:24:58 2024
    On 8/20/2024 11:29 PM, Chris M. Thomasson wrote:
    On 8/20/2024 8:28 PM, Bonita Montero wrote:
    Am 20.08.2024 um 21:27 schrieb Chris M. Thomasson:

    Wrt your atomic_ptr:

    struct anchor
    {
         void* pointer;
         word count;
    };

    Wrt to standard C++11 it should be lock-free on many arch's by now, aka, they have a real DWCAS. It can be checked with is_lock_free() wrt:

    g++ doesn't support 16 byte atomics which are lock-free, also not MSVC
    (but 8 byte lock-free atomics in 32 bit mode). But clang++ does support
    it. I use it with a "if constexpr( SUBSTITUTABLE )" in my dcas_atomic.h.

    It's been a while since I checked them, but damn it... C++11 should be using native DWCAS if the underlying arch supports it wrt (struct anchor). Ahh shit. Well, I need to check again.


    No, C++ won't. Their reasoning is they can only do it if a 128 bit atomic can be defined
    with all the operations that other atomic types have. But some architectures don't have
    128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
    The idea that you could provide DWCAS some other way seems not to fit into their
    world view. And lock-free programming isn't important to them.



    https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free

    is_always_lock_free is better

    Touche. :^)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Thu Aug 22 11:17:03 2024
    On 8/21/2024 8:44 AM, Bonita Montero wrote:
    Am 21.08.2024 um 14:24 schrieb jseigh:

    No, C++ won't.  Their reasoning is they can only do it if a 128 bit atomic can be defined
    with all the operations that other atomic types have.  But some architectures don't have
    128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
    The idea that you could provide DWCAS some other way seems not to fit into their
    world view.  And lock-free programming isn't important to them.

    It's not necessary to define an explicit atomic for that in the
    standard. clang++ supports lock-free atomics up to any 128 bit
    trivial type. That's what all compilers for x84 and ARM should
    support.

    Not sure what you are talking about. Clang documentation is only clear
    if you already know the answer. I tried using _Atomic __uint128_t and
    it didn't like it.

    Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
    support is still where it was in 2003, meaning I still have to use
    the same techniques I used back then.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Thu Aug 22 12:34:11 2024
    On 8/22/2024 11:46 AM, Bonita Montero wrote:
    Am 22.08.2024 um 17:17 schrieb jseigh:
    On 8/21/2024 8:44 AM, Bonita Montero wrote:
    Am 21.08.2024 um 14:24 schrieb jseigh:

    No, C++ won't.  Their reasoning is they can only do it if a 128 bit atomic can be defined
    with all the operations that other atomic types have.  But some architectures don't have
    128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
    The idea that you could provide DWCAS some other way seems not to fit into their
    world view.  And lock-free programming isn't important to them.

    It's not necessary to define an explicit atomic for that in the
    standard. clang++ supports lock-free atomics up to any 128 bit
    trivial type. That's what all compilers for x84 and ARM should
    support.

    Not sure what you are talking about.  Clang documentation is only clear
    if you already know the answer.  I tried using _Atomic __uint128_t and
    it didn't like it.

    Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
    support is still where it was in 2003, meaning I still have to use
    the same techniques I used back then.

    I tested this with clang++-18:

    #include <iostream>
    #include <atomic>

    using namespace std;

    int main()
    {
            atomic<__int128> a;
            cout << sizeof(a) << endl;
            cout << a.is_lock_free() << endl;
    }

    This prints:

        16
        0

    So the atomic<__int128> is reported not to be lock-free, but
    it calls libatomic, which internally uses CMPXCHG16B. Seems
    the compiler thiks "I dont know what libatomic does, but I
    guess it's not lockfree", although it actually.



    Well, moot for now. It seems installing clang on my dev box
    killed it.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Aug 22 16:03:34 2024
    On 8/22/2024 3:27 PM, Chris M. Thomasson wrote:
    On 8/22/2024 8:17 AM, jseigh wrote:

    Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
    support is still where it was in 2003, meaning I still have to use
    the same techniques I used back then.

    Does this logic ring a bell? I saw your DWCAS asm impl _way_ back on c.p.t, 20+ years now?

    ____________________
    .align 16
    .globl np_ac_i686_atomic_dwcas_fence
    np_ac_i686_atomic_dwcas_fence:
      pushl %esi
      pushl %ebx
      movl 16(%esp), %esi
      movl (%esi), %eax
      movl 4(%esi), %edx
      movl 20(%esp), %esi
      movl (%esi), %ebx
      movl 4(%esi), %ecx
      movl 12(%esp), %esi
      lock cmpxchg8b (%esi)
      jne np_ac_i686_atomic_dwcas_fence_fail
      xorl %eax, %eax
      popl %ebx
      popl %esi
      ret

    np_ac_i686_atomic_dwcas_fence_fail:
      movl 16(%esp), %esi
      movl %eax, (%esi)
      movl %edx, 4(%esi)
      movl $1, %eax
      popl %ebx
      popl %esi
    ret
    ____________________

    ;^)


    that would be for 32 bit cpus. I use gcc inline assembly. Also I update the expected value if the cas fails.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Thu Aug 22 21:26:07 2024
    On 8/22/2024 4:33 PM, Chris M. Thomasson wrote:
    On 8/22/2024 1:03 PM, jseigh wrote:
    that would be for 32 bit cpus.  I use gcc inline assembly.  Also I update the expected value if the cas fails.

    So does this wrt updating the expected value:

    np_ac_i686_atomic_dwcas_fence_fail:
        movl 16(%esp), %esi
        movl %eax, (%esi)
        movl %edx, 4(%esi)
        movl $1, %eax
        popl %ebx
        popl %esi
    ret

    I remember you teaching me about atomic_ptr _way_ back over on c.p.t 20+ years ago. I saw your original inline asm wrt DWCAS for GCC.

    From the atomic-ptr-plus project

    //------------------------------------------------------------------------------
    // __atomic_compare_exchange_16 --
    //
    // returns:
    // rc = 0, fail - xcmp updated
    // rc = 1, success - dest updated //------------------------------------------------------------------------------
    int __atomic_compare_exchange_16(__int128 * dest, __int128 * xcmp, __int128 xxchg, int m, int ms, int mf) {
    int rc;

    __asm__ __volatile__ (
    "lea %3, %%rsi ;\n" // address of exchange
    "mov 0(%%rsi), %%rbx ;\n" // exchange low
    "mov 8(%%rsi), %%rcx ;\n" // exchange high

    "mov %2, %%rsi ;\n" // comparand
    "mov 0(%%rsi), %%rax ;\n" // comparand low
    "mov 8(%%rsi), %%rdx ;\n" // comparand high

    "mov %1, %%rsi ;\n" // destination
    "lock cmpxchg16b (%%rsi) ;\n"
    "jz 1f ;\n"
    "mov %2, %%rsi ;\n" // comparand
    "mov %%rax, 0(%%rsi) ;\n" // comparand low
    "mov %%rdx, 8(%%rsi) ;\n" // comparand high
    "1: mov $0, %0 ;\n"
    "setz %b0 ;\n" // rc =
    : "=&a" (rc)
    : "m" (dest), "m" (xcmp), "m" (xxchg)
    : "cc", "memory", "rdx", "rbx", "rcx", "rsi"
    );

    return rc;
    }

    This is cut and pasted several times with some versions having bad offsets of 4 which
    won't work very well. I know since I grabbed one of the bad versions for something
    else I am working on. The last 3 parameters are ignored. They're just to document
    memory fencing.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Thu Oct 3 10:50:31 2024
    On 10/3/24 06:59, Bonita Montero wrote:
    I got an idea why split reference counting doesn't safely work: If one
    thread increments the pointer's reference count, gets suspended then
    and another thread moves the pointer to another atomic<shared_ptr<>>
    and "installs" it with a new local reference count a thread then might
    erase the pointer in its new position and deallocate then the object.
    This situation is not covered by split reference counting. I guess
    that's the reason while MSVC, libstdc++ and libc++ use futexes and
    don't supply a lock-free alternative.
    I've dropped my code and now I implemented a futex'd solution, but
    still with the mentioned shortcut when I assign a thsared_obj to a
    shared_obj that tests if both pointers are the same before the futex
    is locked.


    I never did figure out what split reference counting was all about.
    Even if you have endless amount of time to kill by looking at the
    source, it still may not tell you what need to know. There's
    "how you use this" and what not to use it for, i.e. the semantics.

    A corollary of the burden of proof principle. If someone can't
    be bothered to explain something, I don't feel any obligation
    to figure out what the heck they are talking about.

    Besides, I already have an atomic shared pointer implantation
    that doesn't have those problems. It over 2 decades old at
    this point. And the differential reference counting it's
    based on is over 3 decades old.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Thu Oct 3 12:29:37 2024
    On 10/3/24 11:35, Bonita Montero wrote:
    Am 03.10.2024 um 16:50 schrieb jseigh:

    I never did figure out what split reference counting was all about.
    Even if you have endless amount of time to kill by looking at the
    source, it still may not tell you what need to know.  There's
    "how you use this" and what not to use it for, i.e. the semantics.

    https://www.youtube.com/results?search_query=lockfree+atomic+shared_ptr

    A corollary of the burden of proof principle.  If someone can't
    be bothered to explain something, I don't feel any obligation
    to figure out what the heck they are talking about.

    It't so complicated that I got to the afforementioned conclusion
    very late.

    Besides, I already have an atomic shared pointer implantation
    that doesn't have those problems.

    With hazard pointers that's still possible, but not otherwise.

    Don't take my word for it then. https://patents.google.com/patent/US7769791B2/en

    Which also shows you can patent prior art.

    Yes, it is well known, or maybe not so well known, that you can
    implement atomic lock-free reference counting with deferred
    reclamation, e.g. hazard pointers, rcu, garbage collection,
    etc... A lot faster for read access that doesn't need to
    change the reference count.


    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Bonita Montero on Fri Oct 4 11:39:45 2024
    On 10/3/24 12:59, Bonita Montero wrote:
    Am 03.10.2024 um 18:29 schrieb jseigh:

    With hazard pointers that's still possible, but not otherwise.

    Don't take my word for it then.
    https://patents.google.com/patent/US7769791B2/en
    Which also shows you can patent prior art.

    Maybe hazard pointers were invented 2005.

    Yes, it is well known, or maybe not so well known, that you
    can implement atomic lock-free reference counting with deferred
    reclamation, e.g. hazard pointers, rcu, garbage collection,
    etc...  A lot faster for read access that doesn't need to
    change the reference count.

    Look at the code which copies a tshared_obj<T> to a shared_obj<T>:

    template<typename T>
    shared_obj<T> &shared_obj<T>::operator =( tshared_obj<T> const &tso ) noexcept
    {
        using namespace std;
        ctrl_t *ctrl = tso.m_ctrl.load( memory_order_relaxed );
        if( ctrl == m_ctrl )
            return *this;
        if( m_ctrl )
        {
            if( m_ctrl->decr() == 1 )
                delete m_ctrl;
            m_ctrl = nullptr;
        }
        lock_guard<futex> lock( tso.m_ftx );
        ctrl = tso.m_ctrl.load( memory_order_relaxed );
        if( ctrl == m_ctrl )
            return *this;
        m_ctrl = ctrl;
        ctrl->incr();
        return *this;
    }


    The shortcut in the beginning compares the two pointers before locking
    the futex. So if the central tshared_ptr<> hasn't changed the whole
    operation of updating the shared_obj<> is just a nanosecond and all participating cachelines stay in shared mode. With that you need no
    hazard pointers, no RCU or whatever and you have immediate object
    destruction without any delay.
    I think it would be the best to suggest an assignment-operator for shared_ptr<> which takes an atomic<sharded_ptr<>> which also has this
    simple optimization.
    The idea is so brain damaged simple and makes all more advanced techno- logies for RCU-like patterns gratitious.


    I'm guessing tshared_obj<T> is a non-shared local variable. So it's
    just transfering a reference count from it to the new shared pointer.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jseigh@21:1/5 to Chris M. Thomasson on Sat Oct 5 09:11:15 2024
    On 10/4/24 16:00, Chris M. Thomasson wrote:
    On 10/4/2024 12:50 PM, Bonita Montero wrote:
    Am 04.10.2024 um 21:40 schrieb Chris M. Thomasson:

    I assume you know exactly how SMR works.

    Maybe I'll use hazard pointers if C++26 supports it. But as I've
    shown this is not necessary for atomic<shared_ptr<>> with this
    small optimization.


    Actually, split reference counting is strange to me because atomic_ptr
    has already been out there for a long time and works fine.

    Split reference counting is just premature optimization of
    differential reference counting. It just makes it more
    complicated than it has to be and does a pretty good job
    of confusing a lot of people, case in point.

    There's an insurance term called attractive nuisance for
    stuff that just going to add extra liability and your
    insurance rates are going to be higher.

    The two examples I can think of off the top of my head
    for compsci are reference counting and ring buffers.
    There should be bright yellow OSHA all over these
    when they teach it.

    Reference counters and ring buffers are the yard
    trampolines and lawn darts of computer programming.

    Joe Seigh

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)