Forum: >>> Magnum BBS <<<

Re: lock-free alternative to atomic>

From jseigh@21:1/5 to Chris M. Thomasson on Tue Aug 20 09:11:18 2024

On 8/19/2024 4:32 PM, Chris M. Thomasson wrote:

On 8/19/2024 7:22 AM, Bonita Montero wrote:

I've developed a lock-free alternative to atomic<shared_ptr<>>. The idea is that the pointer inside this atomic has two counters which come along
in a size_t'd word. The first counter has a width of 24 bit and counts
the number of control block increments which are in progress, so I'm
also using double reference counting. The remaing 40 bits of the second
word contain a counter which constantly increments when the pointer is
being overwritten. This prevents the lowerr 24 bit to be decremented
if the pointer is already overwritten.
The code is somewhat faster than the futex'd solution of MSVC and
mangitudes faster than the clang's libc++ atomic<shared_ptr<>>.

Here's the code:

[...]

Need to read it, but are you familiar with Joe Seighs atomic_ptr? Its been out for a long time.

I need to update it with actual C++ atomics. I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits
with 16 bits left over for a reference count since it's not likely C++ will support DWCAS. There's no actual technical
reason for that. They need to ask the linux kernel folk and everyone implementing lock-free code with DWCAS how they
manage to do that without atomic 128 bit loads. :)

Anyway, that isn't very high on my list since reference counting isn't very performant. I'm working on more interesting
stuff for now.

As far as split reference counting goes, I think they are doing something like atomic_ptr without local_ptr. So yes,
atomic_ptr to local atomic_ptr copies are expensive and they're doing some kind of convoluted batching to ameliorate
that. Maybe.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Tue Aug 20 15:08:49 2024

On 8/20/2024 9:21 AM, Bonita Montero wrote:

Am 20.08.2024 um 15:11 schrieb jseigh:

I need to update it with actual C++ atomics. I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits with 16 bits left over for
a reference count since it's not likely C++ will support DWCAS.

Only having 16 bits is dangerous and doesn't work with newer x86-CPUs
which use 55 bit for the userspace.

That's seriously big. There's not enough memory to require something like that. Memory mapped files or i/o space.
You'd have to be careful with your memory access patterns or paging overhead would kill your performance. I don't
even want to know what the page table layout looks like. Anyway 9 bits would limit you to just 512 concurrent
references, probably not enough.

I have an cmpxchg16b based macro. I used to have powerpc and sparc but no longer. Hopefully arm based pc's that
can run linux show up some time. Arm has some interesting atomics.

It would be nice if there was an atomic library that just had the primitives and not some half baked implemenation of
stdatomic with we don't want or need.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to jseigh on Tue Aug 20 20:51:58 2024

jseigh <[email protected]> writes:

On 8/20/2024 9:21 AM, Bonita Montero wrote:

Am 20.08.2024 um 15:11 schrieb jseigh:

I need to update it with actual C++ atomics. I'll have to use the folly lib trick of packing a 48 bit ptr into 64 bits with 16 bits left over for
a reference count since it's not likely C++ will support DWCAS.

Only having 16 bits is dangerous and doesn't work with newer x86-CPUs
which use 55 bit for the userspace.

That's seriously big. There's not enough memory to require something like that.

There sure is. More then enough. DAGS "CXL-memory".

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Wed Aug 21 08:24:58 2024

On 8/20/2024 11:29 PM, Chris M. Thomasson wrote:

On 8/20/2024 8:28 PM, Bonita Montero wrote:

Am 20.08.2024 um 21:27 schrieb Chris M. Thomasson:

Wrt your atomic_ptr:

struct anchor
{
void* pointer;
word count;
};

Wrt to standard C++11 it should be lock-free on many arch's by now, aka, they have a real DWCAS. It can be checked with is_lock_free() wrt:

g++ doesn't support 16 byte atomics which are lock-free, also not MSVC
(but 8 byte lock-free atomics in 32 bit mode). But clang++ does support
it. I use it with a "if constexpr( SUBSTITUTABLE )" in my dcas_atomic.h.

It's been a while since I checked them, but damn it... C++11 should be using native DWCAS if the underlying arch supports it wrt (struct anchor). Ahh shit. Well, I need to check again.

No, C++ won't. Their reasoning is they can only do it if a 128 bit atomic can be defined
with all the operations that other atomic types have. But some architectures don't have
128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
The idea that you could provide DWCAS some other way seems not to fit into their
world view. And lock-free programming isn't important to them.

https://en.cppreference.com/w/cpp/atomic/atomic/is_lock_free

is_always_lock_free is better

Touche. :^)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Thu Aug 22 11:17:03 2024

On 8/21/2024 8:44 AM, Bonita Montero wrote:

Am 21.08.2024 um 14:24 schrieb jseigh:

No, C++ won't. Their reasoning is they can only do it if a 128 bit atomic can be defined
with all the operations that other atomic types have. But some architectures don't have
128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
The idea that you could provide DWCAS some other way seems not to fit into their
world view. And lock-free programming isn't important to them.

It's not necessary to define an explicit atomic for that in the
standard. clang++ supports lock-free atomics up to any 128 bit
trivial type. That's what all compilers for x84 and ARM should
support.

Not sure what you are talking about. Clang documentation is only clear
if you already know the answer. I tried using _Atomic __uint128_t and
it didn't like it.

Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
support is still where it was in 2003, meaning I still have to use
the same techniques I used back then.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Thu Aug 22 12:34:11 2024

On 8/22/2024 11:46 AM, Bonita Montero wrote:

Am 22.08.2024 um 17:17 schrieb jseigh:

On 8/21/2024 8:44 AM, Bonita Montero wrote:

Am 21.08.2024 um 14:24 schrieb jseigh:

No, C++ won't. Their reasoning is they can only do it if a 128 bit atomic can be defined
with all the operations that other atomic types have. But some architectures don't have
128 bit atomic loads and/or stores (not performant ones anyway) so no go in their view.
The idea that you could provide DWCAS some other way seems not to fit into their
world view. And lock-free programming isn't important to them.

It's not necessary to define an explicit atomic for that in the
standard. clang++ supports lock-free atomics up to any 128 bit
trivial type. That's what all compilers for x84 and ARM should
support.

Not sure what you are talking about. Clang documentation is only clear
if you already know the answer. I tried using _Atomic __uint128_t and
it didn't like it.

Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
support is still where it was in 2003, meaning I still have to use
the same techniques I used back then.

I tested this with clang++-18:

#include <iostream>
#include <atomic>

using namespace std;

int main()
{
        atomic<__int128> a;
        cout << sizeof(a) << endl;
        cout << a.is_lock_free() << endl;
}

This prints:

    16
    0

So the atomic<__int128> is reported not to be lock-free, but
it calls libatomic, which internally uses CMPXCHG16B. Seems
the compiler thiks "I dont know what libatomic does, but I
guess it's not lockfree", although it actually.

Well, moot for now. It seems installing clang on my dev box
killed it.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Thu Aug 22 16:03:34 2024

On 8/22/2024 3:27 PM, Chris M. Thomasson wrote:

On 8/22/2024 8:17 AM, jseigh wrote:

Anyway I have a gcc inline DWCAS so as far as I am concerned C/C++
support is still where it was in 2003, meaning I still have to use
the same techniques I used back then.

Does this logic ring a bell? I saw your DWCAS asm impl _way_ back on c.p.t, 20+ years now?

____________________
.align 16
.globl np_ac_i686_atomic_dwcas_fence
np_ac_i686_atomic_dwcas_fence:
pushl %esi
pushl %ebx
movl 16(%esp), %esi
movl (%esi), %eax
movl 4(%esi), %edx
movl 20(%esp), %esi
movl (%esi), %ebx
movl 4(%esi), %ecx
movl 12(%esp), %esi
lock cmpxchg8b (%esi)
jne np_ac_i686_atomic_dwcas_fence_fail
xorl %eax, %eax
popl %ebx
popl %esi
ret

np_ac_i686_atomic_dwcas_fence_fail:
movl 16(%esp), %esi
movl %eax, (%esi)
movl %edx, 4(%esi)
movl $1, %eax
popl %ebx
popl %esi
ret
____________________

;^)

that would be for 32 bit cpus. I use gcc inline assembly. Also I update the expected value if the cas fails.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Thu Aug 22 21:26:07 2024

On 8/22/2024 4:33 PM, Chris M. Thomasson wrote:

On 8/22/2024 1:03 PM, jseigh wrote:

that would be for 32 bit cpus. I use gcc inline assembly. Also I update the expected value if the cas fails.

So does this wrt updating the expected value:

np_ac_i686_atomic_dwcas_fence_fail:
    movl 16(%esp), %esi
    movl %eax, (%esi)
    movl %edx, 4(%esi)
    movl $1, %eax
    popl %ebx
    popl %esi
ret

I remember you teaching me about atomic_ptr _way_ back over on c.p.t 20+ years ago. I saw your original inline asm wrt DWCAS for GCC.

From the atomic-ptr-plus project

//------------------------------------------------------------------------------
// __atomic_compare_exchange_16 --
//
// returns:
// rc = 0, fail - xcmp updated
// rc = 1, success - dest updated //------------------------------------------------------------------------------
int __atomic_compare_exchange_16(__int128 * dest, __int128 * xcmp, __int128 xxchg, int m, int ms, int mf) {
int rc;

__asm__ __volatile__ (
"lea %3, %%rsi ;\n" // address of exchange
"mov 0(%%rsi), %%rbx ;\n" // exchange low
"mov 8(%%rsi), %%rcx ;\n" // exchange high

"mov %2, %%rsi ;\n" // comparand
"mov 0(%%rsi), %%rax ;\n" // comparand low
"mov 8(%%rsi), %%rdx ;\n" // comparand high

"mov %1, %%rsi ;\n" // destination
"lock cmpxchg16b (%%rsi) ;\n"
"jz 1f ;\n"
"mov %2, %%rsi ;\n" // comparand
"mov %%rax, 0(%%rsi) ;\n" // comparand low
"mov %%rdx, 8(%%rsi) ;\n" // comparand high
"1: mov $0, %0 ;\n"
"setz %b0 ;\n" // rc =
: "=&a" (rc)
: "m" (dest), "m" (xcmp), "m" (xxchg)
: "cc", "memory", "rdx", "rbx", "rcx", "rsi"
);

return rc;
}

This is cut and pasted several times with some versions having bad offsets of 4 which
won't work very well. I know since I grabbed one of the bad versions for something
else I am working on. The last 3 parameters are ignored. They're just to document
memory fencing.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Thu Oct 3 10:50:31 2024

On 10/3/24 06:59, Bonita Montero wrote:

I got an idea why split reference counting doesn't safely work: If one
thread increments the pointer's reference count, gets suspended then
and another thread moves the pointer to another atomic<shared_ptr<>>
and "installs" it with a new local reference count a thread then might
erase the pointer in its new position and deallocate then the object.
This situation is not covered by split reference counting. I guess
that's the reason while MSVC, libstdc++ and libc++ use futexes and
don't supply a lock-free alternative.
I've dropped my code and now I implemented a futex'd solution, but
still with the mentioned shortcut when I assign a thsared_obj to a
shared_obj that tests if both pointers are the same before the futex
is locked.

I never did figure out what split reference counting was all about.
Even if you have endless amount of time to kill by looking at the
source, it still may not tell you what need to know. There's
"how you use this" and what not to use it for, i.e. the semantics.

A corollary of the burden of proof principle. If someone can't
be bothered to explain something, I don't feel any obligation
to figure out what the heck they are talking about.

Besides, I already have an atomic shared pointer implantation
that doesn't have those problems. It over 2 decades old at
this point. And the differential reference counting it's
based on is over 3 decades old.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Thu Oct 3 12:29:37 2024

On 10/3/24 11:35, Bonita Montero wrote:

Am 03.10.2024 um 16:50 schrieb jseigh:

I never did figure out what split reference counting was all about.
Even if you have endless amount of time to kill by looking at the
source, it still may not tell you what need to know. There's
"how you use this" and what not to use it for, i.e. the semantics.

https://www.youtube.com/results?search_query=lockfree+atomic+shared_ptr

A corollary of the burden of proof principle. If someone can't
be bothered to explain something, I don't feel any obligation
to figure out what the heck they are talking about.

It't so complicated that I got to the afforementioned conclusion
very late.

Besides, I already have an atomic shared pointer implantation
that doesn't have those problems.

With hazard pointers that's still possible, but not otherwise.

Don't take my word for it then. https://patents.google.com/patent/US7769791B2/en

Which also shows you can patent prior art.

Yes, it is well known, or maybe not so well known, that you can
implement atomic lock-free reference counting with deferred
reclamation, e.g. hazard pointers, rcu, garbage collection,
etc... A lot faster for read access that doesn't need to
change the reference count.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Bonita Montero on Fri Oct 4 11:39:45 2024

On 10/3/24 12:59, Bonita Montero wrote:

Am 03.10.2024 um 18:29 schrieb jseigh:

With hazard pointers that's still possible, but not otherwise.

Don't take my word for it then.
https://patents.google.com/patent/US7769791B2/en
Which also shows you can patent prior art.

Maybe hazard pointers were invented 2005.

Yes, it is well known, or maybe not so well known, that you
can implement atomic lock-free reference counting with deferred
reclamation, e.g. hazard pointers, rcu, garbage collection,
etc... A lot faster for read access that doesn't need to
change the reference count.

Look at the code which copies a tshared_obj<T> to a shared_obj<T>:

template<typename T>
shared_obj<T> &shared_obj<T>::operator =( tshared_obj<T> const &tso ) noexcept
{
    using namespace std;
    ctrl_t *ctrl = tso.m_ctrl.load( memory_order_relaxed );
    if( ctrl == m_ctrl )
        return *this;
    if( m_ctrl )
    {
        if( m_ctrl->decr() == 1 )
            delete m_ctrl;
        m_ctrl = nullptr;
    }
    lock_guard<futex> lock( tso.m_ftx );
    ctrl = tso.m_ctrl.load( memory_order_relaxed );
    if( ctrl == m_ctrl )
        return *this;
    m_ctrl = ctrl;
    ctrl->incr();
    return *this;
}

The shortcut in the beginning compares the two pointers before locking
the futex. So if the central tshared_ptr<> hasn't changed the whole
operation of updating the shared_obj<> is just a nanosecond and all participating cachelines stay in shared mode. With that you need no
hazard pointers, no RCU or whatever and you have immediate object
destruction without any delay.
I think it would be the best to suggest an assignment-operator for shared_ptr<> which takes an atomic<sharded_ptr<>> which also has this
simple optimization.
The idea is so brain damaged simple and makes all more advanced techno- logies for RCU-like patterns gratitious.

I'm guessing tshared_obj<T> is a non-shared local variable. So it's
just transfering a reference count from it to the new shared pointer.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jseigh@21:1/5 to Chris M. Thomasson on Sat Oct 5 09:11:15 2024

On 10/4/24 16:00, Chris M. Thomasson wrote:

On 10/4/2024 12:50 PM, Bonita Montero wrote:

Am 04.10.2024 um 21:40 schrieb Chris M. Thomasson:

I assume you know exactly how SMR works.

Maybe I'll use hazard pointers if C++26 supports it. But as I've
shown this is not necessary for atomic<shared_ptr<>> with this
small optimization.

Actually, split reference counting is strange to me because atomic_ptr
has already been out there for a long time and works fine.

Split reference counting is just premature optimization of
differential reference counting. It just makes it more
complicated than it has to be and does a pretty good job
of confusing a lot of people, case in point.

There's an insurance term called attractive nuisance for
stuff that just going to add extra liability and your
insurance rates are going to be higher.

The two examples I can think of off the top of my head
for compsci are reference counting and ring buffers.
There should be bright yellow OSHA all over these
when they teach it.

Reference counters and ring buffers are the yard
trampolines and lawn darts of computer programming.

Joe Seigh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Rixter
  Wed Jul 29 02:00:40 2026
  from Madison, Nc via Telnet
- Centurion
  Tue Jul 28 22:54:59 2026
  from Berea, Ohio via Telnet
- Bob Worm
  Tue Jul 28 16:01:18 2026
  from Wales, Uk via Telnet
- Rixter
  Tue Jul 28 13:42:46 2026
  from Madison, Nc via Telnet
- Krenn
  Tue Jul 28 11:59:57 2026
  from Sydney, Nsw via Telnet
- Rixter
  Tue Jul 28 01:23:48 2026
  from Madison, Nc via Telnet
- Centurion
  Mon Jul 27 22:50:42 2026
  from Berea, Ohio via Telnet
- Ataricrypt
  Mon Jul 27 19:19:17 2026
  from England via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	741
Nodes:	16 (2 / 14)
Uptime:	65:20:11
Calls:	12,446
Calls today:	1
Files:	15,194
Messages:	6,537,548

Re: lock-free alternative to atomic>

Who's Online

Recent Visitors

System Info