Message-ID: <YvuvxnfHIRUAuCrD@boqun-archlinux>
Date: Tue, 16 Aug 2022 07:55:02 -0700
From: Boqun Feng <boqun.feng@...il.com>
To: Will Deacon <will@...nel.org>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Herbert Xu <herbert@...dor.apana.org.au>,
Tejun Heo <tj@...nel.org>, marcan@...can.st,
peterz@...radead.org, jirislaby@...nel.org, maz@...nel.org,
mark.rutland@....com, catalin.marinas@....com, oneukum@...e.com,
roman.penyaev@...fitbricks.com, asahi@...ts.linux.dev,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH] workqueue: Fix memory ordering race in queue_work*()
On Tue, Aug 16, 2022 at 02:41:57PM +0100, Will Deacon wrote:
> On Mon, Aug 15, 2022 at 10:27:10PM -0700, Linus Torvalds wrote:
> > On Mon, Aug 15, 2022 at 9:15 PM Herbert Xu <herbert@...dor.apana.org.au> wrote:
> > >
> > > Please revert this as test_and_set_bit was always supposed to be
> > > a full memory barrier. This is an arch bug.
> >
> > Yes, the bitops are kind of strange for various legacy reasons:
> >
> > - set/clear_bit is atomic, but without a memory barrier, and needs a
> > "smp_mb__before_atomic()" or similar for barriers
> >
> > - test_and_set/clear_bit() are atomic, _and_ are memory barriers
> >
> > - test_and_set_bit_lock and test_and_clear_bit_unlock are atomic and
> > _weaker_ than full memory barriers, but sufficient for locking (ie
> > acquire/release)
> >
> > Does any of this make any sense at all? No. But those are the
> > documented semantics exactly because people were worried about
> > test_and_set_bit being used for locking, since on x86 all the atomics
> > are also memory barriers.
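
FWIW, a tiny made-up sketch of those three documented flavours (READY,
BUSY, flags, shared_data and notify_waiters() are stand-in names, not
from any real caller):

	/* set_bit()/clear_bit(): atomic but unordered on their own */
	WRITE_ONCE(shared_data, 1);
	smp_mb__before_atomic();
	set_bit(READY, &flags);

	/* test_and_set_bit(): documented as a full memory barrier */
	WRITE_ONCE(shared_data, 1);
	if (!test_and_set_bit(READY, &flags))
		notify_waiters();		/* made-up helper */

	/* _lock/_unlock variants: acquire/release only, enough for locking */
	if (!test_and_set_bit_lock(BUSY, &flags)) {
		/* critical section */
		clear_bit_unlock(BUSY, &flags);
	}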
> >
> > From looking at it, the asm-generic implementation is a bit
> > questionable, though. In particular, it does
> >
> > if (READ_ONCE(*p) & mask)
> > return 1;
> >
> > so it's *entirely* unordered for the "bit was already set" case.
> >
> > That looks very wrong to me, since it basically means that the
> > test_and_set_bit() can return "bit was already set" based on an old
> > value - not a memory barrier at all.
> >
> > So if you use "test_and_set_bit()" as some kind of "I've done my work,
> > now I am going to set the bit to tell people to pick it up", then that
> > early "bit was already set" code completely breaks it.
> >
> > Now, arguably our atomic bitop semantics are very very odd, and it
> > might be time to revisit them. But that code looks very very buggy to
> > me.
> >
> > The bug seems to go back to commit e986a0d6cb36 ("locking/atomics,
> > asm-generic/bitops/atomic.h: Rewrite using atomic_*() APIs"), and the
> > fix looks to be as simple as just removing that early READ_ONCE return
> > case (test_and_clear has the same bug).
> >
> > Will?
>
> Right, this looks like it's all my fault, so sorry about that.
>
> In an effort to replace the spinlock-based atomic bitops with a version
> based on atomic instructions in e986a0d6cb36, I inadvertently added this
> READ_ONCE() shortcut to test_and_set_bit() because at the time that's
> what we had (incorrectly) documented in our attempts at cleaning things
> up in this area. I confess that I have never been comfortable with the
> comment for test_and_set_bit() prior to my problematic patch:
>
> /**
> * test_and_set_bit - Set a bit and return its old value
> * @nr: Bit to set
> * @addr: Address to count from
> *
> * This operation is atomic and cannot be reordered.
> * It may be reordered on other architectures than x86.
> * It also implies a memory barrier.
> */
>
> so while Peter and I were trying to improve the documentation for
> atomics and memory barriers we clearly ended up making the wrong call
> trying to treat this like e.g. a cmpxchg() (which has the
> unordered-on-failure semantics).
>
> It's worth noting that with the spinlock-based implementation (i.e.
> prior to e986a0d6cb36) then we would have the same problem on
> architectures that implement spinlocks with acquire/release semantics;
> accesses from outside of the critical section can drift in and reorder
> with each other there, so the conversion looked legitimate to me in
> isolation and I vaguely remember going through callers looking for
> potential issues. Alas, I obviously missed this case.
>

I just want to mention that although the spinlock-based atomic bitops
don't provide a full barrier in test_and_set_bit(), they don't have the
problem spotted by Hector, because test_and_set_bit() and clear_bit()
sync with each other via locks:

	test_and_set_bit():
		lock(...);
		old = *p;	// mask is already set by another test_and_set_bit()
		*p = old | mask;
		unlock(...);

	clear_bit():
		lock(...);
		*p &= ~mask;
		unlock(...);
so "having a full barrier before test_and_set_bit()" may not be the
exact thing we need here, as long as a test_and_set_bit() can sync with
a clear_bit() uncondiontally, then the world is safe. For example, we
can make test_and_set_bit() RELEASE, and clear_bit() ACQUIRE on ARM64:

	test_and_set_bit():
		atomic_long_fetch_or_release(..);	// pairs with clear_bit() to
							// guarantee everything is
							// observed.

	clear_bit():
		atomic_long_fetch_andnot_acquire(..);
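
In include/asm-generic/bitops/atomic.h style, the idea would look
roughly like the below (illustration only, not a proposed patch):

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	/* RELEASE: everything before setting the bit is visible to
	 * whoever later clears it with the ACQUIRE below. */
	old = arch_atomic_long_fetch_or_release(mask, (atomic_long_t *)p);
	return !!(old & mask);
}

static __always_inline void
arch_clear_bit(unsigned int nr, volatile unsigned long *p)
{
	p += BIT_WORD(nr);
	/* ACQUIRE: pairs with the RELEASE in arch_test_and_set_bit(). */
	arch_atomic_long_fetch_andnot_acquire(BIT_MASK(nr),
					      (atomic_long_t *)p);
}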

Maybe that's somewhat cheaper than a full-barrier implementation.

Thoughts? I'm just trying to pin down the exact ordering requirement
for bitops here.

Regards,
Boqun
> So it looks to me like we need to:
>
> 1. Upgrade test_and_{set,clear}_bit() to have a full memory barrier
> regardless of the value which is read from memory. The lock/unlock
> flavours can remain as-is.
>
> 2. Fix the documentation
>
> 3. Figure out what to do about architectures building atomics out of
> spinlocks (probably ok as lock+unlock == full barrier there?)
>
> 4. Accept my sincerest apologies for the mess!
>
> > IOW, the proper fix for this should, I think, look something like this
> > (whitespace mangled) thing:
> >
> > --- a/include/asm-generic/bitops/atomic.h
> > +++ b/include/asm-generic/bitops/atomic.h
> > @@ -39,9 +39,6 @@ arch_test_and_set_bit(
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (READ_ONCE(*p) & mask)
> > - return 1;
> > -
> > old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> > @@ -53,9 +50,6 @@ arch_test_and_clear_bit
> > unsigned long mask = BIT_MASK(nr);
> >
> > p += BIT_WORD(nr);
> > - if (!(READ_ONCE(*p) & mask))
> > - return 0;
> > -
> > old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
> > return !!(old & mask);
> > }
> >
> > but the above is not just whitespace-damaged, it's entirely untested
> > and based purely on me looking at that code.
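
(With the whitespace intact, the result of that diff would look roughly
like the below; same caveat that it is untested:)

static __always_inline int
arch_test_and_set_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	old = arch_atomic_long_fetch_or(mask, (atomic_long_t *)p);
	return !!(old & mask);
}

static __always_inline int
arch_test_and_clear_bit(unsigned int nr, volatile unsigned long *p)
{
	long old;
	unsigned long mask = BIT_MASK(nr);

	p += BIT_WORD(nr);
	old = arch_atomic_long_fetch_andnot(mask, (atomic_long_t *)p);
	return !!(old & mask);
}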
>
> Yes, I think that's step 1, thanks! I'm a bit worried about the perf
> numbers on the other thread, but we can get to the bottom of that
> separately.
>
> Will