[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADUfDZoWjSHRVFnxh=QK7HkQSB5zBZvePjkV-+2Esrf8jtnekg@mail.gmail.com>
Date: Tue, 2 Dec 2025 20:34:11 -0800
From: Caleb Sander Mateos <csander@...estorage.com>
To: Stephen Zhang <starzhangzsd@...il.com>
Cc: Andreas Gruenbacher <agruenba@...hat.com>, Christoph Hellwig <hch@....de>, Johannes.Thumshirn@....com,
ming.lei@...hat.com, hsiangkao@...ux.alibaba.com, colyli@...as.com,
linux-block@...r.kernel.org, linux-bcache@...r.kernel.org,
linux-kernel@...r.kernel.org, zhangshida@...inos.cn
Subject: Re: [PATCH v4 3/3] block: prevent race condition on bi_status in __bio_chain_endio
On Tue, Dec 2, 2025 at 7:10 PM Stephen Zhang <starzhangzsd@...il.com> wrote:
>
> Stephen Zhang <starzhangzsd@...il.com> 于2025年12月3日周三 09:51写道:
> >
> > Andreas Gruenbacher <agruenba@...hat.com> 于2025年12月3日周三 05:15写道:
> > >
> > > On Tue, Dec 2, 2025 at 6:48 AM Christoph Hellwig <hch@....de> wrote:
> > > > On Mon, Dec 01, 2025 at 02:07:07PM +0100, Andreas Gruenbacher wrote:
> > > > > On Mon, Dec 1, 2025 at 12:25 PM Christoph Hellwig <hch@...radead.org> wrote:
> > > > > > On Mon, Dec 01, 2025 at 11:22:32AM +0100, Andreas Gruenbacher wrote:
> > > > > > > > - if (bio->bi_status && !parent->bi_status)
> > > > > > > > - parent->bi_status = bio->bi_status;
> > > > > > > > + if (bio->bi_status)
> > > > > > > > + cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > > > > > >
> > > > > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > > > > again (to a non-0 value) when a failure occurs. When there are
> > > > > > > multiple failures, we only need to make sure that one of those
> > > > > > > failures is eventually reported, but for that, a simple assignment is
> > > > > > > enough here.
> > > > > >
> > > > > > A simple assignment doesn't guarantee atomicy.
> > > > >
> > > > > Well, we've already discussed that bi_status is a single byte and so
> > > > > tearing won't be an issue. Otherwise, WRITE_ONCE() would still be
> > > > > enough here.
> > > >
> > > > No. At least older alpha can tear byte updates as they need a
> > > > read-modify-write cycle.
> > >
> > > I know this used to be a thing in the past, but to see that none of
> > > that is relevant anymore today, have a look at where [*] quotes the
> > > C11 standard:
> > >
> > > memory location
> > > either an object of scalar type, or a maximal sequence
> > > of adjacent bit-fields all having nonzero width
> > >
> > > NOTE 1: Two threads of execution can update and access
> > > separate memory locations without interfering with
> > > each other.
> > >
> > > NOTE 2: A bit-field and an adjacent non-bit-field member
> > > are in separate memory locations. The same applies
> > > to two bit-fields, if one is declared inside a nested
> > > structure declaration and the other is not, or if the two
> > > are separated by a zero-length bit-field declaration,
> > > or if they are separated by a non-bit-field member
> > > declaration. It is not safe to concurrently update two
> > > bit-fields in the same structure if all members declared
> > > between them are also bit-fields, no matter what the
> > > sizes of those intervening bit-fields happen to be.
> > >
> > > [*] Documentation/memory-barriers.txt
> > >
> > > > But even on normal x86 the check and the update would be racy.
> > >
> > > There is no check and update (RMW), though. Quoting what I wrote
> > > earlier in this thread:
> > >
> > > On Mon, Dec 1, 2025 at 11:22 AM Andreas Gruenbacher <agruenba@...hat.com> wrote:
> > > > Hmm. I don't think cmpxchg() actually is of any value here: for all
> > > > the chained bios, bi_status is initialized to 0, and it is only set
> > > > again (to a non-0 value) when a failure occurs. When there are
> > > > multiple failures, we only need to make sure that one of those
> > > > failures is eventually reported, but for that, a simple assignment is
> > > > enough here. The cmpxchg() won't guarantee that a specific error value
> > > > will survive; it all still depends on the timing. The cmpxchg() only
> > > > makes it look like something special is happening here with respect to
> > > > ordering.
> > >
> > > So with or without the cmpxchg(), if there are multiple errors, we
> > > won't know which bi_status code will survive, but we do know that we
> > > will end up with one of those error codes.
> > >
> >
> > Thank you for sharing your insights—I found the discussion very enlightening.
> >
> > While I agree with Andreas’s perspective, I also very much appreciate
> > the clarity
> > and precision offered by the cmpxchg() approach. That’s why when Christoph
> > suggested it, I was happy to incorporate it into the code.
> >
> > But a cmpxchg is a little bit redundant here.
> > so we will change it to the simple assignment:
> >
> > - if (bio->bi_status && !parent->bi_status)
> > parent->bi_status = bio->bi_status;
> > + if (bio->bi_status)
> > parent->bi_status = bio->bi_status;
> >
> > I will integrate this discussion into the commit message, it is very insightful.
> >
>
> Hi,
>
> I’ve been reconsidering the two approaches for the upcoming patch revision.
> Essentially, we’re comparing two methods:
> A:
> if (bio->bi_status)
> parent->bi_status = bio->bi_status;
> B:
> if (bio->bi_status)
> cmpxchg(&parent->bi_status, 0, bio->bi_status);
>
> Both appear correct, but B seems a little bit redundant here.
> Upon further reflection, I’ve noticed a subtle difference:
> A unconditionally writes to parent->bi_status when bio->bi_status is non-zero,
> regardless of the current value of parent->bi_status.
> B uses cmpxchg to only update parent->bi_status if it is still zero.
>
> Thus, B could avoid unnecessary writes in cases where parent->bi_status has
> already been set to a non-zero value.
>
> Do you think this optimization would be beneficial in practice, or is
> the difference
> negligible?
cmpxchg() is much more expensive than a plain write, as it compiles
into an atomic instruction (or load-linked/store-conditional pair on
architectures that don't provide such an instruction). This requires
the processor to take exclusive access of the cache line for the
duration of the operation (or check whether the cache line has been
invalidated during the operation). On x86, cmpxchg() marks the cache
line as modified regardless of whether the compare succeeds and the
value is updated. So it doesn't actually avoid the cost of a write.
However, the original code's check of parent->bi_status before writing
to it *does* avoid the write if parent->bi_status is already nonzero.
So the optimization you're looking for is already implemented :)
If __bio_chain_endio() can be called concurrently from multiple
threads accessing the same parent bio, it should be using WRITE_ONCE()
(and READ_ONCE(), if applicable) to access parent->bi_status to avoid
the data race. On x86 and ARM these macros produce the same
instruction as a normal write, but they may be required on other
architectures to avoid tearing, as well as to prevent the compiler
from adding or removing memory accesses based on the assumption that
the values aren't being accessed concurrently by other threads.
Best,
Caleb
Powered by blists - more mailing lists