linux-kernel - Re: [PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANubcdVgSO1RwAU+mp5AguaZk78SqrBRS-eWbgz1mNLs7JA_uA@mail.gmail.com>
Date: Fri, 5 Dec 2025 15:13:09 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Andreas Gruenbacher <agruenba@...hat.com>
Cc: Johannes.Thumshirn@....com, hch@...radead.org, ming.lei@...hat.com, 
	hsiangkao@...ux.alibaba.com, csander@...estorage.com, colyli@...as.com, 
	linux-block@...r.kernel.org, linux-bcache@...r.kernel.org, 
	linux-kernel@...r.kernel.org, zhangshida@...inos.cn, 
	Christoph Hellwig <hch@....de>
Subject: Re: [PATCH v5 3/3] block: prevent race condition on bi_status in __bio_chain_endio

Andreas Gruenbacher <agruenba@...hat.com> 于2025年12月4日周四 20:01写道：
>
> On Thu, Dec 4, 2025 at 3:48 AM zhangshida <starzhangzsd@...il.com> wrote:
> > From: Shida Zhang <zhangshida@...inos.cn>
> >
> > Andreas point out that multiple completions can race setting
> > bi_status.
>
> What I've actually  pointed out is that the '!parent->bi_status' check
> in this statement is an unnecessary optimization that can be removed.
> But this is not what this discussion is mainly about anymore.
>
> In the current code, multiple completions can race setting bi_status,
> but that is fine as long as bi_status is never set to 0 during bio
> completion. The effect is that when there are multiple errors, the
> bi_error field of the final bio will eventually be set to an error
> code, but we don't know which error code will win. This all works
> correctly today, and there is no race to fix because the race is
> intentional.
>
> > If __bio_chain_endio() is called concurrently from multiple threads
> > accessing the same parent bio, it should use WRITE_ONCE()/READ_ONCE()
> > to access parent->bi_status and avoid data races.
> >
> > On x86 and ARM, these macros compile to the same instruction as a
> > normal write, but they may be required on other architectures to
> > prevent tearing, and to ensure the compiler does not add or remove
> > memory accesses under the assumption that the values are not accessed
> > concurrently.
>
> WRITE_ONCE() and READ_ONCE() also prevent the compiler from reordering
> operations. Even when the compiler doesn't seem to do anything nasty
> at the moment, it would probably still be worthwhile to use
> WRITE_ONCE() for setting bi_status throughout the code. But that's
> beyond the scope of this patch, and it calls for more than a global
> search and replace job.
>
> > Adopting a cmpxchg approach, as used in other code paths, resolves all
> > these issues, as suggested by Christoph.
>
> No, the cmpxchg() doesn't actually achieve anything, it only makes
> things worse. For example, when there is an A -> B chain, we can end
> up with the following sequence of events:
>
>   - A fails, sets A->bi_status, and calls bio_endio(A).
>   - B->status is still 0, so bio_endio(A) sets B->bi_status to A->bi_status.
>   - B fails and sets B->bi_status, OVERRIDING the value of A->bi_status.
>   - bio_endio(B) calls B->bi_end_io().
>
> Things get worse in an A -> B -> C chain, but I've already mentioned
> that earlier in this thread.
>
> So again, the cmpxchg() is unnecessary, but it is also harmless
> because it suggests that there is some form of synchronization that
> doesn't exist. The btrfs code that the cmpxchg() was taken from seems
> to implement actual first-failure-wins semantics, but this patch does
> not.
>
> The underlying question here is whether we want to change things so
> that bi_status is set to the first error that occurs (probably first
> in time, not first in the chain). If that is the goal, then we should
> be explicit about it. Right now, I don't see the need.
>

Thank you for the thorough discussion on this matter. Based on my
understanding, there appear to be two main viewpoints:

Side 1 (S1) argues:
A. Using cmpxchg() addresses all the issues that
READ_ONCE()/WRITE_ONCE() can solve. That said, cmpxchg()
appears to provide a comprehensive solution.
B. While cmpxchg() introduces some additional overhead, this may not
be significant since it occurs in a slow error-handling path.
Or nondeterministic order of overriding, Which is not a problem.

Side 2 (S2) argues:
A. While READ_ONCE()/WRITE_ONCE()-style protections may be
needed, addressing them in this specific patch may not be appropriate.
B. The cmpxchg() approach adds unnecessary complexity without clear
benefits.

The trade-off between S1B and S2B seems somewhat balanced, with
no clear winner.
However, regarding S1A versus S2A, I wonder:

If we agree that READ_ONCE()/WRITE_ONCE() protections are required
here, is there a specific reason we couldn't address just this code snippet
in this patch? We could then follow the S1 approach and use cmpxchg().

However, If this change remains controversial, I'd be happy to drop it and
resend v6 without this modification.

Thanks,
Shida





> Thanks,
> Andreas
>
> > Suggested-by: Andreas Gruenbacher <agruenba@...hat.com>
> > Suggested-by: Christoph Hellwig <hch@...radead.org>
> > Suggested-by: Caleb Sander Mateos <csander@...estorage.com>
> > Reviewed-by: Christoph Hellwig <hch@....de>
> > Signed-off-by: Shida Zhang <zhangshida@...inos.cn>
> > ---
> >  block/bio.c | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/block/bio.c b/block/bio.c
> > index cfb751dfcf5..51b57f9d8bd 100644
> > --- a/block/bio.c
> > +++ b/block/bio.c
> > @@ -314,8 +314,9 @@ static struct bio *__bio_chain_endio(struct bio *bio)
> >  {
> >         struct bio *parent = bio->bi_private;
> >
> > -       if (bio->bi_status && !parent->bi_status)
> > -               parent->bi_status = bio->bi_status;
> > +       if (bio->bi_status)
> > +               cmpxchg(&parent->bi_status, 0, bio->bi_status);
> > +
> >         bio_put(bio);
> >         return parent;
> >  }
> > --
> > 2.34.1
> >
>