Message-ID: <CANubcdUG_3VwagV-cSfhp4+95Dj_e-wkxegzCdmuNieWqrehug@mail.gmail.com>
Date: Mon, 24 Nov 2025 10:00:42 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Ming Lei <ming.lei@...hat.com>
Cc: Andreas Gruenbacher <agruenba@...hat.com>, linux-kernel@...r.kernel.org,
linux-block@...r.kernel.org, nvdimm@...ts.linux.dev,
virtualization@...ts.linux.dev, linux-nvme@...ts.infradead.org,
gfs2@...ts.linux.dev, ntfs3@...ts.linux.dev, linux-xfs@...r.kernel.org,
zhangshida@...inos.cn, Coly Li <colyli@...as.com>, linux-bcache@...r.kernel.org
Subject: Re: Fix potential data loss and corruption due to Incorrect BIO Chain Handling
On Mon, Nov 24, 2025 at 09:28, Stephen Zhang <starzhangzsd@...il.com> wrote:
>
> On Sun, Nov 23, 2025 at 21:49, Ming Lei <ming.lei@...hat.com> wrote:
> >
> > On Sat, Nov 22, 2025 at 03:56:58PM +0100, Andreas Gruenbacher wrote:
> > > On Sat, Nov 22, 2025 at 1:07 PM Ming Lei <ming.lei@...hat.com> wrote:
> > > > > static void bio_chain_endio(struct bio *bio)
> > > > > {
> > > > > 	bio_endio(__bio_chain_endio(bio));
> > > > > }
> > > >
> > > > bio_chain_endio() never gets called really, which can be thought as `flag`,
> > >
> > > That's probably where this stops being relevant for the problem
> > > reported by Stephen Zhang.
> > >
> > > > and it should have been defined as `WARN_ON_ONCE(1);` for not confusing people.
> > >
> > > But shouldn't bio_chain_endio() still be fixed to do the right thing
> > > if called directly, or alternatively, just BUG()? Warning and still
> > > doing the wrong thing seems a bit bizarre.
> >
> > IMO calling ->bi_end_io() directly shouldn't be encouraged.
> >
> > The only in-tree direct call user could be bcache, so is this reported
> > issue triggered on bcache?
> >
I need to confirm the details later. However, suppose our analysis provides
a theoretical model that explains all of the observed phenomena without any
inconsistency, and that the real-world problem exhibits exactly those
phenomena. In that case, the chance that the analysis is incorrect is very
low. Even if bcache is not part of the running configuration, our later
investigation will revolve around that analysis.

So what I want to explore further is: can this analysis really hold up and
explain everything without inconsistencies, assuming we allow the runtime
configuration to be as complex as necessary?
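
To make the concern about direct ->bi_end_io() calls concrete, here is a
minimal userspace sketch. It is not kernel code and it is not confirmed to
be the scenario behind the report. It assumes only that bio_endio() checks a
per-bio remaining counter before walking up the chain, and that a direct call
to the chain end_io handler skips that check. All model_* names are
hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace model of a chained bio; NOT the kernel structure. */
struct model_bio {
	int remaining;			/* stands in for __bi_remaining */
	int completed;			/* times this bio was completed */
	struct model_bio *parent;	/* stands in for bi_private on a chained bio */
};

static void model_bio_init(struct model_bio *b)
{
	b->remaining = 1;
	b->completed = 0;
	b->parent = NULL;
}

/* Models bio_chain(): the parent now waits for one more completion. */
static void model_bio_chain(struct model_bio *child, struct model_bio *parent)
{
	child->parent = parent;
	parent->remaining++;
}

/* Models bio_endio(): only the last completion propagates to the parent. */
static void model_bio_endio(struct model_bio *b)
{
	while (b) {
		assert(b->remaining > 0);
		if (--b->remaining > 0)
			return;
		b->completed++;
		b = b->parent;
	}
}

/* Models a direct ->bi_end_io() call: the remaining check is skipped. */
static void model_direct_end_io(struct model_bio *b)
{
	b->completed++;
	if (b->parent)
		model_bio_endio(b->parent);
}

/*
 * Builds top <- mid <- leaf, completes mid and top while leaf is still
 * in flight, and returns how many times top has been completed.
 */
int model_top_completes_early(int call_directly)
{
	struct model_bio top, mid, leaf;

	model_bio_init(&top);
	model_bio_init(&mid);
	model_bio_init(&leaf);
	model_bio_chain(&mid, &top);
	model_bio_chain(&leaf, &mid);

	if (call_directly)
		model_direct_end_io(&mid);
	else
		model_bio_endio(&mid);
	model_bio_endio(&top);		/* top's own I/O finishes */

	return top.completed;		/* leaf has not completed yet */
}
```

In this model, model_top_completes_early(0) returns 0 because mid still
waits for leaf, while model_top_completes_early(1) returns 1: the top-level
bio is reported complete although leaf is still outstanding.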
Thanks,
Shida
> > If bcache can't call bio_endio(), I think it is fine to fix
> > bio_chain_endio().
> >
> > >
> > > I also see direct bi_end_io calls in erofs_fileio_ki_complete(),
> > > erofs_fscache_bio_endio(), and erofs_fscache_submit_bio(), so those
> > > are at least confusing.
> >
> > All looks FS bio(non-chained), so bio_chain_endio() shouldn't be involved
> > in erofs code base.
> >
>
> Okay, will add that.
>
> Thanks,
> Shida
>
> >
> > Thanks,
> > Ming
> >