linux-kernel - Re: [PATCH 4/5] ceph: Assert writeback loop invariants

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAH5Ym4iZ7ZTsFxHCHQPnHS5U-vZL8=Z_5T0cXJeB56dzw+CQpw@mail.gmail.com>
Date: Tue, 6 Jan 2026 16:33:38 -0800
From: Sam Edwards <cfsworks@...il.com>
To: Viacheslav Dubeyko <Slava.Dubeyko@....com>
Cc: Milind Changire <mchangir@...hat.com>, Xiubo Li <xiubli@...hat.com>, 
	"idryomov@...il.com" <idryomov@...il.com>, "brauner@...nel.org" <brauner@...nel.org>, 
	"ceph-devel@...r.kernel.org" <ceph-devel@...r.kernel.org>, "jlayton@...nel.org" <jlayton@...nel.org>, 
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 4/5] ceph: Assert writeback loop invariants

On Tue, Jan 6, 2026 at 3:00 PM Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
>
> On Mon, 2026-01-05 at 22:53 -0800, Sam Edwards wrote:
> > On Mon, Jan 5, 2026 at 2:29 PM Viacheslav Dubeyko <Slava.Dubeyko@....com> wrote:
> > >
> > > On Tue, 2025-12-30 at 18:43 -0800, Sam Edwards wrote:
> > > > If `locked_pages` is zero, the page array must not be allocated:
> > > > ceph_process_folio_batch() uses `locked_pages` to decide when to
> > > > allocate `pages`, and redundant allocations trigger
> > > > ceph_allocate_page_array()'s BUG_ON(), resulting in a worker oops (and
> > > > writeback stall) or even a kernel panic. Consequently, the main loop in
> > > > ceph_writepages_start() assumes that the lifetime of `pages` is confined
> > > > to a single iteration.
> > > >
> > > > This expectation is currently not clear enough, as evidenced by the
> > > > previous two patches which fix oopses caused by `pages` persisting into
> > > > the next loop iteration.
> > > >
> > > > Use an explicit BUG_ON() at the top of the loop to assert the loop's
> > > > preexisting expectation that `pages` is cleaned up by the previous
> > > > iteration. Because this is closely tied to `locked_pages`, also make it
> > > > the previous iteration's responsibility to guarantee its reset, and
> > > > verify with a second new BUG_ON() instead of handling (and masking)
> > > > failures to do so.
> > > >
> > > > Signed-off-by: Sam Edwards <CFSworks@...il.com>
> > > > ---
> > > >  fs/ceph/addr.c | 9 +++++----
> > > >  1 file changed, 5 insertions(+), 4 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > index 91cc43950162..b3569d44d510 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1669,7 +1669,9 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > >               tag_pages_for_writeback(mapping, ceph_wbc.index, ceph_wbc.end);
> > > >
> > > >       while (!has_writeback_done(&ceph_wbc)) {
> > > > -             ceph_wbc.locked_pages = 0;
> > > > +             BUG_ON(ceph_wbc.locked_pages);
> > > > +             BUG_ON(ceph_wbc.pages);
> > > > +
> > >
> >
> > Hi Slava,
> >
> > > It's not good idea to introduce BUG_ON() in write pages logic. I am definitely
> > > against these two BUG_ON() here.
> >
> > I share your distaste for BUG_ON() in writeback. However, the
> > BUG_ON(ceph_wbc.pages); already exists in ceph_allocate_page_array().
> > This patch is trying to catch that earlier, where it's easier to
> > troubleshoot. If I changed these to WARN_ON(), it would not prevent
> > the oops.
> >
> > And the writeback code has a lot of BUG_ON() checks already (so I am
> > not "introducing" BUG_ON), but I suppose you could be saying "it's
> > already a problem, please don't make it worse."
> >
>
> It looks really strange that you do at first:
>
> -             ceph_wbc.locked_pages = 0;
>
> and then you introduce two BUG_ON():
>
> +             BUG_ON(ceph_wbc.locked_pages);
> +             BUG_ON(ceph_wbc.pages);
>
> But what's the point? It looks more natural simply to make initialization here:
>
>               ceph_wbc.locked_pages = 0;
>               ceph_wbc.strip_unit_end = 0;
>
> What's wrong with it?

The problem is if that block runs at the top of the loop while
ceph_wbc.pages != NULL, the worker will oops in
ceph_allocate_page_array(). This is a particularly difficult oops to
diagnose. We should prevent it by carefully maintaining the loop's
invariants, but if prevention fails, the next best option is to oops
earlier, as close as possible to the actual bug.

>
> > If I introduce a ceph_discard_page_array() function (as discussed on
> > patch 4), I could call it at the top of the loop (to *ensure* a clean
> > state) instead of using BUG_ON() (to *assert* a clean state). I'd like
> > to hear from other reviewers which approach they'd prefer.
> >
> > >
> > > >               ceph_wbc.max_pages = ceph_wbc.wsize >> PAGE_SHIFT;
> > > >
> > > >  get_more_pages:
> > > > @@ -1703,11 +1705,10 @@ static int ceph_writepages_start(struct address_space *mapping,
> > > >               }
> > > >
> > > >               rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
> > > > -             if (rc)
> > > > -                     goto release_folios;
> > > > -
> > >
> > > Frankly speaking, its' completely not clear the from commit message why we move
> > > this check. What's the problem is here? How this move can fix the issue?
> >
> > The diff makes it a little unclear, but I'm actually moving
> > ceph_wbc.{locked_pages,strip_unit_end} = 0; *above* the check (see
> > commit message: "also make it the previous iteration's responsibility
> > to guarantee [locked_pages is] reset") so that they happen
> > unconditionally. Git just happens to see it in terms of the if/goto
> > moving downward, not the assignments moving up.
>
> I simply don't see any explanation why we are moving this check.

The check is not being moved; other lines are being moved above it,
and Git's diff algorithm made it look like the check moved.
This equivalent diff makes the actual change clearer:
         rc = ceph_submit_write(mapping, wbc, &ceph_wbc);
+        ceph_wbc.locked_pages = 0;
+        ceph_wbc.strip_unit_end = 0;
         if (rc)
             goto release_folios;
-
-        ceph_wbc.locked_pages = 0;
-        ceph_wbc.strip_unit_end = 0;

         if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
             ceph_wbc.nr_folios =

> And what this
> move is fixing. I believe it's really important to explain what this
> modification is fixing.

This is not a bugfix; it's purely code cleanup -- more of the
"defensive programming" that we both like to see.

Cheers,
Sam

>
> Thanks,
> Slava.
>
> >
> > Warm regards,
> > Sam
> >
> >
> > >
> > > Thanks,
> > > Slava.
> > >
> > > >               ceph_wbc.locked_pages = 0;
> > > >               ceph_wbc.strip_unit_end = 0;
> > > > +             if (rc)
> > > > +                     goto release_folios;
> > > >
> > > >               if (folio_batch_count(&ceph_wbc.fbatch) > 0) {
> > > >                       ceph_wbc.nr_folios =