[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <c2b5eafc60e753cba2f7ffe88941f10d65cefa64.camel@ibm.com>
Date: Tue, 26 Aug 2025 22:33:46 +0000
From: Viacheslav Dubeyko <Slava.Dubeyko@....com>
To: "brauner@...nel.org" <brauner@...nel.org>,
Patrick Donnelly
<pdonnell@...hat.com>,
"linux-fsdevel@...r.kernel.org"
<linux-fsdevel@...r.kernel.org>,
"ceph-devel@...r.kernel.org"
<ceph-devel@...r.kernel.org>,
"linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>,
"slava@...eyko.com" <slava@...eyko.com>,
David Howells <dhowells@...hat.com>,
Alex Markuze <amarkuze@...hat.com>,
"willy@...radead.org" <willy@...radead.org>,
"idryomov@...il.com"
<idryomov@...il.com>
Subject: RE: [RFC PATCH 3/4] ceph: introduce ceph_submit_write() method
On Wed, 2025-08-27 at 00:06 +0200, Max Kellermann wrote:
> On 2025/02/14 22:11, Matthew Wilcox <willy@...radead.org> wrote:
> > On Tue, Feb 04, 2025 at 04:02:48PM -0800, Viacheslav Dubeyko wrote:
> > > This patch implements refactoring of ceph_submit_write()
> > > and also it solves the second issue.
> > >
> > > Signed-off-by: Viacheslav Dubeyko <Slava.Dubeyko@....com>
> >
> > This kind of giant refactoring to solve a bug is a really bad idea.
> > First, it's going to need to be backported to older kernels. How far
> > back? You need to identify that with a Fixes: line.
> >
> > It's also really hard to review and know whether it's right. You might
> > have introduced several new bugs while doing it. In general, bugfixes
> > first, refactor later. I *think* this means we can do without 1/7 of
> > the patches I resent earlier today, but it's really hard to be sure.
>
> I'm very disappointed that nobody has listened to Matthew's complaint.
> Viacheslav has only hand-waved it away with a lazy non-argument.
>
> From Documentation/process/submitting-patches.rst:
>
> "you should not modify the moved code at all in the same patch which
> moves it"
>
> Obviously, this patch set violates this rule. There are lots of
> semantic/behavior changes in the patches that move code around.
>
> In the end, Christian Brauner has merged this into Linux 6.15 and that
> merge has wreaked havoc in our production clusters. We have been
> testing 6.15 for a month with no problems (after David Howells had
> fixed yet-another netfs regression that was stable-backported to 6.15,
> ugh!), but when we updated a production clusters, all servers had
> crashed after a few hours, and our ops team had a very bad night.
>
> This patch set is obviously bad. It pretends to fix a bug, but really
> rewrites almost everything in two patches documented as "introduce XY
> method" with no real explanation for why Viacheslav has decided to do
> it, instead of just fixing the bug (as Matthew asked him to).
>
> Look at this line modified by patch "ceph: extend ceph_writeback_ctl
> for ceph_writepages_start() refactoring":
>
> > + ceph_wbc.fbatch.folios[i] = NULL;
>
> This sets a folio_batch element to NULL, which will, of course, crash
> in folios_put_refs() (but only if the global huge zero page has
> already been created). Fortunately, there's code that removes all
> NULL elements from the folio_batch array. That is code that already
> existed before Viacheslav's patch set (code which I already dislike
> because it's a fragile mess that is just waiting to crash), and the
> code was only being moved around.
>
> Did I mention that I think this is a fragile mess? Fast-forward to
> Viacheslav's patch "ceph: introduce ceph_process_folio_batch() method"
> which moves the NULL-setting loop to ceph_process_folio_batch(). Look
> at this (untouched) piece of code after the ceph_process_folio_batch()
> call:
>
> > if (i) {
> > unsigned j, n = 0;
> > /* shift unused page to beginning of fbatch */
>
> Shifting only happens if at least one folio has been processed ("i"
> was incremented). But does it really happen?
>
> No, the loop was moved to ceph_process_folio_batch(), and nobody ever
> increments "i" anymore. Voila, crash due to NULL pointer dereference:
>
> BUG: kernel NULL pointer dereference, address: 0000000000000034
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 0 P4D 0
> Oops: Oops: 0002 [#1] SMP NOPTI
> CPU: 172 UID: 0 PID: 2342707 Comm: kworker/u778:8 Not tainted 6.15.10-cm4all1-es #714 NONE
> Hardware name: Dell Inc. PowerEdge R7615/0G9DHV, BIOS 1.6.10 12/08/2023
> Workqueue: writeback wb_workfn (flush-ceph-1)
> RIP: 0010:folios_put_refs+0x85/0x140
> Code: 83 c5 01 39 e8 7e 76 48 63 c5 49 8b 5c c4 08 b8 01 00 00 00 4d 85 ed 74 05 41 8b 44 ad 00 48 8b 15 b0 >
> RSP: 0018:ffffb880af8db778 EFLAGS: 00010207
> RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000003
> RDX: ffffe377cc3b0000 RSI: 0000000000000000 RDI: ffffb880af8db8c0
> RBP: 0000000000000000 R08: 000000000000007d R09: 000000000102b86f
> R10: 0000000000000001 R11: 00000000000000ac R12: ffffb880af8db8c0
> R13: 0000000000000000 R14: 0000000000000000 R15: ffff9bd262c97000
> FS: 0000000000000000(0000) GS:ffff9c8efc303000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000034 CR3: 0000000160958004 CR4: 0000000000770ef0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ceph_writepages_start+0xeb9/0x1410
>
> Viacheslav's third patch "ceph: introduce ceph_submit_write() method"
> messes up the logic a bit more and makes it more fragile by hiding the
> shifting code behind more conditions:
>
> - if ceph_process_folio_batch() fails, shifting never happens
>
> - if ceph_move_dirty_page_in_page_array() was never called (because
> ceph_process_folio_batch() has returned early for some of various
> reasons), shifting never happens
>
> - if processed_in_fbatch is zero (because ceph_process_folio_batch()
> has returned early for some of the reasons mentioned above or
> because ceph_move_dirty_page_in_page_array() has failed), shifting
> never happens
>
> If shifting doesn't happen, then the kernel crashes (unless
> huge-zero-page doesn't exist, see above). Obviously, nobody has ever
> looked closely enough at the code. I'm still new to Linux memory
> management and file systems, but these problems were obvious when I
> first saw this patch set (which was my candidate for the other 6.15
> crashes which then turned out to be netfs regressions, not Ceph).
>
> This whole patch set is a huge mess and has caused my team a good
> amount of pain. This could and should have been avoided, had only
> somebody listened to Matthew.
>
> (Also look at all those "checkpatch.pl" complaints on all patches in
> this patch set. There are many coding style violations.)
>
> Can we please revert the whole patch set? I don't think it's possible
> to fix all the weird undocumented side effects that may cause more
> crashes once we fix this one. Refactoring the Ceph code sure is
> necessary, it's not in a good shape, but it should be done more
> carefully. Some people (like me) depend on Ceph's stability, and this
> mess is doing a disservice to Ceph's reputation.
Of course, we can revert any patch. This patchset has been sent not with the
goal of pure refactoring but it fixes several bugs. Reverting means returning
these bugs back. This patchset was available for review for a long time. So,
anybody was able to review it and to share any comments. I am always happy to
rework and to make any patch more better. From my point of view, reverting is
not answer and it makes sense to continue fix bugs and to make CephFS code more
stable.
Thanks,
Slava.
Powered by blists - more mailing lists