Message-ID: <CAKPOu+_5YjJnm_4KVehELsWLRWpET-pPWo4VH1GFK_xtgd2uqw@mail.gmail.com>
Date: Sun, 28 Jul 2024 15:17:20 +0200
From: Max Kellermann <max.kellermann@...os.com>
To: Jeff Layton <jlayton@...nel.org>
Cc: David Howells <dhowells@...hat.com>, netfs@...ts.linux.dev,
linux-kernel@...r.kernel.org, ceph-devel@...r.kernel.org,
Xiubo Li <xiubli@...hat.com>
Subject: Re: RCU stalls and GPFs in ceph/netfs
On Sun, Jul 28, 2024 at 1:45 PM Jeff Layton <jlayton@...nel.org> wrote:
> That is really weird. AFAICT, 2e9d7e4b984a61 is just removing some
> wrapper functions and changing the names of some others. There should
> be no functional changes there.
Exactly what I thought; I could not imagine how this commit could
cause such a bug. The only candidate I could find is that
netfs_rreq_assess() now always calls netfs_rreq_completed() directly,
but not netfs_rreq_write_to_cache(). I don't know what that implies,
but this different code path might behave differently. Maybe it's an
old bug that was only revealed by this change.
Anyway, I spent hours testing this commit and the one preceding it,
and the picture was consistent: this commit reproduces the RCU stall
within minutes (though only in about 50% of boots), and the previous
commit never did. There is still a tiny chance that I just wasn't
trying hard enough. I'm out of ideas, and all I can do now is start
digging really deeply into this code, but I thought it would be more
productive to reach out to the people who wrote it.
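A rough way to quantify that remaining chance, assuming each boot of an
affected kernel hits the stall independently with probability about 0.5:
the odds that n clean boots in a row are a false negative are 0.5^n. A
minimal sketch follows; the function name and the boot counts are
hypothetical, not measured values.

```python
def false_negative_probability(repro_rate: float, clean_boots: int) -> float:
    """Chance that an affected kernel shows no stall in `clean_boots` boots.

    Assumes boots are independent and each one reproduces the stall
    with probability `repro_rate` (about 0.5 per the estimate above).
    """
    if not 0.0 <= repro_rate <= 1.0:
        raise ValueError("repro_rate must be between 0 and 1")
    if clean_boots < 0:
        raise ValueError("clean_boots must be non-negative")
    return (1.0 - repro_rate) ** clean_boots


if __name__ == "__main__":
    # Hypothetical boot counts, for illustration only.
    for n in (5, 10, 20):
        p = false_negative_probability(0.5, n)
        print(f"{n} clean boots: false-negative chance {p:.2e}")
```

If the boots are not independent (for example, the stall depends on
cache state left over from an earlier boot), this estimate is too
optimistic.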
Max