Message-ID: <2724318.1752066097@warthog.procyon.org.uk>
Date: Wed, 09 Jul 2025 14:01:37 +0100
From: David Howells <dhowells@...hat.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: dhowells@...hat.com, Christian Brauner <christian@...uner.io>,
Steve French <sfrench@...ba.org>, Paulo Alcantara <pc@...guebit.com>,
netfs@...ts.linux.dev, linux-afs@...ts.infradead.org,
linux-cifs@...r.kernel.org, linux-nfs@...r.kernel.org,
ceph-devel@...r.kernel.org, v9fs@...ts.linux.dev,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH 00/13] netfs, cifs: Fixes to retry-related code
Max Kellermann <max.kellermann@...os.com> wrote:
> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about
> the waitqueue used") has given me serious headaches; it has caused
> outages in our web hosting clusters (yet again - all Linux versions
> since 6.9 had serious netfs regressions). Your patch was backported to
> 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore
> the bugs it has caused will be "available" to all Linux stable users.
>
> The problem we had is that writing to certain files never finishes. It
> looks like it has to do with the cachefiles subrequest never reporting
> completion. (We use Ceph with cachefiles)
>
> I have tried applying the fixes in this pull request, which sounded
> promising, but the problem is still there. The only thing that helps
> is reverting 2b1424cd131c completely - everything is fine with 6.15.5
> plus the revert.
>
> What do you need from me in order to analyze the bug?
As a start, can you turn on:
	echo 65536 > /sys/kernel/debug/tracing/buffer_size_kb
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
	echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable
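If this has to be repeated on several machines, the same setup could be wrapped in a small script. This is only a sketch: the TRACING variable and the enable_netfs_tracing function name are made up for illustration, and it assumes debugfs is mounted in the usual place and that the function is run as root.

```shell
#!/bin/sh
# Sketch only: TRACING and enable_netfs_tracing are illustrative names,
# not part of any kernel tooling.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

enable_netfs_tracing() {
    # Enlarge the trace buffer first (the value is in KiB).
    echo 65536 > "$TRACING/buffer_size_kb" || return 1

    # Turn on the four netfs tracepoints listed above.
    for ev in netfs_read netfs_rreq netfs_sreq netfs_failure; do
        echo 1 > "$TRACING/events/netfs/$ev/enable" || return 1
    done
}
```

Calling enable_netfs_tracing after sourcing the file performs the same writes as the five echo commands.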
If you keep an eye on /proc/fs/netfs/requests you should be able to see any
tasks in there that get stuck. If one gets stuck, then:
	echo 0 > /sys/kernel/debug/tracing/events/enable
to stop further tracing.
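To keep an eye on /proc/fs/netfs/requests without watching a terminal, one option is to append timestamped copies of it to a log file and compare them later. A minimal sketch, where the REQUESTS variable and the snapshot_requests function name are invented for illustration:

```shell
#!/bin/sh
# Sketch only: REQUESTS and snapshot_requests are illustrative names.
REQUESTS="${REQUESTS:-/proc/fs/netfs/requests}"

# Append one timestamped copy of the request list to the file given as $1.
snapshot_requests() {
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        cat "$REQUESTS"
    } >> "$1"
}
```

For example, "while sleep 10; do snapshot_requests /tmp/netfs-requests.log; done" takes a snapshot every ten seconds. A request that keeps showing up in successive snapshots is a candidate for being stuck.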
Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of
the stuck request. If you can, try grepping the trace log for that:
	grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace
that should hopefully let me see how things progressed on that call.
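The last two steps (stopping the trace and pulling out the lines for one request) could be scripted along these lines so the result ends up in a file that can be attached to a reply. Again only a sketch: TRACING, stop_tracing and save_request_trace are made-up names, and the debug ID has to be read from /proc/fs/netfs/requests by hand.

```shell
#!/bin/sh
# Sketch only: TRACING, stop_tracing and save_request_trace are
# illustrative names.
TRACING="${TRACING:-/sys/kernel/debug/tracing}"

# Disable all trace events so the buffer is not overwritten.
stop_tracing() {
    echo 0 > "$TRACING/events/enable"
}

# $1: debug ID of the stuck request (8 hex digits)
# $2: file to write the matching trace lines to
save_request_trace() {
    grep "R=$1" "$TRACING/trace" > "$2"
}
```

Usage would be "stop_tracing" followed by "save_request_trace <8-digit-hex-id> /tmp/netfs-stuck.txt". Note that grep exits with a non-zero status if no line matches, so an empty output file means the ID was not found in the trace.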
David