linux-kernel - Re: [PATCH 00/13] netfs, cifs: Fixes to retry-related code

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <2724318.1752066097@warthog.procyon.org.uk>
Date: Wed, 09 Jul 2025 14:01:37 +0100
From: David Howells <dhowells@...hat.com>
To: Max Kellermann <max.kellermann@...os.com>
Cc: dhowells@...hat.com, Christian Brauner <christian@...uner.io>,
    Steve French <sfrench@...ba.org>, Paulo Alcantara <pc@...guebit.com>,
    netfs@...ts.linux.dev, linux-afs@...ts.infradead.org,
    linux-cifs@...r.kernel.org, linux-nfs@...r.kernel.org,
    ceph-devel@...r.kernel.org, v9fs@...ts.linux.dev,
    linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
    stable@...r.kernel.org
Subject: Re: [PATCH 00/13] netfs, cifs: Fixes to retry-related code

Max Kellermann <max.kellermann@...os.com> wrote:

> your commit 2b1424cd131c ("netfs: Fix wait/wake to be consistent about
> the waitqueue used") has given me serious headaches; it has caused
> outages in our web hosting clusters (yet again - all Linux versions
> since 6.9 had serious netfs regressions). Your patch was backported to
> 6.15 as commit 329ba1cb402a in 6.15.3 (why oh why??), and therefore
> the bugs it has caused will be "available" to all Linux stable users.
> 
> The problem we had is that writing to certain files never finishes. It
> looks like it has to do with the cachefiles subrequest never reporting
> completion. (We use Ceph with cachefiles)
> 
> I have tried applying the fixes in this pull request, which sounded
> promising, but the problem is still there. The only thing that helps
> is reverting 2b1424cd131c completely - everything is fine with 6.15.5
> plus the revert.
> 
> What do you need from me in order to analyze the bug?

As a start, can you turn on:

echo 65536 >/sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_read/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_rreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_sreq/enable
echo 1 > /sys/kernel/debug/tracing/events/netfs/netfs_failure/enable

If you keep an eye on /proc/fs/netfs/requests you should be able to see any
tasks in there that get stuck.  If one gets stuck, then:

echo 0 > /sys/kernel/debug/tracing/events/enable

to stop further tracing.

Looking in /proc/fs/netfs/requests, you should be able to see the debug ID of
the stuck request.  If you can try grepping the trace log for that:

grep "R=<8-digit-hex-id>" /sys/kernel/debug/tracing/trace

that should hopefully let me see how things progressed on that call.

David