linux-kernel - Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <de47c214bf6c85735db8e74d630e05b4e9bbf767.camel@kernel.org>
Date: Wed, 31 Dec 2025 13:08:41 -0500
From: Trond Myklebust <trondmy@...nel.org>
To: Wang Zhaolong <wangzhaolong@...weicloud.com>, anna@...nel.org, 
	kolga@...app.com
Cc: linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org, 
	yi.zhang@...wei.com, yangerkun@...wei.com, chengzhihao1@...wei.com, 
	lilingfeng3@...wei.com, zhangjian496@...wei.com
Subject: Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can
 deadlock state manager creation

On Tue, 2025-12-30 at 15:17 +0800, Wang Zhaolong wrote:
> Hi all,
> 
> I’d like to start an RFC discussion about a hung-task/deadlock that
> we hit in
> production-like testing on NFSv4.1 clients under server outage +
> memory
> pressure. The system becomes stuck even after the server/network is
> restored.
> 
> The scenario is:
> 
> - NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style
> workload)
> - server outage (restart/power-off) and/or network blackhole
> - client under significant memory pressure / reclaim activity
> (observed in the
>   traces below)
> 
> The observed behavior is a deadlock cycle involving:
> 
> - v4.1 session slot table “draining” (NFS4_SLOT_TBL_DRAINING)
> - state manager thread creation via kthread_run()
> - kthreadd entering direct reclaim and getting stuck in NFS
> commit/writeback paths
> - non-privileged RPC tasks sleeping on slot table waitq due to
> draining
> 
> Below is the call-chain I reconstructed from traces (three key
> participants):
> 
> P1: sunrpc worker 1 (error handling triggers session recovery and
> tries to startstate manager)
> rpc_exit_task
>   nfs_writeback_done
>     nfs4_write_done
>       nfs4_sequence_done
>         nfs41_sequence_process
>           // status error, goto session_recover
>           set_bit(NFS4_SLOT_TBL_DRAINING, &session-
> >fc_slot_table.slot_tbl_state)  <1>
>           nfs4_schedule_session_recovery
>             nfs4_schedule_state_manager
>               kthread_run  // - Create a state manager thread to
> release the draining slots
>                 kthread_create_on_node
>                   __kthread_create_on_node
>                     wait_for_completion(&done);   <2> wait for <3>
> 
> P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS
> folios and blocks in commit wait)
> kthreadd
>  kernel_thread
>   kernel_clone
>    copy_process
>     dup_task_struct
>      alloc_thread_stack_node
>       __vmalloc_node_range
>        __vmalloc_area_node
>         vm_area_alloc_pages
>          alloc_pages_bulk_array_mempolicy
>           __alloc_pages_bulk
>            __alloc_pages
>             __perform_reclaim
>              try_to_free_pages
>               do_try_to_free_pages
>                shrink_zones
>                 shrink_node
>                  shrink_node_memcgs
>                   shrink_lruvec
>                    shrink_inactive_list
>                     shrink_folio_list
>                      filemap_release_folio
>                       nfs_release_folio
>                        nfs_wb_folio
>                         folio PG_private !PG_writeback !PG_dirty
>                          nfs_commit_inode(inode, FLUSH_SYNC);
>                           __nfs_commit_inode
>                            nfs_generic_commit_list
>                             nfs_commit_list
>                              nfs_initiate_commit
>                               rpc_run_task  // Async task
>                            wait_on_commit            <3> wait for <4>
> 
> P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
> __rpc_execute
>   nfs4_setup_sequence
>     // if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto
> sleep
>     rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL);    <4>  blocked
> by <1>
> 
> This forms a deadlock:
> 
> - <1> enables draining; non-privileged requests then block at <4>
> - recovery path attempts to create the state manager thread, but
>   blocks at <2> waiting for kthreadd
> - kthreadd is blocked at <3> waiting for commit progress /
> completion,
>   but commit/RPC progress is impeded because requests are stuck
> behind draining at <4>
> - once in this state, restoring the server/network does not resolve
> the deadlock
> 
> I suspect this deadlock became possible after the following mainline
> change that
> freezes the session table immediately on NFS4ERR_BADSESSION (and
> similar error paths):
> 
> c907e72f58ed ("NFSv4.1: freeze the session table upon receiving
> NFS4ERR_BADSESSION")
> 
> It sets NFS4_SLOT_TBL_DRAINING before the recovery thread runs:
> 
> Questions:
> 
> 1. Has anyone else observed a similar deadlock involving slot table
> draining + memory
>    reclaim? It looks like a similar issue might have been reported
> before — see
>    SUSE Bugzilla #1211527. [1]
> 2. Is it intended that kthreadd (or other critical kernel threads)
> may block in
>    nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
> 3. Is there an established way to ensure recovery threads can always
> be created even
>    under severe memory pressure (e.g., reserve resources, GFP flags,
> or moving state
>    manager creation out of contexts that can trigger reclaim)?
> 
> I wrote a local patch purely as a discussion starter. I realize this
> approach is likely
> not the right solution upstream; I’m sharing it only to help reason
> about where the
> cycle can be broken. I can post the patch if people think it’s useful
> for the discussion.

I think we need to treat nfs_release_folio() as being different from
nfs_wb_folio() altogether. There are too many different ways in which
waiting on I/O can cause deadlocks, including waiting on the writeback
to complete. My suggestion is therefore that we just make this simple,
and see it as a hint that we should kick off either writeback or a
commit, but not that we should be waiting for it to complete.
So how about the following?

8<------------------------------------------------------------------
>From 3533434037066b610d50e7bd36f3525ace296928 Mon Sep 17 00:00:00 2001
Message-ID: <3533434037066b610d50e7bd36f3525ace296928.1767204181.git.trond.myklebust@...merspace.com>
From: Trond Myklebust <trond.myklebust@...merspace.com>
Date: Wed, 31 Dec 2025 11:42:31 -0500
Subject: [PATCH] NFS: Fix a deadlock involving nfs_release_folio()

Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
waiting on kthreadd, which is attempting to reclaim memory by calling
nfs_release_folio(). The latter cannot make progress due to state
recovery being needed.

It seems that the only safe thing to do here is to kick off a writeback
of the folio, without waiting for completion, or else kicking off an
asynchronous commit.

Reported-by: Wang Zhaolong <wangzhaolong@...weicloud.com>
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Trond Myklebust <trond.myklebust@...merspace.com>
---
 fs/nfs/file.c          |  3 ++-
 fs/nfs/nfstrace.h      |  3 +++
 fs/nfs/write.c         | 32 ++++++++++++++++++++++++++++++++
 include/linux/nfs_fs.h |  1 +
 4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..d1c138a416cf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -511,7 +511,8 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
 		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
 		    current_is_kswapd() || current_is_kcompactd())
 			return false;
-		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+		if (nfs_wb_folio_reclaim(folio->mapping->host, folio) < 0 ||
+		    folio_test_private(folio))
 			return false;
 	}
 	return nfs_fscache_release_folio(folio, gfp);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 6ce55e8e6b67..9f9ce4a565ea 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1062,6 +1062,9 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
 DEFINE_NFS_FOLIO_EVENT(nfs_aop_readpage);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_aop_readpage_done);
 
+DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio_reclaim);
+DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_reclaim_done);
+
 DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
 DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
 
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..5de9ec6c72a2 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2024,6 +2024,38 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
 	return ret;
 }
 
+/**
+ * nfs_wb_folio_reclaim - Write back all requests on one page
+ * @inode: pointer to page
+ * @folio: pointer to folio
+ *
+ * Assumes that the folio has been locked by the caller
+ */
+int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio)
+{
+	loff_t range_start = folio_pos(folio);
+	size_t len = folio_size(folio);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_ALL,
+		.nr_to_write = 0,
+		.range_start = range_start,
+		.range_end = range_start + len - 1,
+		.for_sync = 1,
+	};
+	int ret;
+
+	if (folio_test_writeback(folio))
+		return -EBUSY;
+	if (folio_clear_dirty_for_io(folio)) {
+		trace_nfs_writeback_folio_reclaim(inode, range_start, len);
+		ret = nfs_writepage_locked(folio, &wbc);
+		trace_nfs_writeback_folio_reclaim_done(inode, range_start, len,
+						       ret);
+	} else
+		nfs_commit_inode(inode, 0);
+	return ret;
+}
+
 /**
  * nfs_wb_folio - Write back all requests on one page
  * @inode: pointer to page
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..8dd79a3f3d66 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -637,6 +637,7 @@ extern int  nfs_update_folio(struct file *file, struct folio *folio,
 extern int nfs_sync_inode(struct inode *inode);
 extern int nfs_wb_all(struct inode *inode);
 extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio);
 int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
 extern int  nfs_commit_inode(struct inode *, int);
 extern struct nfs_commit_data *nfs_commitdata_alloc(void);
-- 
2.52.0