Message-ID: <de47c214bf6c85735db8e74d630e05b4e9bbf767.camel@kernel.org>
Date: Wed, 31 Dec 2025 13:08:41 -0500
From: Trond Myklebust <trondmy@...nel.org>
To: Wang Zhaolong <wangzhaolong@...weicloud.com>, anna@...nel.org,
kolga@...app.com
Cc: linux-nfs@...r.kernel.org, linux-kernel@...r.kernel.org,
yi.zhang@...wei.com, yangerkun@...wei.com, chengzhihao1@...wei.com,
lilingfeng3@...wei.com, zhangjian496@...wei.com
Subject: Re: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can
deadlock state manager creation

On Tue, 2025-12-30 at 15:17 +0800, Wang Zhaolong wrote:
> Hi all,
>
> I’d like to start an RFC discussion about a hung-task/deadlock that we
> hit in production-like testing on NFSv4.1 clients under server outage +
> memory pressure. The system becomes stuck even after the server/network
> is restored.
>
> The scenario is:
>
> - NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style
>   workload)
> - server outage (restart/power-off) and/or network blackhole
> - client under significant memory pressure / reclaim activity (observed
>   in the traces below)
>
> The observed behavior is a deadlock cycle involving:
>
> - v4.1 session slot table “draining” (NFS4_SLOT_TBL_DRAINING)
> - state manager thread creation via kthread_run()
> - kthreadd entering direct reclaim and getting stuck in NFS
>   commit/writeback paths
> - non-privileged RPC tasks sleeping on the slot table waitq due to
>   draining
>
> Below is the call-chain I reconstructed from traces (three key
> participants):
>
> P1: sunrpc worker 1 (error handling triggers session recovery and tries
>     to start the state manager)
> rpc_exit_task
> nfs_writeback_done
> nfs4_write_done
> nfs4_sequence_done
> nfs41_sequence_process
> // status error, goto session_recover
> set_bit(NFS4_SLOT_TBL_DRAINING, &session->fc_slot_table.slot_tbl_state) <1>
> nfs4_schedule_session_recovery
> nfs4_schedule_state_manager
> kthread_run // Create a state manager thread to release the draining slots
> kthread_create_on_node
> __kthread_create_on_node
> wait_for_completion(&done); <2> wait for <3>
>
> P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS folios
>     and blocks in commit wait)
> kthreadd
> kernel_thread
> kernel_clone
> copy_process
> dup_task_struct
> alloc_thread_stack_node
> __vmalloc_node_range
> __vmalloc_area_node
> vm_area_alloc_pages
> alloc_pages_bulk_array_mempolicy
> __alloc_pages_bulk
> __alloc_pages
> __perform_reclaim
> try_to_free_pages
> do_try_to_free_pages
> shrink_zones
> shrink_node
> shrink_node_memcgs
> shrink_lruvec
> shrink_inactive_list
> shrink_folio_list
> filemap_release_folio
> nfs_release_folio
> nfs_wb_folio
> // folio: PG_private set, !PG_writeback, !PG_dirty
> nfs_commit_inode(inode, FLUSH_SYNC);
> __nfs_commit_inode
> nfs_generic_commit_list
> nfs_commit_list
> nfs_initiate_commit
> rpc_run_task // Async task
> wait_on_commit <3> wait for <4>
>
> P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
> __rpc_execute
> nfs4_setup_sequence
> // if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto sleep
> rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL); <4> blocked by <1>
>
> This forms a deadlock:
>
> - <1> enables draining; non-privileged requests then block at <4>
>   (see the sketch below)
> - recovery path attempts to create the state manager thread, but blocks
>   at <2> waiting for kthreadd
> - kthreadd is blocked at <3> waiting for commit progress / completion,
>   but commit/RPC progress is impeded because requests are stuck behind
>   draining at <4>
> - once in this state, restoring the server/network does not resolve the
>   deadlock
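>
> For reference, the gate that parks the tasks at <4> is the check quoted
> in P3 above; roughly (a paraphrase, not the exact upstream code):
>
>     if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) {
>             /* parked until the state manager clears NFS4_SLOT_TBL_DRAINING */
>             rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL);
>             return -EAGAIN;
>     }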
>
> I suspect this deadlock became possible after the following mainline
> change that freezes the session table immediately on NFS4ERR_BADSESSION
> (and similar error paths):
>
>   c907e72f58ed ("NFSv4.1: freeze the session table upon receiving
>   NFS4ERR_BADSESSION")
>
> It sets NFS4_SLOT_TBL_DRAINING before the recovery thread runs.
>
> Questions:
>
> 1. Has anyone else observed a similar deadlock involving slot table
>    draining + memory reclaim? It looks like a similar issue might have
>    been reported before — see SUSE Bugzilla #1211527. [1]
> 2. Is it intended that kthreadd (or other critical kernel threads) may
>    block in nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
> 3. Is there an established way to ensure recovery threads can always be
>    created even under severe memory pressure (e.g., reserve resources,
>    GFP flags, or moving state manager creation out of contexts that can
>    trigger reclaim)?
>
> I wrote a local patch purely as a discussion starter. I realize this
> approach is likely not the right solution upstream; I’m sharing it only
> to help reason about where the cycle can be broken. I can post the
> patch if people think it’s useful for the discussion.

I think we need to treat nfs_release_folio() as being different from
nfs_wb_folio() altogether. There are too many different ways in which
waiting on I/O can cause deadlocks, including waiting on the writeback
to complete. My suggestion is therefore that we just make this simple,
and see it as a hint that we should kick off either writeback or a
commit, but not that we should be waiting for it to complete.
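
To spell out which waits I mean: today nfs_wb_folio() does roughly the
following (simplified, not the exact code; wbc is a per-folio
WB_SYNC_ALL writeback_control much like in the helper below):

	for (;;) {
		folio_wait_writeback(folio);	/* wait on writeback completion */
		if (folio_clear_dirty_for_io(folio)) {
			nfs_writepage_locked(folio, &wbc);
			continue;
		}
		if (!folio_test_private(folio))
			break;
		nfs_commit_inode(inode, FLUSH_SYNC);	/* waits in wait_on_commit() */
	}

Both of those waits can block indefinitely while the slot table is
draining. The helper below instead returns -EBUSY if writeback is
already in flight, otherwise starts writeback of the folio or an
asynchronous commit, and returns without waiting.
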
So how about the following?
8<------------------------------------------------------------------
From 3533434037066b610d50e7bd36f3525ace296928 Mon Sep 17 00:00:00 2001
Message-ID: <3533434037066b610d50e7bd36f3525ace296928.1767204181.git.trond.myklebust@...merspace.com>
From: Trond Myklebust <trond.myklebust@...merspace.com>
Date: Wed, 31 Dec 2025 11:42:31 -0500
Subject: [PATCH] NFS: Fix a deadlock involving nfs_release_folio()

Wang Zhaolong reports a deadlock involving NFSv4.1 state recovery
waiting on kthreadd, which is attempting to reclaim memory by calling
nfs_release_folio(). The latter cannot make progress due to state
recovery being needed.

It seems that the only safe thing to do here is to kick off writeback
of the folio, or else an asynchronous commit, without waiting for
completion in either case.

Reported-by: Wang Zhaolong <wangzhaolong@...weicloud.com>
Fixes: 96780ca55e3c ("NFS: fix up nfs_release_folio() to try to release the page")
Signed-off-by: Trond Myklebust <trond.myklebust@...merspace.com>
---
 fs/nfs/file.c          |  3 ++-
 fs/nfs/nfstrace.h      |  3 +++
 fs/nfs/write.c         | 32 ++++++++++++++++++++++++++++++++
 include/linux/nfs_fs.h |  1 +
 4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..d1c138a416cf 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -511,7 +511,8 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
 		if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
 		    current_is_kswapd() || current_is_kcompactd())
 			return false;
-		if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+		if (nfs_wb_folio_reclaim(folio->mapping->host, folio) < 0 ||
+		    folio_test_private(folio))
 			return false;
 	}
 	return nfs_fscache_release_folio(folio, gfp);
diff --git a/fs/nfs/nfstrace.h b/fs/nfs/nfstrace.h
index 6ce55e8e6b67..9f9ce4a565ea 100644
--- a/fs/nfs/nfstrace.h
+++ b/fs/nfs/nfstrace.h
@@ -1062,6 +1062,9 @@ DECLARE_EVENT_CLASS(nfs_folio_event_done,
DEFINE_NFS_FOLIO_EVENT(nfs_aop_readpage);
DEFINE_NFS_FOLIO_EVENT_DONE(nfs_aop_readpage_done);
+DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio_reclaim);
+DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_reclaim_done);
+
DEFINE_NFS_FOLIO_EVENT(nfs_writeback_folio);
DEFINE_NFS_FOLIO_EVENT_DONE(nfs_writeback_folio_done);
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..5de9ec6c72a2 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -2024,6 +2024,38 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
return ret;
}
+/**
+ * nfs_wb_folio_reclaim - Start write back of the requests on one folio
+ * @inode: pointer to the inode
+ * @folio: pointer to folio
+ *
+ * Does not wait for completion. Assumes the folio is locked by the caller.
+ */
+int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio)
+{
+	loff_t range_start = folio_pos(folio);
+	size_t len = folio_size(folio);
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_ALL,
+		.nr_to_write = 0,
+		.range_start = range_start,
+		.range_end = range_start + len - 1,
+		.for_sync = 1,
+	};
+	int ret = 0;
+
+	if (folio_test_writeback(folio))
+		return -EBUSY;
+	if (folio_clear_dirty_for_io(folio)) {
+		trace_nfs_writeback_folio_reclaim(inode, range_start, len);
+		ret = nfs_writepage_locked(folio, &wbc);
+		trace_nfs_writeback_folio_reclaim_done(inode, range_start, len,
+						       ret);
+	} else
+		nfs_commit_inode(inode, 0);
+	return ret;
+}
+
/**
* nfs_wb_folio - Write back all requests on one page
* @inode: pointer to page
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..8dd79a3f3d66 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -637,6 +637,7 @@ extern int nfs_update_folio(struct file *file, struct folio *folio,
extern int nfs_sync_inode(struct inode *inode);
extern int nfs_wb_all(struct inode *inode);
extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio_reclaim(struct inode *inode, struct folio *folio);
int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
extern int nfs_commit_inode(struct inode *, int);
extern struct nfs_commit_data *nfs_commitdata_alloc(void);
--
2.52.0