Message-Id: <20251230071744.9762-1-wangzhaolong@huaweicloud.com>
Date: Tue, 30 Dec 2025 15:17:44 +0800
From: Wang Zhaolong <wangzhaolong@...weicloud.com>
To: trondmy@...nel.org,
anna@...nel.org,
kolga@...app.com
Cc: linux-nfs@...r.kernel.org,
linux-kernel@...r.kernel.org,
yi.zhang@...wei.com,
yangerkun@...wei.com,
chengzhihao1@...wei.com,
lilingfeng3@...wei.com,
zhangjian496@...wei.com,
wangzhaolong@...weicloud.com
Subject: [PATCH] [RFC] NFSv4.1: slot table draining + memory reclaim can deadlock state manager creation
Hi all,
I’d like to start an RFC discussion about a hung-task/deadlock that we hit in
production-like testing on NFSv4.1 clients under server outage + memory
pressure. The client remains stuck even after the server/network is restored.
The scenario is:
- NFSv4.1 client running heavy multi-threaded buffered I/O (fio-style workload)
- server outage (restart/power-off) and/or network blackhole
- client under significant memory pressure / reclaim activity (observed in the
traces below)
The observed behavior is a deadlock cycle involving:
- v4.1 session slot table “draining” (NFS4_SLOT_TBL_DRAINING)
- state manager thread creation via kthread_run()
- kthreadd entering direct reclaim and getting stuck in NFS commit/writeback paths
- non-privileged RPC tasks sleeping on slot table waitq due to draining
Below is the call-chain I reconstructed from traces (three key participants):
P1: sunrpc worker 1 (error handling triggers session recovery and tries to start the state manager)
rpc_exit_task
nfs_writeback_done
nfs4_write_done
nfs4_sequence_done
nfs41_sequence_process
// status error, goto session_recover
set_bit(NFS4_SLOT_TBL_DRAINING, &session->fc_slot_table.slot_tbl_state) <1>
nfs4_schedule_session_recovery
nfs4_schedule_state_manager
kthread_run // - Create a state manager thread to release the draining slots
kthread_create_on_node
__kthread_create_on_node
wait_for_completion(&done); <2> wait for <3>
P2: kthreadd (thread creation triggers reclaim; reclaim hits NFS folios and blocks in commit wait)
kthreadd
kernel_thread
kernel_clone
copy_process
dup_task_struct
alloc_thread_stack_node
__vmalloc_node_range
__vmalloc_area_node
vm_area_alloc_pages
alloc_pages_bulk_array_mempolicy
__alloc_pages_bulk
__alloc_pages
__perform_reclaim
try_to_free_pages
do_try_to_free_pages
shrink_zones
shrink_node
shrink_node_memcgs
shrink_lruvec
shrink_inactive_list
shrink_folio_list
filemap_release_folio
nfs_release_folio
nfs_wb_folio
folio PG_private !PG_writeback !PG_dirty
nfs_commit_inode(inode, FLUSH_SYNC);
__nfs_commit_inode
nfs_generic_commit_list
nfs_commit_list
nfs_initiate_commit
rpc_run_task // Async task
wait_on_commit <3> wait for <4>
P3: sunrpc worker 2 (non-privileged tasks are blocked by draining)
__rpc_execute
nfs4_setup_sequence
// if (nfs4_slot_tbl_draining(tbl) && !args->sa_privileged) goto sleep
rpc_sleep_on(&tbl->slot_tbl_waitq, task, NULL); <4> blocked by <1>
This forms a deadlock:
- <1> enables draining; non-privileged requests then block at <4>
- recovery path attempts to create the state manager thread, but
blocks at <2> waiting for kthreadd
- kthreadd is blocked at <3> waiting for commit progress / completion,
but commit/RPC progress is impeded because requests are stuck behind draining at <4>
- once in this state, restoring the server/network does not resolve the deadlock
I suspect this deadlock became possible after the following mainline change that
freezes the session table immediately on NFS4ERR_BADSESSION (and similar error paths):
c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION")
It sets NFS4_SLOT_TBL_DRAINING in the sequence-done path, before the state manager thread has even been created.
Questions:
1. Has anyone else observed a similar deadlock involving slot table draining + memory
reclaim? It looks like a similar issue may have been reported before; see
SUSE Bugzilla #1211527 [1].
2. Is it intended that kthreadd (or other critical kernel threads) may block in
nfs_commit_inode(FLUSH_SYNC) as part of reclaim?
3. Is there an established way to ensure recovery threads can always be created even
under severe memory pressure (e.g., reserve resources, GFP flags, or moving state
manager creation out of contexts that can trigger reclaim)?
I wrote a local patch purely as a discussion starter; it is included below. I realize
this approach is likely not the right solution upstream; I'm sharing it only to help
reason about where the cycle can be broken.
Link: https://access.redhat.com/solutions/7016754 [1]
Fixes: c907e72f58ed ("NFSv4.1: freeze the session table upon receiving NFS4ERR_BADSESSION")
Signed-off-by: Wang Zhaolong <wangzhaolong@...weicloud.com>
---
fs/nfs/file.c | 6 +++---
fs/nfs/write.c | 10 +++++-----
include/linux/nfs_fs.h | 2 +-
3 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/nfs/file.c b/fs/nfs/file.c
index d020aab40c64..e556a16ce95b 100644
--- a/fs/nfs/file.c
+++ b/fs/nfs/file.c
@@ -487,11 +487,11 @@ static void nfs_invalidate_folio(struct folio *folio, size_t offset,
dfprintk(PAGECACHE, "NFS: invalidate_folio(%lu, %zu, %zu)\n",
folio->index, offset, length);
/* Cancel any unstarted writes on this page */
if (offset != 0 || length < folio_size(folio))
- nfs_wb_folio(inode, folio);
+ nfs_wb_folio(inode, folio, true);
else
nfs_wb_folio_cancel(inode, folio);
folio_wait_private_2(folio); /* [DEPRECATED] */
trace_nfs_invalidate_folio(inode, folio_pos(folio) + offset, length);
}
@@ -509,11 +509,11 @@ static bool nfs_release_folio(struct folio *folio, gfp_t gfp)
/* If the private flag is set, then the folio is not freeable */
if (folio_test_private(folio)) {
if ((current_gfp_context(gfp) & GFP_KERNEL) != GFP_KERNEL ||
current_is_kswapd() || current_is_kcompactd())
return false;
- if (nfs_wb_folio(folio->mapping->host, folio) < 0)
+ if (nfs_wb_folio(folio->mapping->host, folio, false) < 0)
return false;
}
return nfs_fscache_release_folio(folio, gfp);
}
@@ -558,11 +558,11 @@ static int nfs_launder_folio(struct folio *folio)
dfprintk(PAGECACHE, "NFS: launder_folio(%ld, %llu)\n",
inode->i_ino, folio_pos(folio));
folio_wait_private_2(folio); /* [DEPRECATED] */
- ret = nfs_wb_folio(inode, folio);
+ ret = nfs_wb_folio(inode, folio, true);
trace_nfs_launder_folio_done(inode, folio_pos(folio),
folio_size(folio), ret);
return ret;
}
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 336c510f3750..bc541a192197 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1059,11 +1059,11 @@ static struct nfs_page *nfs_try_to_update_request(struct folio *folio,
* nfs_lock_and_join_requests() cannot preserve
* commit flags, so we have to replay the write.
*/
nfs_mark_request_dirty(req);
nfs_unlock_and_release_request(req);
- error = nfs_wb_folio(folio->mapping->host, folio);
+ error = nfs_wb_folio(folio->mapping->host, folio, true);
trace_nfs_try_to_update_request_done(folio_inode(folio), offset, bytes, error);
return (error < 0) ? ERR_PTR(error) : NULL;
}
/*
@@ -1137,11 +1137,11 @@ int nfs_flush_incompatible(struct file *file, struct folio *folio)
do_flush |= l_ctx->lockowner != current->files;
}
nfs_release_request(req);
if (!do_flush)
return 0;
- status = nfs_wb_folio(folio->mapping->host, folio);
+ status = nfs_wb_folio(folio->mapping->host, folio, true);
} while (status == 0);
return status;
}
/*
@@ -2030,11 +2030,11 @@ int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio)
* @folio: pointer to folio
*
* Assumes that the folio has been locked by the caller, and will
* not unlock it.
*/
-int nfs_wb_folio(struct inode *inode, struct folio *folio)
+int nfs_wb_folio(struct inode *inode, struct folio *folio, bool sync)
{
loff_t range_start = folio_pos(folio);
size_t len = folio_size(folio);
struct writeback_control wbc = {
.sync_mode = WB_SYNC_ALL,
@@ -2055,11 +2055,11 @@ int nfs_wb_folio(struct inode *inode, struct folio *folio)
continue;
}
ret = 0;
if (!folio_test_private(folio))
break;
- ret = nfs_commit_inode(inode, FLUSH_SYNC);
+ ret = nfs_commit_inode(inode, sync ? FLUSH_SYNC : 0);
if (ret < 0)
goto out_error;
}
out_error:
trace_nfs_writeback_folio_done(inode, range_start, len, ret);
@@ -2078,11 +2078,11 @@ int nfs_migrate_folio(struct address_space *mapping, struct folio *dst,
* that we can safely release the inode reference while holding
* the folio lock.
*/
if (folio_test_private(src)) {
if (mode == MIGRATE_SYNC)
- nfs_wb_folio(src->mapping->host, src);
+ nfs_wb_folio(src->mapping->host, src, true);
if (folio_test_private(src))
return -EBUSY;
}
if (folio_test_private_2(src)) { /* [DEPRECATED] */
diff --git a/include/linux/nfs_fs.h b/include/linux/nfs_fs.h
index a6624edb7226..295bc6214750 100644
--- a/include/linux/nfs_fs.h
+++ b/include/linux/nfs_fs.h
@@ -634,11 +634,11 @@ extern int nfs_update_folio(struct file *file, struct folio *folio,
* Try to write back everything synchronously (but check the
* return value!)
*/
extern int nfs_sync_inode(struct inode *inode);
extern int nfs_wb_all(struct inode *inode);
-extern int nfs_wb_folio(struct inode *inode, struct folio *folio);
+extern int nfs_wb_folio(struct inode *inode, struct folio *folio, bool sync);
int nfs_wb_folio_cancel(struct inode *inode, struct folio *folio);
extern int nfs_commit_inode(struct inode *, int);
extern struct nfs_commit_data *nfs_commitdata_alloc(void);
extern void nfs_commit_free(struct nfs_commit_data *data);
void nfs_commit_begin(struct nfs_mds_commit_info *cinfo);
--
2.34.3