Message-ID: <cb5540cc1f63a0761444850d153a41b2e33d5a8b.camel@kernel.org>
Date: Tue, 18 Feb 2025 10:51:01 -0500
From: Jeff Layton <jlayton@...nel.org>
To: Chuck Lever <chuck.lever@...cle.com>, Li Lingfeng
 <lilingfeng3@...wei.com>, 	neilb@...e.de, okorniev@...hat.com,
 Dai.Ngo@...cle.com, tom@...pey.com, 	linux-nfs@...r.kernel.org,
 linux-kernel@...r.kernel.org
Cc: yukuai1@...weicloud.com, houtao1@...wei.com, yi.zhang@...wei.com, 
	yangerkun@...wei.com, lilingfeng@...weicloud.com
Subject: Re: [PATCH] nfsd: decrease cl_cb_inflight if fail to queue cb_work

On Tue, 2025-02-18 at 10:02 -0500, Chuck Lever wrote:
> On 2/18/25 9:40 AM, Jeff Layton wrote:
> > On Tue, 2025-02-18 at 09:31 -0500, Chuck Lever wrote:
> > > On 2/18/25 9:29 AM, Jeff Layton wrote:
> > > > On Tue, 2025-02-18 at 08:58 -0500, Jeff Layton wrote:
> > > > > On Tue, 2025-02-18 at 21:54 +0800, Li Lingfeng wrote:
> > > > > > In nfsd4_run_cb, cl_cb_inflight is increased before attempting to queue
> > > > > > cb_work to callback_wq. This count can be decreased in three situations:
> > > > > > 1) If queuing fails in nfsd4_run_cb, the count will be decremented
> > > > > > accordingly.
> > > > > > 2) After cb_work is running, the count is decreased in the exception
> > > > > > branch of nfsd4_run_cb_work via nfsd41_destroy_cb.
> > > > > > 3) The count is decreased in the release callback of rpc_task — either
> > > > > > directly calling nfsd41_cb_inflight_end in nfsd4_cb_probe_release, or
> > > > > > calling nfsd41_destroy_cb in nfsd4_cb_release.
> > > > > > 
> > > > > > However, in nfsd4_cb_release, if the current cb_work needs to restart, the
> > > > > > count will not be decreased, with the expectation that it will be reduced
> > > > > > once cb_work is running.
> > > > > > If queuing fails here, then the count will leak, ultimately causing the
> > > > > > nfsd service to be unable to exit as shown below:
> > > > > > [root@..._test2 ~]# cat /proc/2271/stack
> > > > > > [<0>] nfsd4_shutdown_callback+0x22b/0x290
> > > > > > [<0>] __destroy_client+0x3cd/0x5c0
> > > > > > [<0>] nfs4_state_destroy_net+0xd2/0x330
> > > > > > [<0>] nfs4_state_shutdown_net+0x2ad/0x410
> > > > > > [<0>] nfsd_shutdown_net+0xb7/0x250
> > > > > > [<0>] nfsd_last_thread+0x15f/0x2a0
> > > > > > [<0>] nfsd_svc+0x388/0x3f0
> > > > > > [<0>] write_threads+0x17e/0x2b0
> > > > > > [<0>] nfsctl_transaction_write+0x91/0xf0
> > > > > > [<0>] vfs_write+0x1c4/0x750
> > > > > > [<0>] ksys_write+0xcb/0x170
> > > > > > [<0>] do_syscall_64+0x70/0x120
> > > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x78/0xe2
> > > > > > [root@..._test2 ~]#
> > > > > > 
> > > > > > Fix this by decreasing cl_cb_inflight if the restart fails.
> > > > > > 
> > > > > > Fixes: cba5f62b1830 ("nfsd: fix callback restarts")
> > > > > > Signed-off-by: Li Lingfeng <lilingfeng3@...wei.com>
> > > > > > ---
> > > > > >  fs/nfsd/nfs4callback.c | 10 +++++++---
> > > > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> > > > > > index 484077200c5d..8a7d24efdd08 100644
> > > > > > --- a/fs/nfsd/nfs4callback.c
> > > > > > +++ b/fs/nfsd/nfs4callback.c
> > > > > > @@ -1459,12 +1459,16 @@ static void nfsd4_cb_done(struct rpc_task *task, void *calldata)
> > > > > >  static void nfsd4_cb_release(void *calldata)
> > > > > >  {
> > > > > >  	struct nfsd4_callback *cb = calldata;
> > > > > > +	struct nfs4_client *clp = cb->cb_clp;
> > > > > > +	int queued;
> > > > > >  
> > > > > >  	trace_nfsd_cb_rpc_release(cb->cb_clp);
> > > > > >  
> > > > > > -	if (cb->cb_need_restart)
> > > > > > -		nfsd4_queue_cb(cb);
> > > > > > -	else
> > > > > > +	if (cb->cb_need_restart) {
> > > > > > +		queued = nfsd4_queue_cb(cb);
> > > > > > +		if (!queued)
> > > > > > +			nfsd41_cb_inflight_end(clp);
> > > > > > +	} else
> > > > > >  		nfsd41_destroy_cb(cb);
> > > > > >  
> > > > > >  }
> > > > > 
> > > > > Good catch!
> > > > > 
> > > > > Reviewed-by: Jeff Layton <jlayton@...nel.org>
> > > > > 
> > > > 
> > > > Actually, I think this is not quite right. It's a bit more subtle than
> > > > it first appears. The problem of course is that the callback workqueue
> > > > jobs run in a different task than the RPC workqueue jobs, so they can
> > > > race.
> > > > 
> > > > cl_cb_inflight gets bumped when the callback is first queued, and only
> > > > gets released in nfsd41_destroy_cb(). If it fails to be queued, it's
> > > > because something else has queued the workqueue job in the meantime.
> > > > 
> > > > There are two places that can occur: nfsd4_cb_release() and
> > > > nfsd4_run_cb(). Since this is occurring in nfsd4_cb_release(), the only
> > > > other option is that something raced in and queued it via
> > > > nfsd4_run_cb().
> > > 
> > > What would be the "something" that raced in?
> > > 
> > 
> > I think we may be able to get there via multiple __break_lease() calls
> > on the same layout or delegation. That could mean multiple calls to the
> > ->lm_break operation on the same object.
> 
> Makes sense.
> 
> Out of curiosity, what would be the complexity/performance cost of
> serializing the lm_break calls, or having each lm_break call register
> an interest in the CB_RECALL callback? Maybe not worth it.
> 

I'd probably resist trying to put that logic in generic code. For now
I'd rather just keep it in the nfsd4_callback handling.

Longer term, I'm wondering if we really need the callback workqueue at
all. It basically exists to just queue async RPC jobs (which are
themselves workqueue tasks). Maybe it'd be simpler to just queue the
RPC tasks directly at the points where we call nfsd4_run_cb() today?

That's a bit more of an overhaul though, particularly in the callback
channel probing codepath.

> 
> > > > That will have incremented cl_cb_inflight an extra
> > > > time, so your patch makes sense for that case.
> > > > 
> > > > Unfortunately, the slot may leak in that case if nothing released it
> > > > earlier. I think this probably needs to call nfsd41_destroy_cb() if the
> > > > job can't be queued. That might, however, race with the callback
> > > > workqueue job running.
> > > > 
> > > > I think we might need to consider adding some sort of "this callback is
> > > > running" atomic flag: do a test_and_set on the flag in nfsd4_run_cb()
> > > > and only queue the workqueue job if that comes back false. Then, we can
> > > > clear the bit in nfsd41_destroy_cb().
> > > > 
> > > > That should ensure that you never fail to queue the workqueue job from
> > > > nfsd4_cb_release().
> > > > 
> > > > Thoughts?
> > > 
> > > 
> > 
> 
> 

-- 
Jeff Layton <jlayton@...nel.org>
