Message-ID: <8b155d3c-62b4-4f16-ab00-e3d030148d29@huawei.com>
Date: Thu, 28 Nov 2024 15:22:04 +0800
From: Li Lingfeng <lilingfeng3@...wei.com>
To: <Dai.Ngo@...cle.com>, Chuck Lever <chuck.lever@...cle.com>, Jeff Layton
<jlayton@...nel.org>, NeilBrown <neilb@...e.de>, <okorniev@...hat.com>,
<tom@...pey.com>, <trond.myklebust@...merspace.com>
CC: <linux-nfs@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Yu Kuai
<yukuai1@...weicloud.com>, Hou Tao <houtao1@...wei.com>, "zhangyi (F)"
<yi.zhang@...wei.com>, yangerkun <yangerkun@...wei.com>,
<chengzhihao1@...wei.com>, Li Lingfeng <lilingfeng@...weicloud.com>
Subject: Re: [bug report] deploying both NFS client and server on the same
machine triggers a hung task
Besides nfsd_file_shrinker, the nfsd_client_shrinker added by commit
7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low memory
condition") in 2022 and the nfsd_reply_cache_shrinker added by commit
3ba75830ce17 ("nfsd4: drc containerization") in 2019 may also trigger such
an issue.
Was this scenario not considered when the NFSD shrinkers were designed, or
was it deemed too unlikely to be worth handling?
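
For reference, all three of these shrinkers follow the same lifecycle: they are
registered when the corresponding service (or its net namespace) is set up and
unregistered again on shutdown, and it is the unregister step that has to wait
for shrinker_rwsem. A rough sketch of that pattern, on kernels that still use
register_shrinker()/unregister_shrinker() (placeholder code, not the actual
nfsd callbacks):

#include <linux/shrinker.h>

/* placeholder callbacks standing in for e.g. the courtesy-client or DRC logic */
static unsigned long example_count(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	/* report how many objects could be reclaimed */
	return 0;
}

static unsigned long example_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
{
	/* reclaim up to sc->nr_to_scan objects, return how many were freed */
	return SHRINK_STOP;
}

static struct shrinker example_shrinker = {
	.count_objects	= example_count,
	.scan_objects	= example_scan,
	.seeks		= DEFAULT_SEEKS,
};

static int example_start(void)
{
	/* older kernels take no name argument: register_shrinker(&example_shrinker) */
	return register_shrinker(&example_shrinker, "nfsd-example");
}

static void example_stop(void)
{
	/* takes shrinker_rwsem for write, which is where shutdown can block */
	unregister_shrinker(&example_shrinker);
}
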
On 2024/11/25 19:17, Li Lingfeng wrote:
> Hi, we recently found a hung-task issue.
>
> Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
> memory condition") adds a shrinker to NFSD, which causes NFSD to try to
> obtain shrinker_rwsem when starting and stopping services.
>
> Deploying both NFS client and server on the same machine may lead to the
> following issue, since they will share the global shrinker_rwsem.
>
> nfsd                                        nfs
>                                             drop_cache // hold shrinker_rwsem
>                                             write back, wait for rpc_task to exit
> // stop nfsd threads
> svc_set_num_threads
>   // clean up xprts
>   svc_xprt_destroy_all
>                                             rpc_check_timeout
>                                               rpc_check_connected
>                                               // wait for the connection to be
>                                               // disconnected
>   unregister_shrinker
>   // wait for shrinker_rwsem
>
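> To make the lock interaction explicit: shrink_slab() walks the registered
> shrinkers under a read lock on shrinker_rwsem, while unregister_shrinker()
> needs the same rwsem for write. A heavily simplified sketch of the two sides
> (based on my reading of mm/vmscan.c, not verbatim kernel code; the *_sketch
> names are placeholders):
>
> /* reclaim side, e.g. triggered by drop_caches */
> static unsigned long shrink_slab_sketch(gfp_t gfp_mask, int nid, int priority)
> {
> 	struct shrinker *shrinker;
> 	unsigned long freed = 0;
>
> 	if (!down_read_trylock(&shrinker_rwsem))	/* held for READ while shrinkers run */
> 		return 0;
>
> 	list_for_each_entry(shrinker, &shrinker_list, list) {
> 		/* a shrinker called here may write back NFS pages and end
> 		 * up waiting for an rpc_task to complete */
> 		freed += do_shrink_slab_sketch(shrinker, gfp_mask, nid, priority);
> 	}
>
> 	up_read(&shrinker_rwsem);
> 	return freed;
> }
>
> /* nfsd shutdown side */
> static void unregister_shrinker_sketch(struct shrinker *shrinker)
> {
> 	down_write(&shrinker_rwsem);	/* blocks until every in-flight reader is done */
> 	list_del(&shrinker->list);
> 	up_write(&shrinker_rwsem);
> }
>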
> Normally, the client's rpc_task will exit after the server's nfsd thread
> has processed the request.
> When all of the server's nfsd threads exit, the client's rpc_task is
> expected to detect that the connection has been torn down and to exit as
> well.
> However, although the server has executed svc_xprt_destroy_all before
> waiting for shrinker_rwsem, the network connection is not actually
> disconnected. Instead, the operation that closes the socket is merely
> queued on the task's task_works list.
>
> svc_xprt_destroy_all
>   ...
>     svc_sock_free
>       sockfd_put
>         fput_many
>           init_task_work // ____fput
>           task_work_add // add to task->task_works
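>
> (For illustration, a condensed sketch of what fput() does from normal process
> context, as far as I understand it; the struct file field that holds the
> callback_head differs between kernel versions, so the names below are
> approximate:)
>
> 	if (atomic_long_dec_and_test(&file->f_count)) {
> 		struct task_struct *task = current;
>
> 		if (!in_interrupt() && !(task->flags & PF_KTHREAD)) {
> 			/* defer the real close: ____fput() only runs from
> 			 * task_work_run(), i.e. when this task returns to
> 			 * userspace or exits */
> 			init_task_work(&file->f_rcuhead, ____fput);
> 			if (!task_work_add(task, &file->f_rcuhead, TWA_RESUME))
> 				return;
> 		}
> 		/* otherwise fall back to a workqueue (delayed_fput) */
> 	}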
>
> The network connection is actually torn down only after the current
> process exits.
> do_exit
>   exit_task_work
>     task_work_run
>       ...
>         ____fput // close sock
>
> Although it is not common practice to deploy the NFS client and server on
> the same machine, I think this issue still needs to be addressed;
> otherwise it will cause every process that tries to acquire
> shrinker_rwsem to hang.
>
> I don't have any ideas yet on how to solve this problem. Does anyone have
> any suggestions?
>
> Thanks.
>