linux-kernel - Re: [bug report] deploying both NFS client and server on the same machine triggle hungtask


Message-ID: <9420a368-8d18-4920-b196-a65cb265a26a@huawei.com>
Date: Tue, 26 Nov 2024 10:28:49 +0800
From: Li Lingfeng <lilingfeng3@...wei.com>
To: Mark Liam Brown <brownmarkliam@...il.com>, <linux-nfs@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>
CC: yangerkun <yangerkun@...wei.com>, "zhangyi (F)" <yi.zhang@...wei.com>,
	"yukuai (C)" <yukuai3@...wei.com>, <chengzhihao1@...wei.com>, Hou Tao
	<houtao1@...wei.com>
Subject: Re: [bug report] deploying both NFS client and server on the same
 machine triggle hungtask


On 2024/11/26 1:32, Mark Liam Brown wrote:
> On Mon, Nov 25, 2024 at 1:48 PM Li Lingfeng <lilingfeng3@...wei.com> wrote:
>> Hi, we have found a hungtask issue recently.
>>
>> Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
>> memory condition") adds a shrinker to NFSD, which causes NFSD to try to
>> obtain shrinker_rwsem when starting and stopping services.
>>
>> Deploying both NFS client and server on the same machine may lead to the
>> following issue, since they will share the global shrinker_rwsem.
>>
>>       nfsd                            nfs
>>                               drop_cache // hold shrinker_rwsem
>>                               write back, wait for rpc_task to exit
>> // stop nfsd threads
>> svc_set_num_threads
>> // clean up xprts
>> svc_xprt_destroy_all
>>                               rpc_check_timeout
>>                                rpc_check_connected
>>                                // wait for the connection to be disconnected
>> unregister_shrinker
>> // wait for shrinker_rwsem
>>
>> Normally, the client's rpc_task will exit after the server's nfsd thread
>> has processed the request.
>> When all the server's nfsd threads exit, the client’s rpc_task is expected
>> to detect the network connection being disconnected and exit.
>> However, although the server has executed svc_xprt_destroy_all before
>> waiting for shrinker_rwsem, the network connection is not actually
>> disconnected. Instead, the operation to close the socket is simply added
>> to the task_works queue.
>>
>> svc_xprt_destroy_all
>>    ...
>>    svc_sock_free
>>     sockfd_put
>>      fput_many
>>       init_task_work // ____fput
>>       task_work_add // add to task->task_works
>>
>> The actual disconnection of the network connection will only occur after
>> the current process finishes.
>>
>> do_exit
>>    exit_task_work
>>     task_work_run
>>     ...
>>      ____fput // close sock
>>
>> Although it is not a common practice to deploy NFS client and server on
>> the same machine, I think this issue still needs to be addressed,
>> otherwise it will cause all processes trying to acquire the shrinker_rwsem
>> to hang.
> I disagree with that comment. Most small companies have NFS client and
> NFS server on the same machine, the client being used to allow logins
> by users, or to support schroot or containers.
>
> Mark

Sorry for my hasty conclusion.

By the way, nfsd_reply_cache_shrinker triggers the same hang too.

Li
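
For readers following the quoted call chains, the deferred close can be
modeled in userspace. This is only a toy sketch, not kernel code: every
toy_* name is hypothetical, and it assumes nothing beyond what the report
describes, namely that the final fput is queued as task work and the socket
is closed only when that queue is run as the task exits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code. All toy_* names are hypothetical. */

#define TOY_MAX_WORKS 8

struct toy_sock {
	bool connected;		/* what the client side would observe */
};

typedef void (*toy_work_fn)(struct toy_sock *);

struct toy_task {
	toy_work_fn fn[TOY_MAX_WORKS];
	struct toy_sock *arg[TOY_MAX_WORKS];
	size_t nr_works;	/* number of queued, not yet run, works */
};

/* Stand-in for ____fput(): the point where the socket is really closed. */
static void toy_final_fput(struct toy_sock *sk)
{
	sk->connected = false;
}

/* Stand-in for fput_many() -> task_work_add(): only queues the close. */
static void toy_fput(struct toy_task *t, struct toy_sock *sk)
{
	assert(t->nr_works < TOY_MAX_WORKS);
	t->fn[t->nr_works] = toy_final_fput;
	t->arg[t->nr_works] = sk;
	t->nr_works++;
}

/* Stand-in for task_work_run(), reached via do_exit() -> exit_task_work(). */
static void toy_task_work_run(struct toy_task *t)
{
	for (size_t i = 0; i < t->nr_works; i++)
		t->fn[i](t->arg[i]);
	t->nr_works = 0;
}
```

In this model, calling toy_fput() leaves sk.connected true. Only
toy_task_work_run() flips it. If the task blocks between those two calls,
as nfsd does on shrinker_rwsem in the report, the peer never sees the
disconnect.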