Message-ID: <9420a368-8d18-4920-b196-a65cb265a26a@huawei.com>
Date: Tue, 26 Nov 2024 10:28:49 +0800
From: Li Lingfeng <lilingfeng3@...wei.com>
To: Mark Liam Brown <brownmarkliam@...il.com>, <linux-nfs@...r.kernel.org>,
<linux-kernel@...r.kernel.org>
CC: yangerkun <yangerkun@...wei.com>, "zhangyi (F)" <yi.zhang@...wei.com>,
"yukuai (C)" <yukuai3@...wei.com>, <chengzhihao1@...wei.com>, Hou Tao
<houtao1@...wei.com>
Subject: Re: [bug report] deploying both NFS client and server on the same
machine triggle hungtask
On 2024/11/26 1:32, Mark Liam Brown wrote:
> On Mon, Nov 25, 2024 at 1:48 PM Li Lingfeng <lilingfeng3@...wei.com> wrote:
>> Hi, we have found a hungtask issue recently.
>>
>> Commit 7746b32f467b ("NFSD: add shrinker to reap courtesy clients on low
>> memory condition") adds a shrinker to NFSD, which causes NFSD to try to
>> obtain shrinker_rwsem when starting and stopping services.
>>
>> Deploying both NFS client and server on the same machine may lead to the
>> following issue, since they will share the global shrinker_rwsem.
>>
>>          nfsd                          nfs
>>                                 drop_cache // hold shrinker_rwsem
>>                                 write back, wait for rpc_task to exit
>> // stop nfsd threads
>> svc_set_num_threads
>> // clean up xprts
>> svc_xprt_destroy_all
>>                                 rpc_check_timeout
>>                                  rpc_check_connected
>>                                  // wait for the connection to be disconnected
>> unregister_shrinker
>> // wait for shrinker_rwsem
>>
>> Normally, the client's rpc_task will exit after the server's nfsd thread
>> has processed the request.
>> When all the server's nfsd threads exit, the client’s rpc_task is expected
>> to detect the network connection being disconnected and exit.
>> However, although the server has executed svc_xprt_destroy_all before
>> waiting for shrinker_rwsem, the network connection is not actually
>> disconnected. Instead, the operation to close the socket is simply added
>> to the task_works queue.
>>
>> svc_xprt_destroy_all
>>  ...
>>  svc_sock_free
>>   sockfd_put
>>    fput_many
>>     init_task_work // ____fput
>>     task_work_add // add to task->task_works
>>
>> The actual disconnection of the network connection will only occur after
>> the current process finishes.
>> do_exit
>>  exit_task_work
>>   task_work_run
>>    ...
>>    ____fput // close sock
>>
>> Although it is not a common practice to deploy NFS client and server on
>> the same machine, I think this issue still needs to be addressed,
>> otherwise it will cause all processes trying to acquire the shrinker_rwsem
>> to hang.
> I disagree with that comment. Most small companies have NFS client and
> NFS server on the same machine, the client being used to allow logins
> by users, or to support schroot or containers.
>
> Mark
Sorry for my hasty conclusion.
By the way, nfsd_reply_cache_shrinker triggers this too.
Li
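
The dependency cycle described in the quoted report can be sketched as a small user-space toy model. This is only an illustration under stated assumptions, not kernel code: the `Task`, `Sock`, and `simulate` names are hypothetical, the socket close is modeled as a callback queued on the task (standing in for `____fput` on `task->task_works`), and the client side is reduced to one rule: drop_cache releases shrinker_rwsem only after the connection is seen as disconnected.

```python
from collections import deque


class Sock:
    """Hypothetical stand-in for the server-side socket."""

    def __init__(self):
        self.connected = True

    def close(self):
        self.connected = False


class Task:
    """Hypothetical stand-in for a task with a task_works queue."""

    def __init__(self):
        self.task_works = deque()

    def task_work_add(self, work):
        # The work is only queued here; nothing is closed yet.
        self.task_works.append(work)

    def task_work_run(self):
        # In this model, queued work runs only when the task finishes.
        while self.task_works:
            self.task_works.popleft()()


def simulate(drop_cache_holds_rwsem):
    sock = Sock()
    nfsd = Task()

    # svc_xprt_destroy_all: the close is deferred, not performed.
    nfsd.task_work_add(sock.close)

    rwsem_owner = "drop_cache" if drop_cache_holds_rwsem else None
    nfsd_finished = False

    # A few rounds are enough for this model to reach a fixed point.
    for _ in range(3):
        # Client side: writeback completes (and drop_cache releases the
        # rwsem) only once the rpc_task sees the connection go away.
        if rwsem_owner == "drop_cache" and not sock.connected:
            rwsem_owner = None

        # Server side: unregister_shrinker needs the rwsem; only after
        # it succeeds does the task finish and run its deferred work.
        if not nfsd_finished and rwsem_owner is None:
            nfsd_finished = True
            nfsd.task_work_run()

    return {
        "socket_connected": sock.connected,
        "nfsd_finished": nfsd_finished,
        "rwsem_owner": rwsem_owner,
    }
```

In this model, `simulate(True)` never makes progress: the socket stays connected because the close is still queued, and the queue is never run because the shutdown path is stuck behind the rwsem. `simulate(False)` completes and closes the socket.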