Message-ID: <5adb469d-9e4b-e2d9-a77c-a1a4d11a49d5@huaweicloud.com>
Date: Fri, 29 Aug 2025 09:09:45 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Jens Axboe <axboe@...nel.dk>, Li Nan <linan666@...weicloud.com>,
Ming Lei <ming.lei@...hat.com>
Cc: Yu Kuai <yukuai1@...weicloud.com>, jianchao.w.wang@...cle.com,
linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
yangerkun@...wei.com, yi.zhang@...wei.com, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in
blk_mq_unregister_hctx
Hi,
On 2025/08/29 1:23, Jens Axboe wrote:
> On 8/28/25 3:28 AM, Li Nan wrote:
>>
>>
>> On 2025/8/27 16:10, Ming Lei wrote:
>>> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>>>
>>>>
>>>> On 2025/8/27 9:35, Ming Lei wrote:
>>>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@...weicloud.com wrote:
>>>>>>>> From: Li Nan <linan122@...wei.com>
>>>>>>>>
>>>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>>>
>>>>>>> Looks we should check its return value and handle the failure in both
>>>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>>>
>>>>>> From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>>>> unregistered, and this function is void, we failed to register new hctxs
>>>>>> because of memory allocation failure. I really don't know how to handle
>>>>>> the failure here, do you have any suggestions?
>>>>>
>>>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>>>> intact instead of making it `partial workable`, such as:
>>>>>
>>>>> - try update nr_hw_queues to 1
>>>>>
>>>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>>>
>>>>
>>>> If we ignore these non-critical sysfs creation failures, the disk remains
>>>> usable with no loss of functionality. Deleting the disk seems to escalate
>>>> the error?
>>>
>>> It is more like a workaround by ignoring the sysfs register failure. And if
>>> the issue need to be fixed in this way, you have to document it.
>>>
>>> In case of OOM, it usually means that the system isn't usable any more.
>>> But it is NOIO allocation and the typical use case is for error recovery in
>>> nvme pci, so there may not be enough pages for noio allocation only. That is
>>> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
>>>
>>> But NVMe has been pretty fragile in this area by using non-owner queue
>>> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
>>> really necessary to take it into account?
>>
>> I agree with your points about NOIO and NVMe.
>>
>> I hit this issue in null_blk during fuzz testing with memory-fault
>> injection. Changing the number of hardware queues under OOM is
>> extremely rare in real-world usage. So I think adding a workaround and
>> documenting it is sufficient. What do you think?
>
> Working around it is fine, as it isn't a situation we really need to
> worry about. But let's please not do it by poking at kobject internals.
>
This is already done in some places, such as sysfs_slab_unlink().
Or would we prefer to add a new hctx->state flag like BLK_MQ_S_REGISTERED?
Thanks,
Kuai