linux-kernel - Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister


Message-ID: <fc587a1a-97fb-584c-c17c-13bb5e3d7a92@huaweicloud.com>
Date: Thu, 28 Aug 2025 17:28:26 +0800
From: Li Nan <linan666@...weicloud.com>
To: Ming Lei <ming.lei@...hat.com>, Li Nan <linan666@...weicloud.com>
Cc: Yu Kuai <yukuai1@...weicloud.com>, axboe@...nel.dk,
 jianchao.w.wang@...cle.com, linux-block@...r.kernel.org,
 linux-kernel@...r.kernel.org, yangerkun@...wei.com, yi.zhang@...wei.com,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] blk-mq: check kobject state_in_sysfs before deleting in
 blk_mq_unregister_hctx



On 2025/8/27 16:10, Ming Lei wrote:
> On Wed, Aug 27, 2025 at 11:22:06AM +0800, Li Nan wrote:
>>
>>
>> On 2025/8/27 9:35, Ming Lei wrote:
>>> On Wed, Aug 27, 2025 at 09:04:45AM +0800, Yu Kuai wrote:
>>>> Hi,
>>>>
>>>> On 2025/08/27 8:58, Ming Lei wrote:
>>>>> On Tue, Aug 26, 2025 at 04:48:54PM +0800, linan666@...weicloud.com wrote:
>>>>>> From: Li Nan <linan122@...wei.com>
>>>>>>
>>>>>> In __blk_mq_update_nr_hw_queues() the return value of
>>>>>> blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
>>>>>
>>>>> Looks we should check its return value and handle the failure in both
>>>>> the call site and blk_mq_sysfs_register_hctxs().
>>>>
>>>>   From __blk_mq_update_nr_hw_queues(), the old hctxs is already
>>>> unregistered, and this function is void, we failed to register new hctxs
>>>> because of memory allocation failure. I really don't know how to handle
>>>> the failure here, do you have any suggestions?
>>>
>>> It is out of memory, I think it is fine to do whatever to leave queue state
>>> intact instead of making it `partial workable`, such as:
>>>
>>> - try update nr_hw_queues to 1
>>>
>>> - if it still fails, delete disk & mark queue as dead if disk is attached
>>>
>>
>> If we ignore these non-critical sysfs creation failures, the disk remains
>> usable with no loss of functionality. Deleting the disk seems to escalate
>> the error?
> 
> It is more like a workaround by ignoring the sysfs register failure. And if
> the issue need to be fixed in this way, you have to document it.
>
> In case of OOM, it usually means that the system isn't usable any more.
> But it is NOIO allocation and the typical use case is for error recovery in
> nvme pci, so there may not be enough pages for noio allocation only. That is
> the reason for ignoring sysfs register in blk_mq_update_nr_hw_queues()?
> 
> But NVMe has been pretty fragile in this area by using non-owner queue
> freeze, and call blk_mq_update_nr_hw_queues() on frozen queue, so it is
> really necessary to take it into account?

I agree with your points about NOIO and NVMe.

I hit this issue in null_blk during fuzz testing with memory-fault
injection. Changing the number of hardware queues under OOM is extremely 
rare in real-world usage. So I think adding a workaround and documenting it
is sufficient. What do you think?
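
For illustration only, here is a minimal userspace sketch of the workaround idea named in the subject line: skip the sysfs deletion when the kobject was never successfully added. It is not the patch itself. The `mock_*` types and functions are hypothetical stand-ins invented for this sketch; only the `state_in_sysfs` flag name mirrors the field in the kernel's `struct kobject`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct kobject; not kernel code. */
struct mock_kobject {
	bool state_in_sysfs;	/* set only after a successful add */
	int del_calls;		/* how many times deletion actually ran */
};

/* Hypothetical stand-in for a hardware queue context embedding a kobject. */
struct mock_hctx {
	struct mock_kobject kobj;
};

/* Models sysfs registration that may fail, e.g. under memory pressure. */
static int mock_kobject_add(struct mock_kobject *kobj, bool simulate_enomem)
{
	if (simulate_enomem)
		return -ENOMEM;	/* registration failed, flag stays false */
	kobj->state_in_sysfs = true;
	return 0;
}

/* Models removal from sysfs; only valid for a registered object. */
static void mock_kobject_del(struct mock_kobject *kobj)
{
	kobj->del_calls++;
	kobj->state_in_sysfs = false;
}

/*
 * The workaround: if an earlier registration failed and its return value
 * was ignored, there is nothing in sysfs to remove, so return early.
 * Returns true when a deletion was performed.
 */
static bool mock_unregister_hctx(struct mock_hctx *hctx)
{
	assert(hctx != NULL);
	if (!hctx->kobj.state_in_sysfs)
		return false;
	mock_kobject_del(&hctx->kobj);
	return true;
}
```

In this sketch, unregistering a context whose add failed is a no-op, while a successfully added one is deleted exactly once. The disk stays usable either way; the cost is that the failed hctx has no sysfs entry, which is the behavior that would need documenting.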

> 
> Thanks,
> Ming
> 


-- 
Thanks,
Nan