linux-kernel - Re: [PATCH RESEND v2 5/5] sbitmap: correct wake_batch recalculation to avoid potential IO hung


Message-ID: <d00297d7-a77a-770a-1cd7-1632f8ae77e0@huaweicloud.com>
Date:   Mon, 26 Dec 2022 15:50:58 +0800
From:   Yu Kuai <yukuai1@...weicloud.com>
To:     Jan Kara <jack@...e.cz>, Kemeng Shi <shikemeng@...weicloud.com>
Cc:     axboe@...nel.dk, linux-block@...r.kernel.org,
        linux-kernel@...r.kernel.org, kbusch@...nel.org,
        Laibin Qiu <qiulaibin@...wei.com>,
        "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH RESEND v2 5/5] sbitmap: correct wake_batch recalculation
 to avoid potential IO hung

Hi,

On 2022/12/22 21:41, Jan Kara wrote:
> On Thu 22-12-22 22:33:53, Kemeng Shi wrote:
>> Commit 180dccb0dba4f ("blk-mq: fix tag_get wait task can't be awakened")
>> mentioned that in case of shared tags, there could be just one real
>> active hctx(queue) because of lazy detection of tag idle. Then driver tag
>> allocation may wait forever on this real active hctx(queue) if wake_batch
>> is > hctx_max_depth where hctx_max_depth is available tags depth for the
>> active hctx(queue). However, the condition wake_batch > hctx_max_depth is
>> not strong enough to avoid an IO hang, as sbitmap_queue_wake_up will only
>> wake up one wait queue for each wake_batch even though there is only one
>> waiter in the woken wait queue. After this, there is only one tag to free
>> and wake_batch may not be reached anymore. Commit 180dccb0dba4f ("blk-mq:
>> fix tag_get wait task can't be awakened") mentioned that driver tag
>> allocation may wait forever. Actually, the inactive hctx(queue) will be
>> truly idle after at most 30 seconds and will call blk_mq_tag_wakeup_all
>> to wake one waiter per wait queue to break the hang. But an IO hang of 30
>> seconds is also not acceptable. Set the batch size small enough that the
>> depth of the shared hctx(queue) is enough to wake up all of the queues, as
>> sbq_calc_wake_batch does, to fix this potential IO hang.
>>
>> Although hctx_max_depth will be clamped to at least 4 while the wake_batch
>> recalculation does not do the clamp, wake_batch will always be
>> recalculated to 1 when hctx_max_depth <= 4.
>>
>> Fixes: 180dccb0dba4 ("blk-mq: fix tag_get wait task can't be awakened")
>> Signed-off-by: Kemeng Shi <shikemeng@...weicloud.com>
> 
> So the condition in sbitmap_queue_recalculate_wake_batch() also seemed
> strange to me and the changelogs of commits 180dccb0dba4 and 10825410b95
> ("blk-mq: Fix wrong wakeup batch configuration which will cause hang")
> didn't add much confidence about the magic batch setting to 4. Let me add
> to CC original author of this code if he has any thoughts on why using
> wake batch of 4 is safe for cards with say 32 tags in case active_users is
> currently 32. Because I don't see why that is correct either.
> 

If I remember correctly, the reason for using 4 here in the first
place was to avoid performance degradation. As for why this is safe:
4 * 8 = 32. If someone is waiting for a tag, all 32 tags have been
grabbed, and a wake batch of 4 ensures that at least 8 wait queues will
be woken. It's true that some wait queues might have only one waiter,
but I don't think this will cause an IO hang.

Thanks,
Kuai
> 								Honza
> 
>> ---
>>   lib/sbitmap.c | 5 +----
>>   1 file changed, 1 insertion(+), 4 deletions(-)
>>
>> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
>> index b6d3bb1c3675..804fe99783e4 100644
>> --- a/lib/sbitmap.c
>> +++ b/lib/sbitmap.c
>> @@ -458,13 +458,10 @@ void sbitmap_queue_recalculate_wake_batch(struct sbitmap_queue *sbq,
>>   					    unsigned int users)
>>   {
>>   	unsigned int wake_batch;
>> -	unsigned int min_batch;
>>   	unsigned int depth = (sbq->sb.depth + users - 1) / users;
>>   
>> -	min_batch = sbq->sb.depth >= (4 * SBQ_WAIT_QUEUES) ? 4 : 1;
>> -
>>   	wake_batch = clamp_val(depth / SBQ_WAIT_QUEUES,
>> -			min_batch, SBQ_WAKE_BATCH);
>> +			1, SBQ_WAKE_BATCH);
>>   
>>   	WRITE_ONCE(sbq->wake_batch, wake_batch);
>>   }
>> -- 
>> 2.30.0
>>
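
To make the numbers in this thread concrete, here is a minimal userspace
sketch of the recalculation before and after the patch. It is not the
kernel code: it assumes SBQ_WAIT_QUEUES and SBQ_WAKE_BATCH are both 8 (as
in include/linux/sbitmap.h), replaces clamp_val() with a plain helper, and
the names clamp_uint(), wake_batch_before(), wake_batch_after() and
batches_completed() are made up for illustration.

```c
/* Assumed to match include/linux/sbitmap.h. */
#define SBQ_WAIT_QUEUES 8
#define SBQ_WAKE_BATCH 8

/* Hypothetical stand-in for the kernel's clamp_val(). */
unsigned int clamp_uint(unsigned int v, unsigned int lo, unsigned int hi)
{
	if (v < lo)
		return lo;
	if (v > hi)
		return hi;
	return v;
}

/*
 * Pre-patch logic: the lower bound is 4 whenever the whole bitmap has at
 * least 4 * SBQ_WAIT_QUEUES bits, regardless of how many users share it.
 * users must be non-zero.
 */
unsigned int wake_batch_before(unsigned int total_depth, unsigned int users)
{
	unsigned int depth = (total_depth + users - 1) / users;
	unsigned int min_batch = total_depth >= 4 * SBQ_WAIT_QUEUES ? 4 : 1;

	return clamp_uint(depth / SBQ_WAIT_QUEUES, min_batch, SBQ_WAKE_BATCH);
}

/* Post-patch logic: the lower bound is always 1. users must be non-zero. */
unsigned int wake_batch_after(unsigned int total_depth, unsigned int users)
{
	unsigned int depth = (total_depth + users - 1) / users;

	return clamp_uint(depth / SBQ_WAIT_QUEUES, 1, SBQ_WAKE_BATCH);
}

/*
 * Hypothetical helper: how many full batches complete (one wait queue
 * woken per batch) when freed_tags tags are released.
 */
unsigned int batches_completed(unsigned int freed_tags, unsigned int wake_batch)
{
	return freed_tags / wake_batch;
}
```

For the case Jan asks about (32 tags, 32 active users), wake_batch_before(32, 32)
gives 4 and wake_batch_after(32, 32) gives 1. The "4 * 8 = 32" argument
corresponds to batches_completed(32, 4) == 8, which only holds if all 32
tags are eventually freed; the patch description is concerned with the case
where a single active hctx holds far fewer than 32 tags.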