linux-kernel - Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives with multiple hctx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <01cb0e49-1154-33db-f572-3960c972fe08@huawei.com>
Date:   Tue, 14 Jun 2022 21:15:36 +0800
From:   Yu Kuai <yukuai3@...wei.com>
To:     Ming Lei <ming.lei@...hat.com>
CC:     <axboe@...nel.dk>, <djeffery@...hat.com>, <bvanassche@....org>,
        <linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <yi.zhang@...wei.com>
Subject: Re: [PATCH -next] blk-mq: fix boot time regression for scsi drives
 with multiple hctx

在 2022/06/14 15:31, Ming Lei 写道:
> On Tue, Jun 14, 2022 at 03:14:10PM +0800, Yu Kuai wrote:
>> We found that boot time is increased for about 8s after upgrading kernel
>> from v4.19 to v5.10(megaraid-sas is used in the environment).
> 
> But 'blk-mq: clearing flush request reference in tags->rqs[]' was merged
> to v5.14, :-)
Hi,

Yes, but this patch is applied to 5.10 stable, thus we backport in our
v5.10. Sorry that I didn't mention that.

> 
>>
>> Following is where the extra time is spent:
>>
>> 
>>   __scsi_remove_device
>>    blk_cleanup_queue
>>     blk_mq_exit_queue
>>      blk_mq_exit_hw_queues
>>       blk_mq_exit_hctx
>>        blk_mq_clear_flush_rq_mapping -> function latency is 0.1ms
> 
> So queue_depth looks pretty long, is it 4k?
No, in the environment, it's just 32, and nr_hw_queues is 128, which
means each blk_cleanup_queue() will cost about 10-20 ms.

> 
> But if it is 0.1ms, how can the 8sec delay be caused? That requires 80K hw queues
> for making so long, so I guess there must be other delay added by the feature
> of BLK_MQ_F_TAG_HCTX_SHARED.

Please see details in the reasons 2), scsi scan will call
__scsi_remove_device() a lot of times(each host, each channel, each
target).
> 
>>         cmpxchg
>>
>> There are three reasons:
>> 1) megaraid-sas is using multiple hctxs in v5.10, thus blk_mq_exit_hctx()
>> will be called much more times in v5.10 compared to v4.19.
>> 2) scsi will scan for each target thus __scsi_remove_device() will be
>> called for many times.
>> 3) blk_mq_clear_flush_rq_mapping() is introduced after v4.19, it will
>> call cmpxchg() for each request, and function latency is abount 0.1ms.
>>
>> Since that blk_mq_clear_flush_rq_mapping() will only be called while the
>> queue is freezed already, which means there is no inflight request,
>> it's safe to set NULL for 'tags->rqs[]' directly instead of using
>> cmpxchg(). Tests show that with this change, function latency of
>> blk_mq_clear_flush_rq_mapping() is about 1us, and boot time is not
>> increased.
> 
> tags is shared among all LUNs attached to the host, so freezing single
> request queue here means nothing, so your patch doesn't work.

You'are right, I forgot about that tags can be shared.

> 
> Please test the following patch, and see if it can improve boot delay for
> your case.
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e9bf950983c7..1463076a527c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3443,8 +3443,9 @@ static void blk_mq_exit_hctx(struct request_queue *q,
>   	if (blk_mq_hw_queue_mapped(hctx))
>   		blk_mq_tag_idle(hctx);
>   
> -	blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> -			set->queue_depth, flush_rq);
> +	if (blk_queue_init_done(q))
> +		blk_mq_clear_flush_rq_mapping(set->tags[hctx_idx],
> +				set->queue_depth, flush_rq);
>   	if (set->ops->exit_request)
>   		set->ops->exit_request(set, flush_rq, hctx_idx);
>   

Thanks for the patch, I test it and boot delay is fixed.

Kuai
> 
> 
> Thanks,
> Ming
> 
> .
>