Message-ID: <345e9d6e-8bb2-3d43-4c3c-cc16fa7dd8c1@huaweicloud.com>
Date: Tue, 2 Sep 2025 16:47:25 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Xue He <xue01.he@...sung.com>, axboe@...nel.dk
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] block: plug attempts to batch allocate tags multiple
times
Hi,
On 2025/09/01 16:22, Xue He wrote:
> From: hexue <xue01.he@...sung.com>
>
> In the existing plug mechanism, tags are allocated in batches based on
> the number of requests. However, testing has shown that the plug
> attempts batch allocation of tags only once, at the beginning of a
> batch of I/O operations. Since the tag_mask does not always have
> enough available tags to satisfy the requested count, a full batch
> allocation is not guaranteed to succeed each time. The remaining tags
> are then allocated individually (which happens frequently), incurring
> a single-tag allocation overhead for each of them.
>
> This patch aims to allow the remaining I/O operations to retry batch
> allocation of tags, reducing the overhead caused by multiple
> individual tag allocations.
>
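For context, the pre-patch flow in blk_mq_get_new_requests() looks
roughly like the following (a simplified sketch taken from the lines
the diff below removes; details elided):

	if (plug) {
		data.nr_tags = plug->nr_ios;	/* ask for the whole batch */
		plug->nr_ios = 1;		/* reset: later allocations ask for one */
		data.cached_rqs = &plug->cached_rqs;
	}
	rq = __blk_mq_alloc_requests(&data);
	/*
	 * __blk_mq_alloc_requests() tries __blk_mq_alloc_requests_batch()
	 * first, which may get fewer tags than requested from the
	 * tag_mask. Once the cached requests are consumed, every
	 * remaining request of the batch is allocated with nr_tags == 1.
	 */
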
> ------------------------------------------------------------------------
> test result
> During testing of a PCIe Gen4 SSD (Samsung PM9A3), the perf tool
> showed a reduction in CPU usage: the share of the
> __blk_mq_alloc_requests() function dropped from 1.39% to 0.82% after
> the modification.
>
> Additionally, performance variations were observed on different devices.
> workload: randread
> blocksize: 4k
> threads: 1
> ------------------------------------------------------------------------
>                 PCIe Gen3 SSD   PCIe Gen4 SSD   PCIe Gen5 SSD
> native kernel   553k iops       633k iops       793k iops
> modified        553k iops       635k iops       801k iops
>
> With Optane SSDs, the performance is as follows:
> two devices, one thread
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
How many hw_queues does your nvme have, and how many tags are there in
each hw_queue? I feel it's unlikely that tags can be exhausted; usually
the cpu will become the bottleneck first.
> base: 6.4 Million IOPS
> patch: 6.49 Million IOPS
>
> two devices, two threads
> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>
> base: 7.34 Million IOPS
> patch: 7.48 Million IOPS
> -------------------------------------------------------------------------
>
> Signed-off-by: hexue <xue01.he@...sung.com>
> ---
> block/blk-mq.c | 8 +++++---
> 1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index b67d6c02eceb..1fb280764b76 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
> if (blk_queue_enter(q, flags))
> return NULL;
>
> - plug->nr_ios = 1;
> -
> rq = __blk_mq_alloc_requests(&data);
> + plug->nr_ios = data.nr_tags;
> +
> if (unlikely(!rq))
> blk_queue_exit(q);
> return rq;
> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>
> if (plug) {
> data.nr_tags = plug->nr_ios;
> - plug->nr_ios = 1;
> data.cached_rqs = &plug->cached_rqs;
> }
>
> rq = __blk_mq_alloc_requests(&data);
> + if (plug)
> + plug->nr_ios = data.nr_tags;
> +
> if (unlikely(!rq))
> rq_qos_cleanup(q, bio);
> return rq;
>
In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() fails,
data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still
set plug->nr_ios to 1 in that case.
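The relevant code, paraphrased from __blk_mq_alloc_requests()
(surrounding lines elided):

	if (data->nr_tags > 1) {
		rq = __blk_mq_alloc_requests_batch(data);
		if (rq)
			return rq;
		/* batch allocation failed: fall back to a single tag */
		data->nr_tags = 1;
	}
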
What am I missing?
Thanks,
Kuai