Message-Id: <20250903084135.2860-1-xue01.he@samsung.com>
Date: Wed,  3 Sep 2025 08:41:35 +0000
From: Xue He <xue01.he@...sung.com>
To: yukuai1@...weicloud.com, axboe@...nel.dk
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
	yukuai3@...wei.com
Subject: Re: [PATCH] block: plug attempts to batch allocate tags multiple
 times

On 2025/09/02 08:47 AM, Yu Kuai wrote:
>On 2025/09/01 16:22, Xue He wrote:
......
>> This patch aims to allow the remaining I/O operations to retry batch
>> allocation of tags, reducing the overhead caused by multiple
>> individual tag allocations.
>> 
>> ------------------------------------------------------------------------
>> test result
>> During testing on a PCIe Gen4 Samsung PM9A3 SSD, the perf tool showed
>> a CPU usage improvement: the original __blk_mq_alloc_requests function
>> accounted for 1.39% of CPU time, which decreased to 0.82% after the
>> modification.
>> 
>> Additionally, performance variations were observed on different devices.
>> workload: randread
>> blocksize: 4k
>> threads: 1
>> ------------------------------------------------------------------------
>>                    PCIe Gen3 SSD   PCIe Gen4 SSD    PCIe Gen5 SSD
>> native kernel     553k iops       633k iops        793k iops
>> modified          553k iops       635k iops        801k iops
>> 
>> With Optane SSDs, the performance is as follows:
>> two devices, one thread
>> cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>> 
>
>How many hw_queues and how many tags in each hw_queues in your nvme?
>I feel it's unlikely that tags can be exhausted, usually cpu will become
>bottleneck first.

The information for my nvme device is as follows:
number of CPU: 16
memory: 16G
nvme nvme0: 16/0/16 default/read/poll queue
cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
1023

In more precise terms, I think it is not that the tags are fully exhausted,
but rather that after scanning the bitmap for free bits, the remaining
contiguous bits are insufficient to meet the request (some are available,
but not enough). The specific function involved is __sbitmap_queue_get_batch
in lib/sbitmap.c:
		get_mask = ((1UL << nr_tags) - 1) << nr;
		if (nr_tags > 1) {
			/* debug instrumentation added for this test */
			printk("before %lu\n", get_mask);
		}
		while (!atomic_long_try_cmpxchg(ptr, &val,
						get_mask | val))
			;
		get_mask = (get_mask & ~val) >> nr;

During this batch acquisition of contiguous free bits, an atomic operation
is performed, and the tag mask actually obtained can differ from the one
originally requested.

Am I missing something?

>> base: 6.4 Million IOPS
>> patch: 6.49 Million IOPS
>> 
>> two devices, two threads
>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>> 
>> base: 7.34 Million IOPS
>> patch: 7.48 Million IOPS
>> -------------------------------------------------------------------------
>> 
>> Signed-off-by: hexue <xue01.he@...sung.com>
>> ---
>>   block/blk-mq.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>> 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index b67d6c02eceb..1fb280764b76 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>   	if (blk_queue_enter(q, flags))
>>   		return NULL;
>>   
>> -	plug->nr_ios = 1;
>> -
>>   	rq = __blk_mq_alloc_requests(&data);
>> +	plug->nr_ios = data.nr_tags;
>> +
>>   	if (unlikely(!rq))
>>   		blk_queue_exit(q);
>>   	return rq;
>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>   
>>   	if (plug) {
>>   		data.nr_tags = plug->nr_ios;
>> -		plug->nr_ios = 1;
>>   		data.cached_rqs = &plug->cached_rqs;
>>   	}
>>   
>>   	rq = __blk_mq_alloc_requests(&data);
>> +	if (plug)
>> +		plug->nr_ios = data.nr_tags;
>> +
>>   	if (unlikely(!rq))
>>   		rq_qos_cleanup(q, bio);
>>   	return rq;
>> 
>
>In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>plug->nr_ios to 1 in this case.
>
>What am I missing?

Yes, you are right: if __blk_mq_alloc_requests_batch() fails outright, it
will be set to 1. However, in the case I am describing it did not fail to
execute; rather, the allocated number of tags was insufficient, since only
a partial number were allocated. The function is therefore considered to
have executed successfully.

>Thanks,
>Kuai
>

Thanks,
Xue
