linux-kernel - Re: [PATCH] block: plug attempts to batch allocate tags multiple times

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <bfc12406-e608-b3fa-45e7-2105d9cc15bf@huaweicloud.com>
Date: Wed, 3 Sep 2025 17:35:40 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Xue He <xue01.he@...sung.com>, yukuai1@...weicloud.com, axboe@...nel.dk
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: [PATCH] block: plug attempts to batch allocate tags multiple
 times

Hi,

在 2025/09/03 16:41, Xue He 写道:
> On 2025/09/02 08:47 AM, Yu Kuai wrote:
>> On 2025/09/01 16:22, Xue He wrote:
> ......
>>> This patch aims to allow the remaining I/O operations to retry batch
>>> allocation of tags, reducing the overhead caused by multiple
>>> individual tag allocations.
>>>
>>> ------------------------------------------------------------------------
>>> test result
>>> During testing of the PCIe Gen4 SSD Samsung PM9A3, the perf tool
>>> observed CPU improvements. The CPU usage of the original function
>>> _blk_mq_alloc_requests function was 1.39%, which decreased to 0.82%
>>> after modification.
>>>
>>> Additionally, performance variations were observed on different devices.
>>> workload:randread
>>> blocksize:4k
>>> thread:1
>>> ------------------------------------------------------------------------
>>>                     PCIe Gen3 SSD   PCIe Gen4 SSD    PCIe Gen5 SSD
>>> native kernel     553k iops       633k iops        793k iops
>>> modified          553k iops       635k iops        801k iops
>>>
>>> with Optane SSDs, the performance like
>>> two device one thread
>>> cmd :sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>
>> How many hw_queues and how many tags in each hw_queues in your nvme?
>> I feel it's unlikely that tags can be exhausted, usually cpu will become
>> bottleneck first.
> 
> the information of my nvme like this:
> number of CPU: 16
> memory: 16G
> nvme nvme0: 16/0/16 default/read/poll queue
> cat /sys/class/nvme/nvme0/nvme0n1/queue/nr_requests
> 1023
> 
> In more precise terms, I think it is not that the tags are fully exhausted,
> but rather that after scanning the bitmap for free bits, the remaining
> contiguous bits are nsufficient to meet the requirement (have but not enough).
> The specific function involved is __sbitmap_queue_get_batch in lib/sbitmap.c.
>                      get_mask = ((1UL << nr_tags) - 1) << nr;
>                      if (nr_tags > 1) {
>                              printk("before %ld\n", get_mask);
>                      }
>                      while (!atomic_long_try_cmpxchg(ptr, &val,
>                                                        get_mask | val))
>                              ;
>                      get_mask = (get_mask & ~val) >> nr;
> 
> where during the batch acquisition of contiguous free bits, an atomic operation
> is performed, resulting in the actual tag_mask obtained differing from the
> originally requested one.

Yes, so this function will likely to obtain less tags than nr_tags,the
mask is always start from first zero bit with nr_tags bit, and
sbitmap_deferred_clear() is called uncondionally, it's likely there are
non-zero bits within this range.

Just wonder, do you consider fixing this directly in
__blk_mq_alloc_requests_batch()?

  - call sbitmap_deferred_clear() and retry on allocation failure, so
that the whole word can be used even if previous allocated request are
done, especially for nvme with huge tag depths;
  - retry blk_mq_get_tags() until data->nr_tags is zero;
> 
> Am I missing something?
> 
>>> base: 6.4 Million IOPS
>>> patch: 6.49 Million IOPS
>>>
>>> two device two thread
>>> cmd: sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1
>>> -n1 -r4 /dev/nvme0n1 /dev/nvme1n1
>>>
>>> base: 7.34 Million IOPS
>>> patch: 7.48 Million IOPS
>>> -------------------------------------------------------------------------
>>>
>>> Signed-off-by: hexue <xue01.he@...sung.com>
>>> ---
>>>    block/blk-mq.c | 8 +++++---
>>>    1 file changed, 5 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index b67d6c02eceb..1fb280764b76 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -587,9 +587,9 @@ static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
>>>    	if (blk_queue_enter(q, flags))
>>>    		return NULL;
>>>    
>>> -	plug->nr_ios = 1;
>>> -
>>>    	rq = __blk_mq_alloc_requests(&data);
>>> +	plug->nr_ios = data.nr_tags;
>>> +
>>>    	if (unlikely(!rq))
>>>    		blk_queue_exit(q);
>>>    	return rq;
>>> @@ -3034,11 +3034,13 @@ static struct request *blk_mq_get_new_requests(struct request_queue *q,
>>>    
>>>    	if (plug) {
>>>    		data.nr_tags = plug->nr_ios;
>>> -		plug->nr_ios = 1;
>>>    		data.cached_rqs = &plug->cached_rqs;
>>>    	}
>>>    
>>>    	rq = __blk_mq_alloc_requests(&data);
>>> +	if (plug)
>>> +		plug->nr_ios = data.nr_tags;
>>> +
>>>    	if (unlikely(!rq))
>>>    		rq_qos_cleanup(q, bio);
>>>    	return rq;
>>>
>>
>> In __blk_mq_alloc_requests(), if __blk_mq_alloc_requests_batch() failed,
>> data->nr_tags is set to 1, so plug->nr_ios = data.nr_tags will still set
>> plug->nr_ios to 1 in this case.
>>
>> What am I missing?
> 
> yes, you are right, if __blk_mq_alloc_requests_batch() failed, it will set
> to 1. However, in this case, it did not fail to execute; instead, the
> allocated number of tags was insufficient, as only a partial number were
> allocated. Therefore, the function is considered successfully executed.
> 

Thanks for the explanation, I understand this now.

Thanks,
Kuai

>> Thanks,
>> Kuai
>>
> 
> Thanks,
> Xue
> 
> .
>