Message-ID: <0bbb87cb-5774-4d50-86d3-eb118ebd3f1d@kernel.dk>
Date: Wed, 18 Jun 2025 10:25:37 -0600
From: Jens Axboe <axboe@...nel.dk>
To: hexue <xue01.he@...sung.com>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC] block: The Effectiveness of Plug Optimization?

On 6/17/25 11:04 PM, hexue wrote:
> The plug mechanism merges block I/O (bios) to reduce the frequency of
> I/O submission and improve throughput. It greatly reduces disk seek
> overhead on HDDs and plays a key role in optimizing I/O speed. However,
> as storage devices have become faster, high-performance SSDs combined
> with asynchronous submission mechanisms such as io_uring now achieve
> very high I/O processing rates, and the latency introduced by flow
> control and bio merging may reduce throughput to some extent.
> 
> In my testing, I found that plugging adds overhead for highly
> concurrent random I/O and 128K sequential I/O on SSDs, but it still
> provides some benefit for small-block (4K) sequential I/O. Small
> sequential I/O is of course the workload best suited to merging, but
> the current plug logic does not distinguish between these usage
> scenarios.
> 
> I made aggressive modifications to the kernel to disable the plug
> mechanism during I/O submission. Below are the performance differences
> between the baseline (plug enabled), disabling only merging, and
> disabling the plug completely (both merging and flow control):
> 
> ------------------------------------------------------------------------------------
> PCIe Gen4 SSD, 16GB Mem
> cmd (Seq 128K):
>   taskset -c 0 ./t/io_uring -b 131072 -d128 -c32 -s32 -R0 -p1 -F1 -B1 -n1 -r5 /dev/nvme0n1
> cmd (Random 4K):
>   taskset -c 0 ./t/io_uring -b 4096 -d128 -c32 -s32 -R1 -p1 -F1 -B1 -n1 -r5 /dev/nvme0n1
> data unit: IOPS
> ------------------------------------------------------------------------------------
>              Enable plug     Disable merge     Disable plug     Plug enabled vs. disabled
> Seq IO       50100           50133             50125
> Random IO    821K            824K              836K              -1.83%
> ------------------------------------------------------------------------------------
> 
> I repeated the test on faster hardware (a PCIe Gen5 server with a
> PCIe Gen5 SSD) to verify the hypothesis, and found that the gap
> widened further.
> 
> ------------------------------------------------------------------------------------
>              Enable plug     Disable merge     Disable plug     Plug enabled vs. disabled
> Seq IO       88938           89832             89869
> Random IO    1.02M           1.022M            1.06M             -3.92%
> ------------------------------------------------------------------------------------
> 
> The current kernel has flags (REQ_NOMERGE_FLAGS) that control whether
> I/O operations can be merged. However, the decision to plug is made
> solely on whether batch submission is in use
> (state->need_plug = max_ios > 2;). I'm wondering whether this criterion
> is still appropriate for high-speed SSDs.

1M IOPS isn't really high speed, it's "normal" speed. When I did my
previous testing, I used Gen2 Optane, which does 5M per device. But even
flash-based devices these days do 3M+ IOPS.
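
For reference, the check quoted above lives in io_uring's submit path.
Simplified (the exact code varies by kernel version), it looks roughly
like this:

    /*
     * Simplified sketch of io_uring's per-submit plug decision: only
     * bother with a plug if the batch has more than two requests, and
     * tell the block layer how many IOs to expect so it can grab tags
     * in bulk.
     */
    static void io_submit_state_start(struct io_submit_state *state,
                                      unsigned int max_ios)
    {
            state->plug_started = false;
            state->need_plug = max_ios > 2;
            state->submit_nr = max_ios;
    }

    /* later, when the first plug-capable request is prepared: */
    if (def->plug && state->need_plug) {
            state->plug_started = true;
            state->need_plug = false;
            blk_start_plug_nr_ios(&state->plug, state->submit_nr);
    }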

> So the discussion points are:
> 	- Will plugging gradually become unnecessary as hardware devices
> 	get faster?
> 	- Is it reasonable to make flow control an optional
> 	configuration? Or could we change the criteria for deciding
> 	when to apply the plug?
> 	- Are there other thoughts about plugging that we could discuss
> 	now?

Those results are odd. For plugging, the main wins should be grabbing
batches of tags and using ->queue_rqs for queueing the IO on the device
side. In my past testing, those were a major win. But it's been a while
since I've run peak testing, so it's quite possible that we've regressed
there unknowingly, and that's why you're not seeing any wins from
plugging.
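
To illustrate the ->queue_rqs part: when the plug is flushed, the block
layer can hand the whole plugged list to the driver in a single call
instead of dispatching one request at a time. A simplified sketch of
the idea (not the exact mainline code; the list type and signatures
differ between kernel versions, and issue_one_by_one() is just a
stand-in for the per-request path):

    static void flush_plug_sketch(struct blk_plug *plug)
    {
            struct request *rq = rq_list_peek(&plug->mq_list);
            struct request_queue *q = rq->q;

            /* same queue for everything and driver supports batching? */
            if (!plug->multiple_queues && q->mq_ops->queue_rqs)
                    q->mq_ops->queue_rqs(&plug->mq_list); /* one batched call */
            else
                    issue_one_by_one(plug);       /* ->queue_rq per request */
    }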

For your test case, we should allocate 32 tags once, and then use those
32 tags for submission. That's a lot more efficient than allocating tags
one-by-one as IO gets queued. And on the NVMe side, we should be
submitting batches of 32 requests per doorbell write.
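
As a conceptual sketch of the doorbell batching (the helpers here are
hypothetical stand-ins, not the NVMe driver's actual functions): with
->queue_rqs the driver can copy a whole batch of SQEs into the
submission queue and then do a single doorbell (MMIO) write, rather
than one per request:

    static void queue_rqs_sketch(struct request **rqlist)
    {
            struct request *req;
            int queued = 0;

            while ((req = rq_list_pop(rqlist))) {
                    copy_sqe_to_sq(req);  /* hypothetical: build SQE, copy to SQ tail */
                    queued++;
            }
            if (queued)
                    ring_sq_doorbell();   /* hypothetical: one MMIO write per batch */
    }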

Timestamp reductions are also generally a nice win. But without knowing
what your profiles look like, it's impossible to say what's going on at
your end. A lot more details on your runs would be required.
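
One of the timestamp savings plugging enables is caching the clock
across a batch: the first request in a plugged submission samples
ktime_get_ns() and later requests reuse the cached value, roughly along
these lines (simplified sketch):

    static inline u64 blk_time_get_ns_sketch(void)
    {
            struct blk_plug *plug = current->plug;

            if (!plug)
                    return ktime_get_ns();

            /* first caller in this batch samples the clock */
            if (!plug->cur_ktime)
                    plug->cur_ktime = ktime_get_ns();
            return plug->cur_ktime;
    }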

Rather than pontificate on getting rid of plugging, I'd much rather see
some investigation into whether these optimizations are still happening
as they should be. If the answer is no, then what broke it? If the
answer is yes, then why isn't it providing the speedup that it should?

-- 
Jens Axboe
