linux-kernel - [PATCH v3 0/5] blk-mq-sched: support request batch dispatching for sq elevator


Message-Id: <20250806085720.4040507-1-yukuai1@huaweicloud.com>
Date: Wed,  6 Aug 2025 16:57:15 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: dlemoal@...nel.org,
	hare@...e.de,
	jack@...e.cz,
	bvanassche@....org,
	tj@...nel.org,
	josef@...icpanda.com,
	axboe@...nel.dk,
	yukuai3@...wei.com
Cc: cgroups@...r.kernel.org,
	linux-block@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	yukuai1@...weicloud.com,
	yi.zhang@...wei.com,
	yangerkun@...wei.com,
	johnny.chenyi@...wei.com
Subject: [PATCH v3 0/5] blk-mq-sched: support request batch dispatching for sq elevator

From: Yu Kuai <yukuai3@...wei.com>

Changes from v2:
 - add elevator lock/unlock macros in patch 1;
 - improve coding style and commit messages;
 - retest with a new environment;
 - add tests for scsi HDD and nvme;

Changes from v1:
 - the ioc changes are sent separately;
 - reorder patches 1-3 as suggested by Damien;

Currently, both mq-deadline and bfq have a global spin lock that is
grabbed inside elevator methods like dispatch_request, insert_requests,
and bio_merge. This global lock is the main reason mq-deadline and
bfq can't scale very well.

For the dispatch_request method, the current behavior is to dispatch one
request at a time. With multiple dispatching contexts, this behavior, on
the one hand, introduces intense lock contention:

t1:                     t2:                     t3:
lock                    lock                    lock
// grab lock
ops.dispatch_request
unlock
                        // grab lock
                        ops.dispatch_request
                        unlock
                                                // grab lock
                                                ops.dispatch_request
                                                unlock

On the other hand, it messes up the request dispatching order:
t1:

lock
rq1 = ops.dispatch_request
unlock
                        t2:
                        lock
                        rq2 = ops.dispatch_request
                        unlock

lock
rq3 = ops.dispatch_request
unlock

                        lock
                        rq4 = ops.dispatch_request
                        unlock

//rq1,rq3 issue to disk
                        // rq2, rq4 issue to disk

In this case, the elevator dispatch order is rq 1-2-3-4; however, the
order seen by the disk is rq 1-3-2-4: rq2 and rq3 are inverted.
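
The reordering above can be reproduced with a small userspace toy model.
This is only a sketch, not kernel code: it assumes two dispatch contexts
that take the lock in strict round-robin, and that each context issues
everything it pulled in one go, as in the diagram. The names
simulate_issue_order, NR_CTX and MAX_RQS are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CTX  2   /* assumption: two dispatch contexts, t1 and t2 */
#define MAX_RQS 64  /* assumption: upper bound for this toy model */

/*
 * Contexts take turns holding the elevator lock and pull up to `batch`
 * requests per lock hold. Afterwards each context issues what it pulled,
 * context by context. `disk_order` receives the request numbers
 * (1-based, in elevator order) in the order the disk sees them.
 */
static void simulate_issue_order(int nr_rqs, int batch, int *disk_order)
{
	int pulled[NR_CTX][MAX_RQS];
	int count[NR_CTX] = { 0 };
	int next_rq = 1;
	int ctx = 0;
	size_t out = 0;

	assert(batch > 0 && nr_rqs >= 0 && nr_rqs <= MAX_RQS);

	/* Round-robin lock ownership: one hold pulls up to `batch` requests. */
	while (next_rq <= nr_rqs) {
		for (int i = 0; i < batch && next_rq <= nr_rqs; i++)
			pulled[ctx][count[ctx]++] = next_rq++;
		ctx = (ctx + 1) % NR_CTX;
	}

	/* Each context issues its own requests to the disk. */
	for (ctx = 0; ctx < NR_CTX; ctx++)
		for (int i = 0; i < count[ctx]; i++)
			disk_order[out++] = pulled[ctx][i];
}
```

With nr_rqs = 4 and batch = 1 the model yields 1-3-2-4, matching the
diagram; with batch = 2 each context pulls two consecutive requests and
the disk sees 1-2-3-4.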

While dispatching a request, blk_mq_get_dispatch_budget() and
blk_mq_get_driver_tag() must be called, and they are not ready to be
called inside elevator methods; hence introducing a new method like
dispatch_requests is not possible.

In conclusion, this set factors the global lock out of the
dispatch_request method, and supports request batch dispatch by calling
the method multiple times while holding the lock.
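
The idea of calling the method several times under one lock hold can be
sketched in userspace as follows. This is a minimal illustration, not
the code in this set: it uses a pthread mutex in place of the elevator
lock, ignores dispatch budget and driver tags entirely, and all names
(toy_elevator, toy_dispatch_request, toy_dispatch_batch, toy_drain) are
hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-in for an sq elevator with one global lock. */
struct toy_elevator {
	pthread_mutex_t lock;
	int pending;            /* requests still queued */
	long lock_acquisitions; /* how often the lock was taken */
};

/* Stand-in for ops.dispatch_request; the caller must hold e->lock. */
static int toy_dispatch_request(struct toy_elevator *e)
{
	if (e->pending == 0)
		return 0;
	e->pending--;
	return 1;
}

/* Dispatch up to max_batch requests under a single lock hold. */
static int toy_dispatch_batch(struct toy_elevator *e, int max_batch)
{
	int n = 0;

	assert(max_batch > 0);
	pthread_mutex_lock(&e->lock);
	e->lock_acquisitions++;
	while (n < max_batch && toy_dispatch_request(e))
		n++;
	pthread_mutex_unlock(&e->lock);
	return n;
}

/* Drain nr_rqs requests and report how many times the lock was taken. */
static long toy_drain(int nr_rqs, int max_batch)
{
	struct toy_elevator e = { .pending = nr_rqs, .lock_acquisitions = 0 };

	pthread_mutex_init(&e.lock, NULL);
	while (toy_dispatch_batch(&e, max_batch) > 0)
		;
	pthread_mutex_destroy(&e.lock);
	return e.lock_acquisitions;
}
```

In this model, draining 1024 requests takes 1025 lock acquisitions with
max_batch = 1 (the last one finds the queue empty) and 129 with
max_batch = 8, which is the reduction in lock traffic that batching
aims for.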

Test Environment:
arm64 Kunpeng-920, with 4 nodes 128 cores
nvme: HWE52P431T6M005N
scsi HDD: MG04ACA600E attached to hisi_sas_v3

null_blk set up:

modprobe null_blk nr_devices=0 &&
    udevadm settle &&
    cd /sys/kernel/config/nullb &&
    mkdir nullb0 &&
    cd nullb0 &&
    echo 0 > completion_nsec &&
    echo 512 > blocksize &&
    echo 0 > home_node &&
    echo 0 > irqmode &&
    echo 128 > submit_queues &&
    echo 1024 > hw_queue_depth &&
    echo 1024 > size &&
    echo 0 > memory_backed &&
    echo 2 > queue_mode &&
    echo 1 > power ||
    exit $?

null_blk and nvme test script:

[global]
filename=/dev/{nullb0,nvme0n1}
rw=randwrite
bs=4k
iodepth=32
iodepth_batch_submit=8
iodepth_batch_complete=8
direct=1
ioengine=io_uring
time_based

[write]
numjobs=16
runtime=60

scsi HDD test script (note that this test checks whether batch dispatch
affects IO merge):

[global]
filename=/dev/sda
rw=write
bs=4k
iodepth=32
iodepth_batch_submit=1
direct=1
ioengine=libaio

[write]
offset_increment=1g
numjobs=128

Test Result:
1) nullblk: iops test with high IO pressure
|                 | deadline | bfq      |
| --------------- | -------- | -------- |
| before this set | 256k     | 153k     |
| after this set  | 594k     | 283k     |

2) nvme: iops test with high IO pressure
|                 | deadline | bfq      |
| --------------- | -------- | -------- |
| before this set | 258k     | 142k     |
| after this set  | 568k     | 214k     |

3) scsi HDD: io merge test, elevator is deadline
|                 | w/s   | %wrqm | wareq-sz | aqu-sz |
| --------------- | ----- | ----- | -------- | ------ |
| before this set | 92.25 | 96.88 | 128      | 129    |
| after this set  | 92.63 | 96.88 | 128      | 129    |

Yu Kuai (5):
  blk-mq-sched: introduce high level elevator lock
  mq-deadline: switch to use elevator lock
  block, bfq: switch to use elevator lock
  blk-mq-sched: refactor __blk_mq_do_dispatch_sched()
  blk-mq-sched: support request batch dispatching for sq elevator

 block/bfq-cgroup.c   |   6 +-
 block/bfq-iosched.c  |  53 +++++-----
 block/bfq-iosched.h  |   2 -
 block/blk-mq-sched.c | 246 ++++++++++++++++++++++++++++++-------------
 block/blk-mq.h       |  21 ++++
 block/elevator.c     |   1 +
 block/elevator.h     |  14 ++-
 block/mq-deadline.c  |  60 +++++------
 8 files changed, 263 insertions(+), 140 deletions(-)

-- 
2.39.2