Message-ID: <20250611121626.7252-1-abuehaze@amazon.com>
Date: Wed, 11 Jun 2025 12:14:54 +0000
From: Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>
To:
CC: <abuehaze@...zon.com>, <stable@...r.kernel.org>, kernel test robot
<oliver.sang@...el.com>, Hagar Hemdan <hagarhem@...zon.com>, Shaoying Xu
<shaoyi@...zon.com>, Jens Axboe <axboe@...nel.dk>, "Michael S. Tsirkin"
<mst@...hat.com>, Jason Wang <jasowang@...hat.com>, Paolo Bonzini
<pbonzini@...hat.com>, Stefan Hajnoczi <stefanha@...hat.com>,
Eugenio Pérez <eperezma@...hat.com>, Xuan Zhuo
<xuanzhuo@...ux.alibaba.com>, Keith Busch <kbusch@...nel.org>, "Christoph
Hellwig" <hch@....de>, Sagi Grimberg <sagi@...mberg.me>,
<linux-block@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<virtualization@...ts.linux.dev>, <linux-nvme@...ts.infradead.org>
Subject: [PATCH] Revert "block: don't reorder requests in blk_add_rq_to_plug"
This reverts commit e70c301faece15b618e54b613b1fd6ece3dd05b4.
Commit e70c301faece ("block: don't reorder requests in
blk_add_rq_to_plug") reversed the order in which requests are stored in
the blk_plug list, which significantly hurts the merging of bios with
requests already on the plug list. This impact was reported in [1] and
can easily be reproduced with a 4k randwrite fio benchmark on an
NVMe-based SSD without any filesystem on the disk.
My benchmark is:
fio --time_based --name=benchmark --size=50G --rw=randwrite \
--runtime=60 --filename="/dev/nvme1n1" --ioengine=psync \
--randrepeat=0 --iodepth=1 --fsync=64 --invalidate=1 \
--verify=0 --verify_fatal=0 --blocksize=4k --numjobs=4 \
--group_reporting
The benchmark ran on a 1.9 TiB SSD (180K max IOPS) attached to an
i3.16xlarge AWS EC2 instance:
Kernel        | fio BW (MiB/sec) | I/O size (iostat)
--------------+------------------+------------------
6.15.1        | 362              | 2KiB
6.15.1+revert | 660 (+82%)       | 4KiB
I ran iostat while the fio benchmark was running: the average I/O size
seen on the disk is 2KiB without this revert and 4KiB with it. In the
bad case the write bandwidth is capped at around 362 MiB/sec, which is
almost exactly 2KiB * 180K, i.e. we are hitting the SSD's 180K IOPS
limit. With the revert the I/O size doubles to 4KiB, and the bandwidth
almost doubles with it because we no longer hit the IOPS limit.
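For reference, the per-device average request size can be observed with
an extended iostat run along these lines (an illustrative invocation,
not necessarily the exact one used; the relevant column is wareq-sz on
recent sysstat, avgrq-sz on older versions):

  # report extended per-device stats for the benchmark disk every second
  iostat -x nvme1n1 1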
I did some tracing using bpftrace & bcc and concluded that the reason
behind the I/O size discrepancy is that this fio benchmark submits each
4k I/O as two contiguous 2KiB bios. In the good case each pair of bios
is merged into a single 4KiB request that is then submitted to the
disk, while in the bad case the 2KiB bios are submitted to the disk
unmerged, because blk_attempt_plug_merge() fails to merge them as seen
below.
**Without the revert**
[12:12:28]
r::blk_attempt_plug_merge():int:$retval
COUNT EVENT
5618 $retval = 1
176578 $retval = 0
**With the revert**
[12:11:43]
r::blk_attempt_plug_merge():int:$retval
COUNT EVENT
146684 $retval = 0
146686 $retval = 1
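For reference, the return-value counts above are in the output format
of bcc's argdist tool; an invocation along these lines (illustrative,
not necessarily the exact one used) produces them:

  # count blk_attempt_plug_merge() return values, reported every 60s
  argdist -i 60 -C 'r::blk_attempt_plug_merge():int:$retval'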
In blk_attempt_plug_merge() we iterate through the plug list from head
to tail, looking for a request with which the most recently submitted
bio can be merged.
With commit e70c301faece ("block: don't reorder requests in
blk_add_rq_to_plug") the most recent request sits at the tail, so
blk_attempt_plug_merge() fails to merge: it only attempts the merge
against the request at the head of the plug list. We never iterate
across the whole plug list because the loop is exited as soon as
blk_attempt_bio_merge() fails.
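For context, the relevant loop in blk_attempt_plug_merge() looks
roughly like this (a simplified sketch paraphrasing block/blk-merge.c,
not verbatim kernel code):

  rq_list_for_each(&plug->mq_list, rq) {
  	if (rq->q == q) {
  		/* only the first request from this queue is tried */
  		if (blk_attempt_bio_merge(q, rq, bio, nr_segs, false) ==
  		    BIO_MERGE_OK)
  			return true;
  		break;	/* merge failed: bail out, the tail is never reached */
  	}
  	if (!plug->multiple_queues)
  		break;
  }
  return false;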
In commit bc490f81731 ("block: change plugging to use a singly linked
list") the plug list was changed to a singly linked list, so there is
no way to iterate the list from tail to head, which would be the only
way to mitigate the impact on bio merging while keeping commit
e70c301faece ("block: don't reorder requests in blk_add_rq_to_plug").
Given that the plug list was moved to a singly linked list mainly for
performance reasons, revert commit e70c301faece ("block: don't reorder
requests in blk_add_rq_to_plug") for now to mitigate the reported
performance regression.
[1] https://lore.kernel.org/lkml/202412122112.ca47bcec-lkp@intel.com/
Cc: stable@...r.kernel.org # 6.12
Reported-by: kernel test robot <oliver.sang@...el.com>
Reported-by: Hagar Hemdan <hagarhem@...zon.com>
Reported-and-bisected-by: Shaoying Xu <shaoyi@...zon.com>
Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@...zon.com>
---
 block/blk-mq.c             | 4 ++--
 drivers/block/virtio_blk.c | 2 +-
 drivers/nvme/host/pci.c    | 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index c2697db59109..28965cac19fb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1394,7 +1394,7 @@ static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
 	 */
 	if (!plug->has_elevator && (rq->rq_flags & RQF_SCHED_TAGS))
 		plug->has_elevator = true;
-	rq_list_add_tail(&plug->mq_list, rq);
+	rq_list_add_head(&plug->mq_list, rq);
 	plug->rq_count++;
 }
@@ -2846,7 +2846,7 @@ static void blk_mq_dispatch_plug_list(struct blk_plug *plug, bool from_sched)
 			rq_list_add_tail(&requeue_list, rq);
 			continue;
 		}
-		list_add_tail(&rq->queuelist, &list);
+		list_add(&rq->queuelist, &list);
 		depth++;
 	} while (!rq_list_empty(&plug->mq_list));
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 7cffea01d868..7992a171f905 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -513,7 +513,7 @@ static void virtio_queue_rqs(struct rq_list *rqlist)
 		vq = this_vq;
 
 		if (virtblk_prep_rq_batch(req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f1dd804151b1..5f7da42f9dac 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1026,7 +1026,7 @@ static void nvme_queue_rqs(struct rq_list *rqlist)
 		nvmeq = req->mq_hctx->driver_data;
 
 		if (nvme_prep_rq_batch(nvmeq, req))
-			rq_list_add_tail(&submit_list, req);
+			rq_list_add_head(&submit_list, req); /* reverse order */
 		else
 			rq_list_add_tail(&requeue_list, req);
 	}
--
2.47.1