Message-ID: <dd2db843-843f-db15-c54f-f2c44548dee3@huaweicloud.com>
Date: Mon, 21 Apr 2025 20:21:13 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Matt Fleming <mfleming@...udflare.com>, Jens Axboe <axboe@...nel.dk>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
kernel-team <kernel-team@...udflare.com>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: 10x I/O await times in 6.12
Hi,
On 2025/04/21 16:53, Matt Fleming wrote:
> Hey there,
>
> We're moving to 6.12 at Cloudflare and noticed that write await times
> in iostat are 10x what they were in 6.6. After a bit of bpftracing
> (script to find all plug times above 10ms below), it seems like this
> is an accounting error caused by the plug->cur_ktime optimisation
> rather than anything more material.
What is the baseline value? If the skew is only on the order of
microseconds, I/O performance should actually be better.
>
> It appears as though a task can enter __submit_bio() with ->plug set
> and a very stale cur_ktime value on the order of milliseconds. Is this
> expected behaviour? It looks like it leads to inaccurate I/O times.
>
A microsecond-scale skew is expected; however, I don't think
milliseconds is acceptable.
> Thanks,
> Matt
>
> ---->8----
> $ sudo bpftrace -e 'k:__submit_bio { $p = curtask->plug; if ($p != 0) { if (((nsecs - $p->cur_ktime) / 1000) > 10000) { @[kstack] = count(); } } }'
Can you drop this expensive bpftrace script, which might itself affect
IO performance, and attach the probe to __blk_flush_plug instead of
__submit_bio? If nsecs - plug->cur_ktime is still in the milliseconds,
can you check whether the following patch fixes your problem?
Thanks,
Kuai
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae8494d88897..37197502147e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
 		return;
 
 	req->rq_flags |= RQF_IO_STAT;
-	req->start_time_ns = blk_time_get_ns();
+
+	if (!current->plug)
+		req->start_time_ns = blk_time_get_ns();
 
 	/*
 	 * All non-passthrough requests are created from a bio with one
@@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
 	struct request *rq;
 	unsigned int depth;
+	u64 now;
 
 	/*
 	 * We may have been called recursively midway through handling
@@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 	depth = plug->rq_count;
 	plug->rq_count = 0;
 
+	now = ktime_get_ns();
+	rq_list_for_each(&plug->mq_list, rq)
+		rq->start_time_ns = now;
+
 	if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
 		struct request_queue *q;
> Attaching 1 probe...
> ^C
>
> @[
> __submit_bio+1
> submit_bio_noacct_nocheck+390
> submit_bio_wait+92
> swap_writepage_bdev_sync+262
> swap_writepage+315
> pageout+291
> shrink_folio_list+1835
> shrink_lruvec+1683
> shrink_node+784
> balance_pgdat+877
> kswapd+496
> kthread+207
> ret_from_fork+49
> ret_from_fork_asm+26
> ]: 184
> @[
> __submit_bio+1
> submit_bio_noacct_nocheck+390
> _xfs_buf_ioapply+599
> __xfs_buf_submit+110
> xfs_buf_delwri_submit_buffers+399
> xfsaild+691
> kthread+207
> ret_from_fork+49
> ret_from_fork_asm+26
> ]: 28123
>
> .
>