Message-ID: <dd2db843-843f-db15-c54f-f2c44548dee3@huaweicloud.com>
Date: Mon, 21 Apr 2025 20:21:13 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Matt Fleming <mfleming@...udflare.com>, Jens Axboe <axboe@...nel.dk>
Cc: linux-block@...r.kernel.org, linux-kernel@...r.kernel.org,
 kernel-team <kernel-team@...udflare.com>, "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: 10x I/O await times in 6.12

Hi,

On 2025/04/21 16:53, Matt Fleming wrote:
> Hey there,
> 
> We're moving to 6.12 at Cloudflare and noticed that write await times
> in iostat are 10x what they were in 6.6. After a bit of bpftracing
> (script to find all plug times above 10ms below), it seems like this
> is an accounting error caused by the plug->cur_ktime optimisation
> rather than anything more material.

How stale is the value in the common case? If it's on the order of
microseconds, IO performance should actually be better overall.
> 
> It appears as though a task can enter __submit_bio() with ->plug set
> and a very stale cur_ktime value on the order of milliseconds. Is this
> expected behaviour? It looks like it leads to inaccurate I/O times.
> 

Staleness on the order of microseconds is expected; however, I don't
think milliseconds is acceptable.
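
For context, the caching works roughly like this (paraphrasing
blk_time_get_ns() in block/blk.h; simplified, and details vary by
kernel version):

/* Sketch of the cached-time fast path, not the verbatim kernel code. */
static inline u64 blk_time_get_ns(void)
{
        struct blk_plug *plug = current->plug;

        /* No plug, or not in task context: read the clock directly. */
        if (!plug || !in_task())
                return ktime_get_ns();

        /*
         * Cache the first reading for the lifetime of the plug; every
         * later caller inside the same plugged section sees this value,
         * which is why a long-lived plug can hand out a stale timestamp.
         */
        if (!plug->cur_ktime) {
                plug->cur_ktime = ktime_get_ns();
                current->flags |= PF_BLOCK_TS;
        }
        return plug->cur_ktime;
}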
> Thanks,
> Matt
> 
> ---->8----
> $ sudo bpftrace -e 'k:__submit_bio { $p = curtask->plug; if ($p != 0)
> { if (((nsecs - $p->cur_ktime) / 1000) > 10000) { @[kstack] =
> count();}}}'

Can you drop this expensive bpftrace script, which might itself affect
IO performance, and attach the probe directly to __blk_flush_plug
instead of __submit_bio? If nsecs - plug->cur_ktime is still in the
millisecond range there, can you check whether the following patch
fixes your problem?
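
For example, something like this variant of your one-liner (same check,
probe moved; assuming plug->cur_ktime resolves through BTF as in your
original):

$ sudo bpftrace -e 'k:__blk_flush_plug { $p = curtask->plug;
    if ($p != 0 && ((nsecs - $p->cur_ktime) / 1000) > 10000) {
        @[kstack] = count(); } }'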

Thanks,
Kuai

diff --git a/block/blk-mq.c b/block/blk-mq.c
index ae8494d88897..37197502147e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1095,7 +1095,9 @@ static inline void blk_account_io_start(struct request *req)
                return;

        req->rq_flags |= RQF_IO_STAT;
-       req->start_time_ns = blk_time_get_ns();
+
+       if (!current->plug)
+               req->start_time_ns = blk_time_get_ns();

        /*
         * All non-passthrough requests are created from a bio with one
@@ -2874,6 +2876,7 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
 {
        struct request *rq;
        unsigned int depth;
+       u64 now;

        /*
         * We may have been called recursively midway through handling
@@ -2887,6 +2890,10 @@ void blk_mq_flush_plug_list(struct blk_plug *plug, bool from_schedule)
        depth = plug->rq_count;
        plug->rq_count = 0;

+       now = ktime_get_ns();
+       rq_list_for_each(&plug->mq_list, rq)
+               rq->start_time_ns = now;
+
        if (!plug->multiple_queues && !plug->has_elevator && !from_schedule) {
                struct request_queue *q;

> Attaching 1 probe...
> ^C
> 
> @[
>      __submit_bio+1
>      submit_bio_noacct_nocheck+390
>      submit_bio_wait+92
>      swap_writepage_bdev_sync+262
>      swap_writepage+315
>      pageout+291
>      shrink_folio_list+1835
>      shrink_lruvec+1683
>      shrink_node+784
>      balance_pgdat+877
>      kswapd+496
>      kthread+207
>      ret_from_fork+49
>      ret_from_fork_asm+26
> ]: 184
> @[
>      __submit_bio+1
>      submit_bio_noacct_nocheck+390
>      _xfs_buf_ioapply+599
>      __xfs_buf_submit+110
>      xfs_buf_delwri_submit_buffers+399
>      xfsaild+691
>      kthread+207
>      ret_from_fork+49
>      ret_from_fork_asm+26
> ]: 28123
> 
> .
> 

