linux-kernel - Re: 10x I/O await times in 6.12

Message-ID: <ceefa848-082b-0153-e41d-26c0559f3666@huaweicloud.com>
Date: Tue, 22 Apr 2025 09:39:17 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Matt Fleming <mfleming@...udflare.com>, Keith Busch <kbusch@...nel.org>
Cc: Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
 linux-kernel@...r.kernel.org, kernel-team <kernel-team@...udflare.com>,
 "yukuai (C)" <yukuai3@...wei.com>
Subject: Re: 10x I/O await times in 6.12

Hi,

On 2025/04/22 3:10, Matt Fleming wrote:
> On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@...nel.org> wrote:
>>
>> Not sure. I'm also guessing cond_resched is the reason for your
>> observation, so that might be worth confirming is happening in whatever
>> IO paths your workload is taking in case there's some other
>> explanation.
> 
> Yep, you're spot on. We're hitting cond_resched() from various code
> paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
> rmap_walk_file(), etc, etc).

All plugged IO must be submitted before the task is scheduled out, so
there is no point in going in this direction. :(

Please check the other mail I sent in reply to your original report;
this would make sense if a task keeps running on one CPU for milliseconds.

Thanks,
Kuai

> 
> sudo bpftrace -e 'k:psi_task_switch {
>         $prev = (struct task_struct *)arg0;
>         if ($prev->plug != 0) {
>                 if ($prev->plug->cur_ktime) {
>                         @[kstack(3)] = count();
>                 }
>         }
> }'
> Attaching 1 probe...
> ^C
> 
> @[
>      psi_task_switch+5
>      __schedule+2081
>      __cond_resched+51
> ]: 3044
> 
>> fs-writeback happens to work around it by unplugging if it knows
>> cond_resched is going to schedule. The decision to unplug here wasn't
>> necessarily because of the plug's ktime, but it gets the job done:
>>
>>    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
>>
>> Doesn't really scale well to copy this for every caller of
>> cond_resched(), though. An io specific helper implementation of
>> cond_resched might help.
>>
>> Or if we don't want cond_resched to unplug (though I feel like you would
>> normally want that), I think we could invalidate the ktime when
>> scheduling to get the stats to read the current ktime after the process
>> is scheduled back in.
> 
> Thanks. Makes sense to me. I'll try this out and report back.
> 
>> ---
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
>>                   * between schedule and now.
>>                   */
>>          } while (need_resched());
>> +
>> +       if (current->flags & PF_BLOCK_TS)
>> +               blk_plug_invalidate_ts(current);
>>   }
>>
>>   #ifdef CONFIG_PREEMPTION
>> --
> 
> .
>
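
The effect discussed above (a timestamp cached in the plug going stale while
the task is scheduled out, and the quoted diff dropping it on reschedule) can
be sketched as a small userspace model. This is only an illustration, not the
kernel code: fake_clock_ns, struct plug, struct task, plug_time_get_ns(),
plug_invalidate_ts(), resched() and measured_latency_ns() are hypothetical
names, and the 5 ms off-CPU time and 100 us device time are made-up numbers.
Only PF_BLOCK_TS, cur_ktime and the idea of invalidating the timestamp come
from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define PF_BLOCK_TS 0x1u /* model flag: the task's plug holds a cached time */

struct plug { uint64_t cur_ktime; };  /* 0 means "no cached timestamp" */
struct task { struct plug *plug; unsigned int flags; };

static uint64_t fake_clock_ns;        /* stand-in for the real clock */

/* Return the block-layer "now"; cached in the plug after the first read. */
static uint64_t plug_time_get_ns(struct task *t)
{
    if (!t->plug)
        return fake_clock_ns;
    if (!t->plug->cur_ktime) {
        t->plug->cur_ktime = fake_clock_ns;
        t->flags |= PF_BLOCK_TS;
    }
    return t->plug->cur_ktime;
}

/* Drop the cached time so the next reader samples the clock again. */
static void plug_invalidate_ts(struct task *t)
{
    if (t->plug)
        t->plug->cur_ktime = 0;
    t->flags &= ~PF_BLOCK_TS;
}

/* Model being scheduled out for off_cpu_ns while the plug stays active. */
static void resched(struct task *t, uint64_t off_cpu_ns, int invalidate)
{
    fake_clock_ns += off_cpu_ns;
    if (invalidate && (t->flags & PF_BLOCK_TS))
        plug_invalidate_ts(t);
}

/* Latency accounted to an I/O that is issued right after a 5 ms reschedule. */
uint64_t measured_latency_ns(int invalidate)
{
    struct plug plug = { 0 };
    struct task t = { &plug, 0 };

    fake_clock_ns = 1000;
    (void)plug_time_get_ns(&t);            /* first I/O caches t = 1000 */
    resched(&t, 5000000, invalidate);      /* 5 ms off CPU */
    uint64_t start = plug_time_get_ns(&t); /* start stamp of the second I/O */
    fake_clock_ns += 100000;               /* device takes 100 us */
    assert(fake_clock_ns >= start);
    return fake_clock_ns - start;          /* completion reads the real clock */
}
```

In this model, measured_latency_ns(0) charges the whole off-CPU period to the
second I/O because its start stamp is the stale cached value, while
measured_latency_ns(1) reports only the device time.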