Message-ID: <ceefa848-082b-0153-e41d-26c0559f3666@huaweicloud.com>
Date: Tue, 22 Apr 2025 09:39:17 +0800
From: Yu Kuai <yukuai1@...weicloud.com>
To: Matt Fleming <mfleming@...udflare.com>, Keith Busch <kbusch@...nel.org>
Cc: Jens Axboe <axboe@...nel.dk>, linux-block@...r.kernel.org,
linux-kernel@...r.kernel.org, kernel-team <kernel-team@...udflare.com>,
"yukuai (C)" <yukuai3@...wei.com>
Subject: Re: 10x I/O await times in 6.12
Hi,
On 2025/04/22 3:10, Matt Fleming wrote:
> On Mon, 21 Apr 2025 at 19:53, Keith Busch <kbusch@...nel.org> wrote:
>>
>> Not sure. I'm also guessing cond_resched is the reason for your
>> observation, so that might be worth confirming is happening in whatever
>> IO paths your workload is taking in case there's some other
>> explanation.
>
> Yep, you're spot on. We're hitting cond_resched() from various code
> paths (xfs_buf_delwri_submit_buffers(), swap_writepage(),
> rmap_walk_file(), etc, etc).
All plugged IO must be submitted before the task is scheduled out, so
there is no point in pursuing this direction. :(
Please check the other mail in which I replied to your original report;
it'll make sense if a task keeps running on one CPU for milliseconds.
Thanks,
Kuai
>
> sudo bpftrace -e 'k:psi_task_switch {
>     $prev = (struct task_struct *)arg0;
>     if ($prev->plug != 0) {
>         if ($prev->plug->cur_ktime) {
>             @[kstack(3)] = count();
>         }
>     }
> }'
> Attaching 1 probe...
> ^C
>
> @[
> psi_task_switch+5
> __schedule+2081
> __cond_resched+51
> ]: 3044
>
>> fs-writeback happens to work around it by unplugging if it knows
>> cond_resched is going to schedule. The decision to unplug here wasn't
>> necessarily because of the plug's ktime, but it gets the job done:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/fs-writeback.c?h=v6.15-rc3#n1984
>>
>> Doesn't really scale well to copy this for every caller of
>> cond_resched(), though. An io specific helper implementation of
>> cond_resched might help.
>>
>> Or if we don't want cond_resched to unplug (though I feel like you would
>> normally want that), I think we could invalidate the ktime when
>> scheduling to get the stats to read the current ktime after the process
>> is scheduled back in.
>
> Thanks. Makes sense to me. I'll try this out and report back.
>
>> ---
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6978,6 +6978,9 @@ static void __sched notrace preempt_schedule_common(void)
>>                  * between schedule and now.
>>                  */
>>         } while (need_resched());
>> +
>> +       if (current->flags & PF_BLOCK_TS)
>> +               blk_plug_invalidate_ts(current);
>>  }
>>
>> #ifdef CONFIG_PREEMPTION
>> --
>
> .
>
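For readers following the thread, the effect of a stale cached plug timestamp, and of the invalidation proposed in the quoted patch, can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it is not kernel code, every model_* name is hypothetical, and the "clock" is a plain counter rather than ktime.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of a plug that caches a timestamp.
 * The names only loosely mirror the thread; none of this is kernel code.
 */
struct model_plug {
	uint64_t cur_ktime;	/* cached time in ns, 0 means "nothing cached" */
};

static uint64_t model_clock_ns;	/* simulated monotonic clock */

/* Return the cached time if there is one, else read the clock and cache it. */
static uint64_t model_plug_time(struct model_plug *plug)
{
	/* 0 is reserved as the "nothing cached" sentinel in this model. */
	assert(model_clock_ns != 0);

	if (!plug->cur_ktime)
		plug->cur_ktime = model_clock_ns;
	return plug->cur_ktime;
}

/* Drop the cached value so that the next reader fetches a fresh time. */
static void model_plug_invalidate(struct model_plug *plug)
{
	plug->cur_ktime = 0;
}

/*
 * Cache a timestamp, let gap_ns pass (for example while the task is off
 * the CPU), then stamp one I/O that takes service_ns to complete.
 * Returns the latency that accounting based on the plug time would report.
 */
static uint64_t model_measured_latency(int invalidate, uint64_t gap_ns,
				       uint64_t service_ns)
{
	struct model_plug plug = { 0 };
	uint64_t start;

	model_clock_ns = 1000;
	(void)model_plug_time(&plug);	/* an earlier I/O caches the time */

	model_clock_ns += gap_ns;	/* time passes with the plug still held */
	if (invalidate)
		model_plug_invalidate(&plug);

	start = model_plug_time(&plug);	/* start stamp of the next I/O */
	model_clock_ns += service_ns;	/* the I/O completes */
	return model_clock_ns - start;
}
```

In this model, model_measured_latency(0, gap, service) reports gap + service because the start stamp is stale, while model_measured_latency(1, gap, service) reports only service.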