Message-ID: <CABk29NsqoF3U9nECBxh2cDWoPn=7cX+0sDfnpysNRb9HUcRyHg@mail.gmail.com>
Date: Thu, 21 Aug 2025 11:13:48 -0700
From: Josh Don <joshdon@...gle.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: Christian Loehle <christian.loehle@....com>, "Chen, Yu C" <yu.c.chen@...el.com>,
linux-kernel@...r.kernel.org, mingo@...hat.com, bsegall@...gle.com,
vschneid@...hat.com, juri.lelli@...hat.com, rostedt@...dmis.org,
mgorman@...e.de, dietmar.eggemann@....com, vincent.guittot@...aro.org,
peterz@...radead.org, viresh.kumar@...aro.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()
On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!
Thanks, missed this previously :)
>
> On 2025/8/20 21:53, Christian Loehle wrote:
> > On 8/19/25 16:32, Chen, Yu C wrote:
> >> On 8/18/2025 9:24 PM, Christian Loehle wrote:
> >>> On 8/18/25 13:47, Chengming Zhou wrote:
> >>>> These sched_idle_cpu() checks in select_task_rq_fair() are based
> >>>> on the assumption that the wakee task can pick a cpu running a
> >>>> sched_idle task and preempt it to start running faster than it
> >>>> could by picking an idle cpu and preempting the idle task.
> >>>>
> >>>> This assumption is correct, but it also brings some problems:
> >>>>
> >>>> 1. work conservation: sched_idle tasks often also pick a cpu that
> >>>> is already running a sched_idle task, instead of utilizing a truly
> >>>> idle cpu, so work conservation is somewhat broken.
> >>>>
> >>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
> >>>> sched_idle group is running. See the simple example below.
> >>>>
> >>>>                  root
> >>>>                 /    \
> >>>>           kubepods    system
> >>>>           /      \
> >>>>    burstable    besteffort
> >>>>                 (cpu.idle == 1)
Thanks for bringing attention to this scenario; it's a case I've
worried about but haven't had a good idea for fixing. Ideally we could
use find_matching_se() to compare the wakee and the running task at
their common ancestor level, but we want to do these checks locklessly
and quickly, so that's out of the question. Agree on it being a hard
problem.
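
For reference, the existing check is flat and per-rq; from memory it
is roughly the below (paraphrased, not copied from any particular
tree, and some of these fields have been renamed over time):

static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}

A flat count like this can't express that, in the example above,
besteffort is idle only relative to burstable under kubepods; at the
root level, kubepods and system still compete normally.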
One idea is to at least handle the (I think fairly typical) scenario
of a root-level sched_idle group well: a root-level sched_idle group
is trivially idle with respect to anything else in the system that is
not also nested under a root-level sched_idle group. It would be
fairly easy to track a nr_idle_queued field on the cfs_rq, as well as
to cache on task enqueue whether the task nests under a root-level
sched_idle group.
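
Untested sketch of what I mean; other than nr_idle_queued, the field
and helper names are made up for illustration, and it assumes a new
bit on sched_entity plus a counter on the root cfs_rq:

/* Does @p nest under a group with cpu.idle == 1 at the root level? */
static bool task_nests_under_root_idle(struct task_struct *p)
{
	struct sched_entity *se = &p->se, *top = NULL;

	/* Walk up to the entity enqueued directly on the root cfs_rq. */
	for_each_sched_entity(se)
		top = se;

	/* Tasks directly on the root cfs_rq own no group cfs_rq. */
	return top->my_q && cfs_rq_is_idle(top->my_q);
}

/* At enqueue (dequeue does the reverse): cache the result on the se
 * and keep a count on the root cfs_rq. */
static void account_root_idle(struct rq *rq, struct task_struct *p)
{
	p->se.idle_nested = task_nests_under_root_idle(p);
	if (p->se.idle_nested)
		rq->cfs.nr_idle_queued++;
}

/* Lockless wakeup-side check: true iff every runnable fair task on
 * this cpu nests under a root-level sched_idle group. */
static inline bool sched_root_idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned int nr = rq->cfs.h_nr_running; /* h_nr_queued on newer trees */

	return nr && nr == READ_ONCE(rq->cfs.nr_idle_queued);
}

select_task_rq_fair() could then treat sched_root_idle_cpu() cpus the
way it treats sched_idle_cpu() cpus today, without misjudging nested
cases like besteffort above.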
Best,
Josh