Message-ID: <CABk29NsqoF3U9nECBxh2cDWoPn=7cX+0sDfnpysNRb9HUcRyHg@mail.gmail.com>
Date: Thu, 21 Aug 2025 11:13:48 -0700
From: Josh Don <joshdon@...gle.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: Christian Loehle <christian.loehle@....com>, "Chen, Yu C" <yu.c.chen@...el.com>,
linux-kernel@...r.kernel.org, mingo@...hat.com, bsegall@...gle.com,
vschneid@...hat.com, juri.lelli@...hat.com, rostedt@...dmis.org,
mgorman@...e.de, dietmar.eggemann@....com, vincent.guittot@...aro.org,
peterz@...radead.org, viresh.kumar@...aro.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()
On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!
Thanks, missed this previously :)
>
> On 2025/8/20 21:53, Christian Loehle wrote:
> > On 8/19/25 16:32, Chen, Yu C wrote:
> >> On 8/18/2025 9:24 PM, Christian Loehle wrote:
> >>> On 8/18/25 13:47, Chengming Zhou wrote:
> >>>> These sched_idle_cpu() checks in select_task_rq_fair() are based
> >>>> on the assumption that the wakee task can pick a cpu running a
> >>>> sched_idle task and preempt it to start running faster than it
> >>>> could by picking an idle cpu and preempting the idle task.
> >>>>
> >>>> This assumption is correct, but it also brings some problems:
> >>>>
> >>>> 1. work conservation: sched_idle tasks often also pick a cpu that
> >>>> is already running a sched_idle task, instead of utilizing a truly
> >>>> idle cpu, so work conservation is somewhat broken.
> >>>>
> >>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
> >>>> sched_idle group is running. See the simple example below.
> >>>>
> >>>>                  root
> >>>>                 /    \
> >>>>           kubepods    system
> >>>>           /      \
> >>>>    burstable    besteffort
> >>>>                 (cpu.idle == 1)
Thanks for bringing attention to this scenario; it's a case I've
worried about but haven't had a good idea for fixing. Ideally we could
use find_matching_se() to compare the wakee and the running task at
their common ancestor level, but we want to do these checks locklessly
and quickly, so that's out of the question. Agree on it being a hard
problem.
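
For reference, the existing check is flat and per-rq; from memory it
is roughly the below (paraphrased, not copied from any particular
tree, and some of these fields have been renamed over time):

static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}

A flat count like this can't express that, in the example above,
besteffort is idle only relative to burstable under kubepods; at the
root level, kubepods and system still compete normally.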
One idea is to at least handle the (I think fairly typical) scenario
of a root-level sched_idle group well: a root-level sched_idle group
is trivially idle with respect to anything else in the system that is
not also nested under a root-level sched_idle group. It would be
fairly easy to track a nr_idle_queued field on the cfs_rq, as well as
to cache on task enqueue whether the task nests under a root-level
sched_idle group.
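
Untested sketch of what I mean; other than nr_idle_queued, the field
and helper names are made up for illustration, and it assumes a new
bit on sched_entity plus a counter on the root cfs_rq:

/* Does @p nest under a group with cpu.idle == 1 at the root level? */
static bool task_nests_under_root_idle(struct task_struct *p)
{
	struct sched_entity *se = &p->se, *top = NULL;

	/* Walk up to the entity enqueued directly on the root cfs_rq. */
	for_each_sched_entity(se)
		top = se;

	/* Tasks directly on the root cfs_rq own no group cfs_rq. */
	return top->my_q && cfs_rq_is_idle(top->my_q);
}

/* At enqueue (dequeue does the reverse): cache the result on the se
 * and keep a count on the root cfs_rq. */
static void account_root_idle(struct rq *rq, struct task_struct *p)
{
	p->se.idle_nested = task_nests_under_root_idle(p);
	if (p->se.idle_nested)
		rq->cfs.nr_idle_queued++;
}

/* Lockless wakeup-side check: true iff every runnable fair task on
 * this cpu nests under a root-level sched_idle group. */
static inline bool sched_root_idle_cpu(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned int nr = rq->cfs.h_nr_running; /* h_nr_queued on newer trees */

	return nr && nr == READ_ONCE(rq->cfs.nr_idle_queued);
}

select_task_rq_fair() could then treat sched_root_idle_cpu() cpus the
way it treats sched_idle_cpu() cpus today, without misjudging nested
cases like besteffort above.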
Best,
Josh