Message-ID: <CABk29Ntu5ywHeoJqkP0t85V9zWr2wtoy4ijf6wTFDkBdp4pAHw@mail.gmail.com>
Date: Tue, 26 Aug 2025 11:50:51 -0700
From: Josh Don <joshdon@...gle.com>
To: Chengming Zhou <chengming.zhou@...ux.dev>
Cc: Christian Loehle <christian.loehle@....com>, "Chen, Yu C" <yu.c.chen@...el.com>, 
	linux-kernel@...r.kernel.org, mingo@...hat.com, bsegall@...gle.com, 
	vschneid@...hat.com, juri.lelli@...hat.com, rostedt@...dmis.org, 
	mgorman@...e.de, dietmar.eggemann@....com, vincent.guittot@...aro.org, 
	peterz@...radead.org, viresh.kumar@...aro.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in select_task_rq_fair()

On Sun, Aug 24, 2025 at 11:58 PM Chengming Zhou
<chengming.zhou@...ux.dev> wrote:
>
> On 2025/8/22 02:13, Josh Don wrote:
> > On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
> >>
> >> +cc Josh and Viresh, I forgot to cc you, sorry!
> >
> > Thanks, missed this previously :)
> >
> >>
> >> On 2025/8/20 21:53, Christian Loehle wrote:
> >>> On 8/19/25 16:32, Chen, Yu C wrote:
> >>>> On 8/18/2025 9:24 PM, Christian Loehle wrote:
> >>>>> On 8/18/25 13:47, Chengming Zhou wrote:
> > >>>>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
> > >>>>>> on the assumption that the wakee task can pick a cpu running a sched_idle
> > >>>>>> task and preempt it to run, faster than picking an idle cpu to preempt
> > >>>>>> the idle task.
> >>>>>>
> >>>>>> This assumption is correct, but it also brings some problems:
> >>>>>>
> >>>>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
> >>>>>> which is already running sched_idle task, instead of utilizing a real
> >>>>>> idle cpu, so work conservation is somewhat broken.
> >>>>>>
> > >>>>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
> > >>>>>> sched_idle group is running. Consider the simple example below.
> >>>>>>
> > >>>>>>                 root
> > >>>>>>             /        \
> > >>>>>>         kubepods    system
> > >>>>>>         /      \
> > >>>>>> burstable    besteffort
> > >>>>>>              (cpu.idle == 1)
> >
> > Thanks for bringing attention to this scenario, it's been a case I've
> > worried about but haven't had a good idea about fixing. Ideally we
> > could find_matching_se(), but we want to do these checks locklessly
> > and quickly, so that's out of the question. Agree on it being a hard
> > problem.
>
> Yeah, right, we don't want to use find_matching_se() here.
>
> >
> > One idea is that we at least handle the (what I think is fairly
> > typical) scenario of a root-level sched_idle group well (a root level
>
> You mean the /kubepods and /system groups in this case, right? Neither
> of them is sched_idle here.

Correct

> > sched_idle group is trivially idle with respect to anything else in
> > the system that is not also nested under a root-level sched_idle
> > group). It would be fairly easy to track a nr_idle_queued cfs_rq
> > field, as well as cache on task enqueue whether it nests under a
> > sched_idle group.
>
> OK, we can track whether a task nests under a sched_idle group. But tasks
> from /system and /kubepods/burstable are not under any sched_idle group,
> so there seems to be no way to distinguish them except by using
> find_matching_se().

nr_idle_queued on the cfs_rq seems like the way to do it, but I agree
it is a tricky problem.
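
To make the root-level idea concrete, here is a rough user-space sketch,
not kernel code. Only the nr_idle_queued counter name comes from this
thread; struct group, struct task, struct rq_model and all helper names
are hypothetical stand-ins chosen for illustration. It assumes the
"nests under a root-level sched_idle group" bit is computed once at
enqueue and cached on the task, so the wakeup path only compares two
counters and never walks the hierarchy of queued tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of a cgroup; not a kernel structure. */
struct group {
	const struct group *parent;	/* NULL for the root group */
	bool idle;			/* models cpu.idle == 1 */
};

/* Hypothetical task model. */
struct task {
	const struct group *grp;
	bool nests_root_idle;		/* cached at enqueue time */
};

/* Hypothetical per-CPU queue model with the counter from the thread. */
struct rq_model {
	unsigned int nr_queued;
	unsigned int nr_idle_queued;	/* tasks under a root-level idle group */
};

/* True if the top-level ancestor (direct child of root) is sched_idle. */
static bool under_root_idle_group(const struct group *g)
{
	if (!g || !g->parent)
		return false;		/* task sits directly in the root group */
	while (g->parent->parent)
		g = g->parent;
	return g->idle;
}

static void enqueue(struct rq_model *rq, struct task *t)
{
	/* Slow hierarchy walk happens here, once, not in the wakeup path. */
	t->nests_root_idle = under_root_idle_group(t->grp);
	rq->nr_queued++;
	if (t->nests_root_idle)
		rq->nr_idle_queued++;
}

static void dequeue(struct rq_model *rq, const struct task *t)
{
	assert(rq->nr_queued > 0);
	rq->nr_queued--;
	if (t->nests_root_idle) {
		assert(rq->nr_idle_queued > 0);
		rq->nr_idle_queued--;
	}
}

/* Cheap check: is everything queued here under a root-level idle group? */
static bool rq_only_root_idle(const struct rq_model *rq)
{
	return rq->nr_queued > 0 && rq->nr_queued == rq->nr_idle_queued;
}

/* A wakee outside any root-level idle group can preempt such a queue. */
static bool wakee_can_preempt(const struct rq_model *rq,
			      const struct task *wakee)
{
	return !under_root_idle_group(wakee->grp) && rq_only_root_idle(rq);
}
```

Note that in this sketch a task in /kubepods/besteffort is not counted
in nr_idle_queued, since its top-level ancestor /kubepods is not
sched_idle. That is exactly the nested case raised above, which this
approach does not solve.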

>
> Thanks!
>
> >
> > Best,
> > Josh
