Message-ID: <68e34465-ecb6-409e-800c-3dd354156bb0@linux.dev>
Date: Mon, 25 Aug 2025 14:58:35 +0800
From: Chengming Zhou <chengming.zhou@...ux.dev>
To: Josh Don <joshdon@...gle.com>
Cc: Christian Loehle <christian.loehle@....com>,
 "Chen, Yu C" <yu.c.chen@...el.com>, linux-kernel@...r.kernel.org,
 mingo@...hat.com, bsegall@...gle.com, vschneid@...hat.com,
 juri.lelli@...hat.com, rostedt@...dmis.org, mgorman@...e.de,
 dietmar.eggemann@....com, vincent.guittot@...aro.org, peterz@...radead.org,
 viresh.kumar@...aro.org
Subject: Re: [RFC PATCH] sched/fair: Remove sched_idle_cpu() usages in
 select_task_rq_fair()

On 2025/8/22 02:13, Josh Don wrote:
> On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@...ux.dev> wrote:
>>
>> +cc Josh and Viresh, I forgot to cc you, sorry!
> 
> Thanks, missed this previously :)
> 
>>
>> On 2025/8/20 21:53, Christian Loehle wrote:
>>> On 8/19/25 16:32, Chen, Yu C wrote:
>>>> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>>>>> On 8/18/25 13:47, Chengming Zhou wrote:
>>>>>> These sched_idle_cpu() considerations in select_task_rq_fair() are based
>>>>>> on an assumption that the wakee task can pick a cpu running sched_idle
>>>>>> task and preempt it to run, faster than picking an idle cpu to preempt
>>>>>> the idle task.
>>>>>>
>>>>>> This assumption is correct, but it also brings some problems:
>>>>>>
>>>>>> 1. work conservation: Often sched_idle tasks are also picking the cpu
>>>>>> which is already running sched_idle task, instead of utilizing a real
>>>>>> idle cpu, so work conservation is somewhat broken.
>>>>>>
>>>>>> 2. sched_idle group: This sched_idle_cpu() is just not correct when a
>>>>>> sched_idle group is running. Look at the simple example below.
>>>>>>
>>>>>>               root
>>>>>>              /    \
>>>>>>       kubepods    system
>>>>>>       /      \
>>>>>> burstable    besteffort
>>>>>>              (cpu.idle == 1)
> 
> Thanks for bringing attention to this scenario, it's been a case I've
> worried about but haven't had a good idea about fixing. Ideally we
> could find_matching_se(), but we want to do these checks locklessly
> and quickly, so that's out of the question. Agree on it being a hard
> problem.

Yeah, right, we don't want to use find_matching_se() here.

> 
> One idea is that we at least handle the (what I think is fairly
> typical) scenario of a root-level sched_idle group well (a root level

You mean the /kubepods and /system groups in this case, right? Neither
of them is sched_idle here.

> sched_idle group is trivially idle with respect to anything else in
> the system that is not also nested under a root-level sched_idle
> group). It would be fairly easy to track a nr_idle_queued cfs_rq
> field, as well as cache on task enqueue whether it nests under a
> sched_idle group.

OK, we can track whether a task nests under a sched_idle group. But tasks
from /system and /kubepods/burstable are both not under any sched_idle
group, so there seems to be no way to distinguish them except by using
find_matching_se().

Thanks!

> 
> Best,
> Josh
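
The point about the cached flag can be illustrated with a small user-space
sketch. This is a toy model under stated assumptions, not kernel code:
struct group, nests_under_idle_group() and idle_relative_to() are
hypothetical names invented for illustration, and idle_relative_to() only
imitates the idea of walking both entities up to a common parent, which is
what find_matching_se() is used for. It models the example hierarchy from
the thread, with besteffort as the only group that has cpu.idle == 1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a cgroup node. Hypothetical, not the kernel's task_group. */
struct group {
	const char *name;
	const struct group *parent;	/* NULL for the root group */
	bool idle;			/* models cpu.idle == 1 */
	int depth;			/* distance from the root */
};

/* The hierarchy from the example in the thread. */
static const struct group root       = { "root",       NULL,      false, 0 };
static const struct group kubepods   = { "kubepods",   &root,     false, 1 };
static const struct group system_grp = { "system",     &root,     false, 1 };
static const struct group burstable  = { "burstable",  &kubepods, false, 2 };
static const struct group besteffort = { "besteffort", &kubepods, true,  2 };

/*
 * The per-task flag that could be cached at enqueue time: does the task's
 * group, or any ancestor of it, have cpu.idle set?
 */
static bool nests_under_idle_group(const struct group *g)
{
	assert(g != NULL);
	for (; g != NULL; g = g->parent)
		if (g->idle)
			return true;
	return false;
}

/*
 * Pairwise check: walk both sides up until they meet at a common ancestor
 * and compare the idle state of the two entities directly below it.
 * Assumes both groups belong to the same tree and that tasks themselves
 * are not SCHED_IDLE.
 */
static bool idle_relative_to(const struct group *a, const struct group *b)
{
	bool a_idle = false, b_idle = false;

	assert(a != NULL && b != NULL);
	while (a != b) {
		if (a->depth >= b->depth) {
			a_idle = a->idle;
			a = a->parent;
		} else {
			b_idle = b->idle;
			b = b->parent;
		}
	}
	return a_idle && !b_idle;
}
```

In this model the cached flag is false for both /system and
/kubepods/burstable, yet idle_relative_to(&besteffort, &burstable) is true
while idle_relative_to(&besteffort, &system_grp) is false, because at the
root level the comparison is between kubepods and system, and neither is
idle. Only the pairwise walk tells the two wakers apart.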
