Message-ID: <fb530c13-9ff6-46bd-b9fd-6e9a8ddd66c1@amd.com>
Date: Fri, 26 Sep 2025 10:02:53 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Matteo Martelli <matteo.martelli@...ethink.co.uk>, Aaron Lu
	<ziqianlu@...edance.com>
CC: Valentin Schneider <vschneid@...hat.com>, Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>, Chengming Zhou
	<chengming.zhou@...ux.dev>, Josh Don <joshdon@...gle.com>, Ingo Molnar
	<mingo@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>, Xi Wang
	<xii@...gle.com>, <linux-kernel@...r.kernel.org>, Juri Lelli
	<juri.lelli@...hat.com>, Dietmar Eggemann <dietmar.eggemann@....com>, "Steven
 Rostedt" <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>, Chuyi Zhou
	<zhouchuyi@...edance.com>, Jan Kiszka <jan.kiszka@...mens.com>, "Florian
 Bezdeka" <florian.bezdeka@...mens.com>, Songtang Liu
	<liusongtang@...edance.com>, Chen Yu <yu.c.chen@...el.com>,
	Michal Koutný <mkoutny@...e.com>, Sebastian Andrzej Siewior
	<bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

Hello Aaron, Matteo,

On 9/25/2025 7:03 PM, Matteo Martelli wrote:
> Hi Aaron,
> 
> On Thu, 25 Sep 2025 20:05:04 +0800, Aaron Lu <ziqianlu@...edance.com> wrote:
>> On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>>>
>>> On 9/25/2025 2:59 PM, Aaron Lu wrote:
>>>> Hi Prateek,
>>>>
>>>> On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
>>>>> Hello Aaron, Matteo,
>>>>>
>>>>> On 9/24/2025 5:03 PM, Aaron Lu wrote:
>>>>>>> ...
>>>>>>> The test setup is the same as used in my previous testing for v3 [2], where
>>>>>>> the CFS throttling events are mostly triggered by the first ssh logins
>>>>>>> into the system as the systemd user slice is configured with CPUQuota of
>>>>>>> 25%. Also note that the same systemd user slice is configured with CPU
>>>>>>
>>>>>> I tried to replicate this setup; below is my configuration, using
>>>>>> a 4-CPU VM and an RT kernel at commit fe8d238e646e ("sched/fair:
>>>>>> Propagate load for throttled cfs_rq"):
>>>>>> # pwd
>>>>>> /sys/fs/cgroup/user.slice
>>>>>> # cat cpu.max
>>>>>> 25000 100000
>>>>>> # cat cpuset.cpus
>>>>>> 0
>>>>>>
>>>>>> I then logged in via ssh as a normal user and could see that
>>>>>> throttling happened, but I couldn't hit this warning. Do you have
>>>>>> to do something special to trigger it?
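
(Side note for anyone replicating the slice configuration quoted above:
on a systemd-managed cgroup v2 system it can presumably be applied
without writing the cgroup files directly. A minimal, hypothetical
sketch, assuming a recent systemd that supports both properties:)

# hypothetical sketch; property names assume a recent systemd
systemctl set-property user.slice CPUQuota=25% AllowedCPUs=0
# this should then be reflected in the files shown above:
#   /sys/fs/cgroup/user.slice/cpu.max      -> 25000 100000
#   /sys/fs/cgroup/user.slice/cpuset.cpus  -> 0
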
> 
> It wasn't very reproducible in my setup either, but I found out that
> the warning was triggered more often when I tried to ssh into the
> system just after boot, probably due to some additional load from
> processes spawned during the boot phase. Therefore I prepared a
> reproducer script that resembles my initial setup, plus a stress-ng
> worker running in the background while connecting to the system with
> ssh. I also reduced the CPUQuota to 10%, which seemed to increase the
> probability of triggering the warning. With this script I can
> reproduce the warning about once or twice every 10 ssh executions.
> See the script at the end of this email.
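
(For readers without the script handy, I'd guess the core of such a
reproducer looks roughly like the sketch below; the stress-ng
invocation, quota value and ssh target are illustrative placeholders,
not the actual script.)

# hypothetical reproducer sketch, not the actual script
systemctl set-property user.slice CPUQuota=10%
stress-ng --cpu 1 --timeout 600 &       # background CPU load
for i in $(seq 1 10); do
        ssh testuser@localhost true     # each login lands in user.slice
done
dmesg | grep -c 'enqueue_task_fair'     # count triggered warnings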

I have a similar setup with a bunch of hackbench instances running in
cgroups with bandwidth limits set, and I keep creating/removing cgroups
on this hierarchy and keep moving some tasks between them.
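
A very rough, hypothetical sketch of that churn (cgroup names, quota
values and hackbench arguments are illustrative, not my exact scripts;
run as root on a cgroup v2 mount):

# illustrative only
cd /sys/fs/cgroup
echo "+cpu" > cgroup.subtree_control
mkdir -p A B
echo "25000 100000" > A/cpu.max
echo "25000 100000" > B/cpu.max
hackbench -g 4 -l 100000 &
while kill -0 $! 2>/dev/null; do
        # bounce the hackbench processes between the two groups
        for p in $(pgrep hackbench); do
                echo $p > A/cgroup.procs 2>/dev/null
        done
        sleep 1
        for p in $(pgrep hackbench); do
                echo $p > B/cgroup.procs 2>/dev/null
        done
        # create/remove a cgroup to churn the hierarchy a bit
        mkdir -p A/child; rmdir A/child
        sleep 1
done
rmdir A B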

> 
>>>>>>> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
>>>>>>
>>>>>> I stared at the code and haven't been able to figure out when
>>>>>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
>>>>>>
> 
>>>>>
>>>>> Yeah neither could I. I tried running with PREEMPT_RT too and still
>>>>> couldn't trigger it :(
>>>>>
>>>>> But I'm wondering if all we are missing is:
>>>>>
>>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>>> index f993de30e146..5f9e7b4df391 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
>>>>>  
>>>>>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
>>>>>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
>>>>> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
>>>>>  }
>>>>>  
>>>>>  /* conditionally throttle active cfs_rq's from put_prev_entity() */
>>>>> ---
>>>>>
>>>>> This is the only way we can currently have a break in the
>>>>> cfs_rq_pelt_clock_throttled() hierarchy.
>>>>>
>> ...
>>
>> Hi Matteo,
>>
>> Can you test the above diff Prateek sent in his last email? Thanks.
>>
> 
> Using the same script below, I have just tested the diff sent by Prateek
> in [1] (also quoted above) that changes sync_throttle(), and I couldn't
> reproduce the warning.

Thank you both for testing the diff and providing the setup! I'll post a
formal patch soon on the thread.

-- 
Thanks and Regards,
Prateek

