linux-kernel - Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs

Message-ID: <20250925120504.GC120@bytedance>
Date: Thu, 25 Sep 2025 20:05:04 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Matteo Martelli <matteo.martelli@...ethink.co.uk>,
	Valentin Schneider <vschneid@...hat.com>,
	Ben Segall <bsegall@...gle.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Chengming Zhou <chengming.zhou@...ux.dev>,
	Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
	Juri Lelli <juri.lelli@...hat.com>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
	Chuyi Zhou <zhouchuyi@...edance.com>,
	Jan Kiszka <jan.kiszka@...mens.com>,
	Florian Bezdeka <florian.bezdeka@...mens.com>,
	Songtang Liu <liusongtang@...edance.com>,
	Chen Yu <yu.c.chen@...el.com>,
	Michal Koutný <mkoutny@...e.com>,
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq

On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
> 
> 
> On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > Hi Prateek,
> > 
> > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron, Matteo,
> >>
> >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >>>> [   18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >>>
> >>> I stared at the code and haven't been able to figure out when
> >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> >>
> >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> >> couldn't trigger it :(
> >>
> >> But I'm wondering if all we are missing is:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f993de30e146..5f9e7b4df391 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >>  
> >>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
> >>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> >> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> >>  }
> >>  
> >>  /* conditionally throttle active cfs_rq's from put_prev_entity() */
> >> ---
> >>
> >> This is the only way we can currently have a break in
> >> cfs_rq_pelt_clock_throttled() hierarchy.
> >>
> > 
> > Great finding! Yes, that is missed.
> > 
> > According to this info, I'm able to trigger the assert in
> > enqueue_task_fair(). The stack is different from Matteo's: his stack is
> > from ttwu path while mine is from exit. Anyway, let me do more analysis
> > and get back to you:
> > 
> > [   67.041905] ------------[ cut here ]------------
> > [   67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> > [   67.043227] Modules linked in:
> > [   67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G        W           6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> > [   67.044694] Tainted: [W]=WARN
> > [   67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [   67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> > [   67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> > [   67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> > [   67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> > [   67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> > [   67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> > [   67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> > [   67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> > [   67.052083] FS:  0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> > [   67.052855] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> > [   67.054083] Call Trace:
> > [   67.054334]  <TASK>
> > [   67.054546]  enqueue_task+0x35/0xd0
> > [   67.054885]  sched_move_task+0x291/0x370
> > [   67.055268]  ? kmem_cache_free+0x2d9/0x480
> > [   67.055669]  do_exit+0x204/0x4f0
> > [   67.055984]  ? lock_release+0x10a/0x170
> > [   67.056356]  do_group_exit+0x36/0xa0
> > [   67.056714]  __x64_sys_exit_group+0x18/0x20
> > [   67.057121]  x64_sys_call+0x14fa/0x1720
> > [   67.057502]  do_syscall_64+0x6a/0x2d0
> > [   67.057865]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Great! I'll try stressing this path too.

I now see other paths leading to enqueue_task_fair() as well, so I
think this is the same problem Matteo saw.

> P.S. Are you seeing this with sync_throttle() fix too?

Nope, your finding fixed it for me :)

I added some trace prints, but there were so many traces that the
critical ones kept getting lost.

Anyway, I think I've figured out how it happened: during
online_fair_sched_group() -> sync_throttle(), the newly onlined cfs_rq
doesn't get pelt_clock_throttled synced from its parent. Suppose the
parent's pelt clock is throttled: then in propagate_entity_cfs_rq(),
the newly onlined cfs_rq is added to the leaf list but its parent is
not. Now rq->tmp_alone_branch points to the newly onlined cfs_rq,
waiting for its parent to be added (which doesn't happen).

Then another task wakes up and gets enqueued on the same CPU. All its
ancestor cfs_rqs are already on the list, so list_add_leaf_cfs_rq()
doesn't touch rq->tmp_alone_branch, and the assert fires at the end of
the enqueue.
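
The sequence above can be sketched as a small userspace model. This is
only an illustration, not the kernel code: the structs and every
model_* name are made up for the example, and a NULL tmp_alone_branch
stands in for "rq->tmp_alone_branch points at the rq's own list head".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's cfs_rq and rq. */
struct model_cfs_rq {
	struct model_cfs_rq *parent;	/* NULL for the root cfs_rq */
	bool on_list;			/* on the leaf cfs_rq list? */
	bool pelt_clock_throttled;
};

struct model_rq {
	/* NULL means no branch is waiting for its parent. */
	struct model_cfs_rq *tmp_alone_branch;
};

/* Simplified list_add_leaf_cfs_rq(): only tracks the dangling branch. */
static void model_list_add_leaf(struct model_rq *rq, struct model_cfs_rq *c)
{
	if (c->on_list)
		return;			/* tmp_alone_branch is left untouched */
	c->on_list = true;
	if (!c->parent || c->parent->on_list)
		rq->tmp_alone_branch = NULL;	/* branch is connected */
	else
		rq->tmp_alone_branch = c;	/* waiting for the parent */
}

/* Walk up the hierarchy, skipping cfs_rqs whose pelt clock is throttled. */
static void model_propagate(struct model_rq *rq, struct model_cfs_rq *c)
{
	for (; c; c = c->parent)
		if (!c->pelt_clock_throttled)
			model_list_add_leaf(rq, c);
}

/* with_fix models the one-line sync_throttle() change quoted above. */
static void model_sync_throttle(struct model_cfs_rq *c, bool with_fix)
{
	if (with_fix && c->parent)
		c->pelt_clock_throttled = c->parent->pelt_clock_throttled;
}

/* Online a child under a pelt-clock-throttled parent; true if list is sane. */
static bool model_online_child(bool with_fix)
{
	struct model_rq rq = { NULL };
	struct model_cfs_rq root = { NULL, true, false };
	struct model_cfs_rq parent = { &root, false, true };
	struct model_cfs_rq child = { &parent, false, false };

	assert(root.on_list && parent.pelt_clock_throttled);
	model_sync_throttle(&child, with_fix);
	model_propagate(&rq, &child);
	return rq.tmp_alone_branch == NULL;
}
```

In this model, model_online_child(false) leaves tmp_alone_branch
pointing at the child, which is the state a later enqueue would trip
over, while model_online_child(true) never puts the child on the list.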

I'm thinking we should add an assert_list_leaf_cfs_rq() at the end of
propagate_entity_cfs_rq() to catch other potential problems.

Hi Matteo,

Can you test the above diff Prateek sent in his last email? Thanks.