Message-ID: <20250925120504.GC120@bytedance>
Date: Thu, 25 Sep 2025 20:05:04 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Matteo Martelli <matteo.martelli@...ethink.co.uk>,
Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Chen Yu <yu.c.chen@...el.com>,
Michal Koutný <mkoutny@...e.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH 1/4] sched/fair: Propagate load for throttled cfs_rq
On Thu, Sep 25, 2025 at 04:52:25PM +0530, K Prateek Nayak wrote:
>
>
> On 9/25/2025 2:59 PM, Aaron Lu wrote:
> > Hi Prateek,
> >
> > On Thu, Sep 25, 2025 at 01:47:35PM +0530, K Prateek Nayak wrote:
> >> Hello Aaron, Matteo,
> >>
> >> On 9/24/2025 5:03 PM, Aaron Lu wrote:
> >>>> [ 18.421350] WARNING: CPU: 0 PID: 1 at kernel/sched/fair.c:400 enqueue_task_fair+0x925/0x980
> >>>
> >>> I stared at the code and haven't been able to figure out when
> >>> enqueue_task_fair() would end up with a broken leaf cfs_rq list.
> >>
> >> Yeah neither could I. I tried running with PREEMPT_RT too and still
> >> couldn't trigger it :(
> >>
> >> But I'm wondering if all we are missing is:
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index f993de30e146..5f9e7b4df391 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -6435,6 +6435,7 @@ static void sync_throttle(struct task_group *tg, int cpu)
> >>
> >>  	cfs_rq->throttle_count = pcfs_rq->throttle_count;
> >>  	cfs_rq->throttled_clock_pelt = rq_clock_pelt(cpu_rq(cpu));
> >> +	cfs_rq->pelt_clock_throttled = pcfs_rq->pelt_clock_throttled;
> >> }
> >>
> >> /* conditionally throttle active cfs_rq's from put_prev_entity() */
> >> ---
> >>
> >> This is the only way we can currently have a break in
> >> cfs_rq_pelt_clock_throttled() hierarchy.
> >>
> >
> > Great finding! Yes, that is missed.
> >
> > According to this info, I'm able to trigger the assert in
> > enqueue_task_fair(). The stack is different from Matteo's: his stack is
> > from ttwu path while mine is from exit. Anyway, let me do more analysis
> > and get back to you:
> >
> > [ 67.041905] ------------[ cut here ]------------
> > [ 67.042387] WARNING: CPU: 2 PID: 11582 at kernel/sched/fair.c:401 enqueue_task_fair+0x6db/0x720
> > [ 67.043227] Modules linked in:
> > [ 67.043537] CPU: 2 UID: 0 PID: 11582 Comm: sudo Tainted: G W 6.17.0-rc4-00010-gfe8d238e646e-dirty #72 PREEMPT(voluntary)
> > [ 67.044694] Tainted: [W]=WARN
> > [ 67.044997] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
> > [ 67.045910] RIP: 0010:enqueue_task_fair+0x6db/0x720
> > [ 67.046383] Code: 00 48 c7 c7 96 b2 60 82 c6 05 af 64 2e 05 01 e8 fb 12 03 00 8b 4c 24 04 e9 f8 fc ff ff 4c 89 ef e8 ea a2 ff ff e9 ad fa ff ff <0f> 0b e9 5d fc ff ff 49 8b b4 24 08 0b 00 00 4c 89 e7 e8 de 31 01
> > [ 67.048155] RSP: 0018:ffa000002d2a7dc0 EFLAGS: 00010087
> > [ 67.048655] RAX: ff11000ff05fd2e8 RBX: 0000000000000000 RCX: 0000000000000004
> > [ 67.049339] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ff11000ff05fd1f0
> > [ 67.050036] RBP: 0000000000000001 R08: 0000000000000000 R09: ff11000ff05fc908
> > [ 67.050731] R10: 0000000000000000 R11: 00000000fa83b2da R12: ff11000ff05fc800
> > [ 67.051402] R13: 0000000000000000 R14: 00000000002ab980 R15: ff11000ff05fc8c0
> > [ 67.052083] FS: 0000000000000000(0000) GS:ff110010696a6000(0000) knlGS:0000000000000000
> > [ 67.052855] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 67.053404] CR2: 00007f67f8b96168 CR3: 0000000002c3c006 CR4: 0000000000371ef0
> > [ 67.054083] Call Trace:
> > [ 67.054334] <TASK>
> > [ 67.054546] enqueue_task+0x35/0xd0
> > [ 67.054885] sched_move_task+0x291/0x370
> > [ 67.055268] ? kmem_cache_free+0x2d9/0x480
> > [ 67.055669] do_exit+0x204/0x4f0
> > [ 67.055984] ? lock_release+0x10a/0x170
> > [ 67.056356] do_group_exit+0x36/0xa0
> > [ 67.056714] __x64_sys_exit_group+0x18/0x20
> > [ 67.057121] x64_sys_call+0x14fa/0x1720
> > [ 67.057502] do_syscall_64+0x6a/0x2d0
> > [ 67.057865] entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> Great! I'll try stressing this path too.
I now see other paths leading to enqueue_task_fair() as well, so I
think this is the same problem Matteo saw.
> P.S. Are you seeing this with sync_throttle() fix too?
Nope, your finding fixed it for me :)
I added some trace prints, but with so many events being traced, the
critical ones keep getting lost.
Anyway, I think I've figured out how it happened: during
online_fair_sched_group() -> sync_throttle(), the newly onlined cfs_rq
didn't get pelt_clock_throttled synced from its parent. Suppose the
parent's pelt clock is throttled; then in propagate_entity_cfs_rq(), the
newly onlined cfs_rq is added to the leaf list but its parent is not. Now
rq->tmp_alone_branch points to this newly onlined cfs_rq, waiting for
its parent to be added (which never happens).
Then another task wakes up and gets enqueued on the same CPU. All its
ancestor cfs_rqs are already on the list, so list_add_leaf_cfs_rq()
doesn't touch rq->tmp_alone_branch, and the assert fires at the end of
the enqueue.
I'm thinking we should add an assert_list_leaf_cfs_rq() at the end of
propagate_entity_cfs_rq() to catch other potential problems.
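To make the sequence above concrete, here is a toy model in plain C. It is a sketch under stated assumptions, not kernel code: the toy_* names are hypothetical, the structs carry only the fields the argument needs, NULL stands in for "rq->tmp_alone_branch points at the list head", and the propagate step is reduced to "walk up and add every level whose pelt clock is not throttled".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures (hypothetical, heavily reduced). */
struct toy_cfs_rq {
	struct toy_cfs_rq *parent;
	bool on_list;               /* on the rq's leaf cfs_rq list? */
	bool pelt_clock_throttled;
};

struct toy_rq {
	/* NULL models "points at the list head", i.e. no half-built branch. */
	struct toy_cfs_rq *tmp_alone_branch;
};

/* Reduced list_add_leaf_cfs_rq(): only the tmp_alone_branch cursor is tracked. */
static void toy_list_add_leaf(struct toy_rq *rq, struct toy_cfs_rq *cfs_rq)
{
	assert(rq && cfs_rq);
	if (cfs_rq->on_list)
		return;                          /* already listed, cursor untouched */
	cfs_rq->on_list = true;
	if (!cfs_rq->parent || cfs_rq->parent->on_list)
		rq->tmp_alone_branch = NULL;     /* branch is connected to the tree */
	else
		rq->tmp_alone_branch = cfs_rq;   /* still waiting for the parent */
}

/* Reduced propagate step: walk up, skipping pelt-clock-throttled levels. */
static void toy_propagate(struct toy_rq *rq, struct toy_cfs_rq *cfs_rq)
{
	for (; cfs_rq; cfs_rq = cfs_rq->parent)
		if (!cfs_rq->pelt_clock_throttled)
			toy_list_add_leaf(rq, cfs_rq);
}

/* The condition assert_list_leaf_cfs_rq() cares about: no dangling branch. */
static bool toy_leaf_list_ok(const struct toy_rq *rq)
{
	return rq->tmp_alone_branch == NULL;
}

/*
 * Online a child under a parent whose pelt clock is throttled, then
 * propagate from the child. sync_flag selects whether onlining copies
 * pelt_clock_throttled from the parent (the one-line change in the diff).
 * Returns true if the leaf list bookkeeping is still consistent.
 */
static bool toy_online_and_propagate(bool sync_flag)
{
	struct toy_cfs_rq root = { .parent = NULL, .on_list = true };
	struct toy_cfs_rq parent = { .parent = &root, .pelt_clock_throttled = true };
	struct toy_cfs_rq child = { .parent = &parent };
	struct toy_rq rq = { .tmp_alone_branch = NULL };

	if (sync_flag)
		child.pelt_clock_throttled = parent.pelt_clock_throttled;

	toy_propagate(&rq, &child);

	/* Models a later enqueue whose ancestors are already listed: no change. */
	toy_list_add_leaf(&rq, &root);

	return toy_leaf_list_ok(&rq);
}
```

In this model toy_online_and_propagate(false) returns false (the cursor is left pointing at the child) and toy_online_and_propagate(true) returns true. The inconsistency is already visible right after toy_propagate() returns, which is where an extra assert at the end of propagate_entity_cfs_rq() would catch it.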
Hi Matteo,
Can you test the above diff Prateek sent in his last email? Thanks.