Message-ID: <83bb46158288dfb314fdf07918b074ae@codethink.co.uk>
Date: Fri, 26 Sep 2025 16:48:57 +0200
From: Matteo Martelli <matteo.martelli@...ethink.co.uk>
To: K Prateek Nayak <kprateek.nayak@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>, Aaron Lu
<ziqianlu@...edance.com>, linux-kernel@...r.kernel.org
Cc: Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt
<rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel Gorman
<mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>, K Prateek Nayak
<kprateek.nayak@....com>, Matteo Martelli <matteo.martelli@...ethink.co.uk>
Subject: Re: [PATCH] sched/fair: Start a cfs_rq on throttled hierarchy with
PELT clock throttled

Hi Prateek,

On Fri, 26 Sep 2025 08:19:17 +0000, K Prateek Nayak <kprateek.nayak@....com> wrote:
> Matteo reported hitting the assert_list_leaf_cfs_rq() warning from
> enqueue_task_fair() post commit fe8d238e646e ("sched/fair: Propagate
> load for throttled cfs_rq") which transitioned to using
> cfs_rq_pelt_clock_throttled() check for leaf cfs_rq insertions in
> propagate_entity_cfs_rq().
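
[ For context: after that commit, the per-level walk in
  propagate_entity_cfs_rq() skips the leaf list insertion whenever a
  cfs_rq's PELT clock is frozen. Paraphrasing the relevant loop from
  kernel/sched/fair.c (a sketch, not the verbatim source):

    for_each_sched_entity(se) {
        struct cfs_rq *cfs_rq = cfs_rq_of(se);

        update_load_avg(cfs_rq, se, UPDATE_TG);

        /* leaf list insertion is skipped while PELT is frozen */
        if (!cfs_rq_pelt_clock_throttled(cfs_rq))
            list_add_leaf_cfs_rq(cfs_rq);
    }
  ]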
>
> The "cfs_rq->pelt_clock_throttled" flag is used to indicate if the
> hierarchy has its PELT frozen. If a cfs_rq's PELT is marked frozen, all
> its descendants should have their PELT frozen too or weird things can
> happen as a result of children accumulating PELT signals when the
> parents have their PELT clock stopped.
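
[ Put differently, the invariant is "frozen parent implies frozen
  children". A hypothetical debug check, made up here purely for
  illustration (the helper name and the explicit parent lookup are not
  from the kernel), could read:

    static void check_pelt_freeze_invariant(struct cfs_rq *cfs_rq)
    {
        struct task_group *ptg = cfs_rq->tg->parent;
        struct cfs_rq *pcfs_rq;

        if (!ptg)
            return;

        pcfs_rq = ptg->cfs_rq[cpu_of(rq_of(cfs_rq))];

        /* parent frozen but child still ticking -> broken hierarchy */
        WARN_ON_ONCE(cfs_rq_pelt_clock_throttled(pcfs_rq) &&
                     !cfs_rq_pelt_clock_throttled(cfs_rq));
    }
  ]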
>
> Another side effect of this is the loss of integrity of the leaf cfs_rq
> list. As debugged by Aaron, consider the following hierarchy:
>
> root(#)
> / \
> A(#) B(*)
> |
> C <--- new cgroup
> |
> D <--- new cgroup
>
> # - Already on leaf cfs_rq list
> * - Throttled with PELT frozen
>
> The newly created cgroups don't have their "pelt_clock_throttled" signal
> synced with cgroup B. Next, the following series of events occur:
>
> 1. online_fair_sched_group() for cgroup D will call
> propagate_entity_cfs_rq(). (Same can happen if a throttled task is
> moved to cgroup C and enqueue_task_fair() returns early.)
>
> propagate_entity_cfs_rq() adds the cfs_rq of cgroup C to
> "rq->tmp_alone_branch" since its PELT clock is not marked throttled
> and cfs_rq of cgroup B is not on the list.
>
> cfs_rq of cgroup B is skipped since its PELT is throttled.
>
> root cfs_rq already exists on the leaf cfs_rq list, leading to
> list_add_leaf_cfs_rq() returning early.
>
> The cfs_rq of cgroup C is left dangling on the
> "rq->tmp_alone_branch".
>
> 2. A new task wakes up on cgroup A. Since the whole hierarchy is already
> on the leaf cfs_rq list, list_add_leaf_cfs_rq() keeps returning early
> without any modifications to "rq->tmp_alone_branch".
>
> The final assert_list_leaf_cfs_rq() in enqueue_task_fair() sees the
> dangling reference to cgroup C's cfs_rq in "rq->tmp_alone_branch".
>
> !!! Splat !!!
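
[ The splat comes from the branch-tracking invariant: once an enqueue
  completes, "rq->tmp_alone_branch" must point back at
  "rq->leaf_cfs_rq_list". Paraphrasing the assertion in
  kernel/sched/fair.c:

    static inline void assert_list_leaf_cfs_rq(struct rq *rq)
    {
        /* no partially attached branch may be left behind */
        WARN_ON_ONCE(rq->tmp_alone_branch != &rq->leaf_cfs_rq_list);
    }
  ]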
>
> Syncing the "pelt_clock_throttled" indicator with parent cfs_rq is not
> enough since the new cfs_rq is not yet enqueued on the hierarchy. A
> dequeue on another subtree of the throttled hierarchy can freeze the PELT
> clock for the parent hierarchy without setting the indicators for this
> newly added cfs_rq which was never enqueued.
>
> Since there are no tasks on the new hierarchy, start a cfs_rq on a
> throttled hierarchy with its PELT clock throttled. The first enqueue, or
> the distribution (whichever happens first) will unfreeze the PELT clock
> and queue the cfs_rq on the leaf cfs_rq list.
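
[ As I read it, the shape of the fix is roughly the sketch below; this
  is not the actual diff, and the explicit parent lookup is spelled out
  only for illustration:

    /* on a new cfs_rq coming online under a frozen hierarchy */
    struct task_group *ptg = cfs_rq->tg->parent;

    if (ptg && cfs_rq_pelt_clock_throttled(ptg->cfs_rq[cpu_of(rq)]))
        cfs_rq->pelt_clock_throttled = 1;

  The first enqueue (or quota distribution) then unfreezes the clock
  and puts the cfs_rq back on the leaf list, as described above. ]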
>
> While at it, add an assert_list_leaf_cfs_rq() in
> propagate_entity_cfs_rq() to catch such cases in the future.
>
> Suggested-by: Aaron Lu <ziqianlu@...edance.com>
> Reported-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
> Closes: https://lore.kernel.org/lkml/58a587d694f33c2ea487c700b0d046fa@codethink.co.uk/
> Fixes: eb962f251fbb ("sched/fair: Task based throttle time accounting")
> Signed-off-by: K Prateek Nayak <kprateek.nayak@....com>
> ---
> Stress testing included running sched-messaging in a nested hierarchy
> with various quotas set, alongside a continuous loop of cgroup creation
> and deletion, as well as another loop continuously moving a busy-looping
> task between cgroups.
>
> No splats have been observed yet with this patch.
>
> Aaron, Matteo,
>
> I've not added any "Tested-by" tags since the final diff is slightly
> different from the diff shared previously. ...

I applied this patch on top of commit 45b7f780739a ("sched: Fix some
typos in include/linux/preempt.h") from the sched/core branch of the
tip tree, and tested it with exactly the same setup I described in my
previous email[1]. With the patch applied, I could not reproduce the
warning in 5 hours of testing, whereas before the patch the issue was
systematically reproducible, with the warning triggering at least once
per minute.

Tested-by: Matteo Martelli <matteo.martelli@...ethink.co.uk>
> ...

[1]: https://lore.kernel.org/all/e2e558b863c929c5019264b2ddefd4c0@codethink.co.uk/

Thanks to you and Aaron for addressing this!

Best regards,
Matteo Martelli