Message-ID: <20250929105518.GB426@bytedance>
Date: Mon, 29 Sep 2025 18:55:18 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: K Prateek Nayak <kprateek.nayak@....com>
Cc: Valentin Schneider <vschneid@...hat.com>,
Ben Segall <bsegall@...gle.com>,
Peter Zijlstra <peterz@...radead.org>,
Chengming Zhou <chengming.zhou@...ux.dev>,
Josh Don <joshdon@...gle.com>, Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Xi Wang <xii@...gle.com>, linux-kernel@...r.kernel.org,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>, Mel Gorman <mgorman@...e.de>,
Chuyi Zhou <zhouchuyi@...edance.com>,
Jan Kiszka <jan.kiszka@...mens.com>,
Florian Bezdeka <florian.bezdeka@...mens.com>,
Songtang Liu <liusongtang@...edance.com>,
Chen Yu <yu.c.chen@...el.com>,
Matteo Martelli <matteo.martelli@...ethink.co.uk>,
Michal Koutný <mkoutny@...e.com>,
Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Subject: Re: [PATCH] sched/fair: Prevent cfs_rq from being unthrottled with
zero runtime_remaining
Hi Prateek,
Thanks for taking a look and the suggestion.
On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/29/2025 1:16 PM, Aaron Lu wrote:
> > When a cfs_rq is to be throttled, its limbo list should be empty, which
> > is why there is a warn in tg_throttle_down() for a non-empty
> > cfs_rq->throttled_limbo_list.
> >
> > When running a test with the following hierarchy:
> >
> >           root
> >         /      \
> >         A*     ...
> >      /  |  \   ...
> >         B
> >        /  \
> >       C*
> >
> > where both A and C have quota settings, that warn on non-empty limbo list
> > is triggered for a cfs_rq of C, let's call it cfs_rq_c (and ignore the cpu
> > part of the cfs_rq for the sake of simpler representation).
> >
> > Debugging showed it happened like this:
> > Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
> > cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
> > equal to 0 and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
> > *multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
> > group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
> > called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
> > these throttled tasks are placed into cfs_rq_c's limbo list by
> > enqueue_throttled_task().
> >
> > Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
> > tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
> > runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
> > can't get more runtime and enters tg_throttle_down(), where the warning
> > is hit due to remaining tasks in the limbo list.
> >
> > Fix this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth()
> > immediately after enabling bandwidth and setting runtime_remaining = 0.
> > This ensures cfs_rq_c is throttled upfront and cannot enter the enqueue
> > path in an unthrottled state with no runtime.
> >
> > Also, update outdated comments in tg_throttle_down() since
> > unthrottle_cfs_rq() is no longer called with zero runtime_remaining.
> >
> > While at it, remove a redundant assignment to se in tg_throttle_down().
> >
> > Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
> > Signed-off-by: Aaron Lu <ziqianlu@...edance.com>
> > ---
> > kernel/sched/core.c | 9 ++++++++-
> > kernel/sched/fair.c | 16 +++++++---------
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 7f1e5cb94c536..421166d431fa7 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -9608,7 +9608,14 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
> > cfs_rq->runtime_enabled = runtime_enabled;
> > cfs_rq->runtime_remaining = 0;
> >
> > - if (cfs_rq->throttled)
> > + /*
> > + * Throttle cfs_rq now or it can be unthrottled with zero
> > + * runtime_remaining and get throttled on its unthrottle path.
> > + */
> > + if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
> > + throttle_cfs_rq(cfs_rq);
>
> So one downside of this is that throttle_cfs_rq() here can assign bandwidth
> to an empty cfs_rq, and a genuine enqueue later on another CPU might not
> find bandwidth, thus delaying its execution.
Agreed, assigning bandwidth to an empty cfs_rq doesn't make sense here.
>
> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> if we find cfs_rq->throttled_limbo_list to be empty?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..fd2d4dad9c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
> */
> if (throttled_hierarchy(cfs_rq) &&
> !task_current_donor(rq_of(cfs_rq), p)) {
> + if (list_empty(&cfs_rq->throttled_limbo_list))
> + check_enqueue_throttle(cfs_rq);
> list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
> return true;
> }
> ---
>
Works for me. I will follow your suggestion if there are no other comments, thanks!
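
For reference, the sequence discussed above can be sketched as a small userspace toy model. This is not kernel code and only approximates the real paths: every toy_* name, the single global runtime pool, and the three-task scenario are assumptions made for illustration. It models cfs_rq_c with zero runtime_remaining under a throttled A, and compares draining the limbo list with and without the check_enqueue_throttle() call suggested for enqueue_throttled_task().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct cfs_rq: only the state discussed in this thread. */
struct toy_cfs_rq {
	bool runtime_enabled;
	long runtime_remaining;
	bool throttled;		/* this cfs_rq itself is throttled */
	int throttle_count;	/* throttled ancestors, plus itself if throttled */
	int limbo_len;		/* length of throttled_limbo_list */
	int warnings;		/* times the non-empty-limbo warn would fire */
};

/* Hypothetical global pool; 0 means there is no runtime left to hand out. */
static long toy_pool;

/* Loosely mirrors throttle_cfs_rq(): try to get runtime, else throttle. */
static void toy_throttle(struct toy_cfs_rq *rq)
{
	if (toy_pool > 0) {
		rq->runtime_remaining += toy_pool;
		toy_pool = 0;
		return;
	}
	if (rq->limbo_len != 0)	/* the warn in tg_throttle_down() */
		rq->warnings++;
	rq->throttled = true;
	rq->throttle_count++;
}

/* Loosely mirrors check_enqueue_throttle(). */
static void toy_check_enqueue_throttle(struct toy_cfs_rq *rq)
{
	if (!rq->runtime_enabled || rq->throttled)
		return;
	if (rq->runtime_remaining <= 0)
		toy_throttle(rq);
}

/* Enqueue an already-throttled task; with_fix enables the suggested check. */
static void toy_enqueue_throttled_task(struct toy_cfs_rq *rq, bool with_fix)
{
	assert(rq->throttle_count > 0);	/* only the throttled-hierarchy case */
	if (with_fix && rq->limbo_len == 0)
		toy_check_enqueue_throttle(rq);
	rq->limbo_len++;
}

/* Ancestor A is unthrottled: drain the limbo list unless still throttled. */
static void toy_unthrottle_up(struct toy_cfs_rq *rq)
{
	if (--rq->throttle_count)
		return;
	while (rq->limbo_len > 0 && !rq->throttled) {
		rq->limbo_len--;		/* task leaves limbo ... */
		toy_check_enqueue_throttle(rq);	/* ... and is enqueued */
	}
}

/* Returns how many times the warning fired for the scenario in the patch. */
int toy_run_scenario(bool with_fix)
{
	struct toy_cfs_rq c = {
		.runtime_enabled = true,
		.runtime_remaining = 0,
		.throttle_count = 1,	/* A is throttled */
	};

	toy_pool = 0;
	for (int i = 0; i < 3; i++)
		toy_enqueue_throttled_task(&c, with_fix);
	toy_unthrottle_up(&c);
	return c.warnings;
}
```

In this model, toy_run_scenario(false) hits the warning once, because the first task drained from limbo throttles cfs_rq_c while two tasks are still on the list. toy_run_scenario(true) hits it zero times, because cfs_rq_c is throttled while its limbo list is still empty and stays throttled when A is unthrottled.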