[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <73b4bddb-335b-1f25-a203-199be546e44a@grsecurity.net>
Date: Wed, 3 Nov 2021 10:51:50 +0100
From: Mathias Krause <minipli@...ecurity.net>
To: Michal Koutný <mkoutny@...e.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Odin Ugedal <odin@...d.al>
Cc: Kevin Tanguy <kevin.tanguy@...p.ovh.com>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: task_group unthrottling and removal race (was Re: [PATCH]
sched/fair: Use rq->lock when checking cfs_rq list) presence
Hi!
Am 02.11.21 um 17:02 schrieb Michal Koutný:
> [snip]
> I have a reproducer (attached) that can hit this window quite easily
> after tuning. I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)
Thanks for the reproducer!
To provide yet another data point, Kevin (on Cc) is running into this
bug as well very reliable with a production workload, so we started
looking into this too. His crashes indicate a use-after-free of a cfs_rq
in update_blocked_averages(), much like you already diagnosed in your
initial patch description -- there are live cfs_rq's (on_list=1) in an
about to be kfree()'d task group in free_fair_sched_group().
His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object which
makes it incidentally overlap with SLUB's freelist pointer. That in
combination with SLAB_FREELIST_HARDENED's freelist pointer mangling
leads to a reliable access violation in form of a #GP which allowed us
to make the UAF fail fast.
As the real root cause cannot be seen from the crash backtrace only, we
tested a debug patch (attached) that unveiled that the real offender is
tg_unthrottle_up() getting called via sched_cfs_period_timer() via the
timer interrupt at an inconvenient time, namely when
unregister_fair_sched_group() unlinks all cfs_rq's from the dying task
group. It doesn't protect itself from getting interrupted, so if the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group in sched_offline_group(), sched_cfs_period_timer() will
execute and walk the list of task groups, trying to unthrottle cfs_rq's
and possibly re-add them to the dying task group. These will later -- in
free_fair_sched_group() -- be kfree()'ed while still being linked,
leading to the fireworks Kevin and you are seeing.
We tried the below patch which, unfortunately, doesn't fix the issue. So
there must be something else. :(
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..afee07e9faf9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
{
unsigned long flags;
- /* End participation in shares distribution: */
- unregister_fair_sched_group(tg);
-
+ /*
+ * Unlink first, to avoid walk_tg_tree_from() from finding us
+ * (via sched_cfs_period_timer()).
+ */
spin_lock_irqsave(&task_group_lock, flags);
list_del_rcu(&tg->list);
list_del_rcu(&tg->siblings);
spin_unlock_irqrestore(&task_group_lock, flags);
+
+ /* End participation in shares distribution: */
+ unregister_fair_sched_group(tg);
}
static void sched_change_group(struct task_struct *tsk, int type)
> [snip]
>
> The possibilities for the current problem:
>
> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)
>
> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
> but it's a hack IMO.
> - Proper way (tm) would be to use css_tryget_online() and css_put() when
> dealing with the list (my favorite at the moment).
That might work, as -- at least in Kevin's case -- it all gets triggered
by a dying cgroup.
> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
> free_fair_sched_group(),
That might work, too. However, the unlinking needs protection against
the timer interrupt (and other sources?) which might try to re-add
entries. Or won't that happen any more, as at lesat one RCU GP has
passed? Anyhow, the kfree() calls likely would need to become
kfree_rcu() to handle concurrent traversal of cfs_rq's.
> - <del>or use list_empty(tg->list) as indicator whether we're working
> with onlined task_group.</del> (won't work for RCU list)
>
> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
> final removal. Maybe it could be generalized for temporary removals by
> throttling (so that unthrottling could again add only non-empty
> cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].
>
> 5) <your ideas>
It should be something that can be backported to stable kernels, as this
seem to affect v5.13, too.
Thanks,
Mathias
View attachment "5.14-sched-fair-dbg.diff" of type "text/x-patch" (2591 bytes)
Powered by blists - more mailing lists