linux-kernel - Re: task_group unthrottling and removal race (was Re: [PATCH] sched/fair: Use rq->lock when checking cfs

Message-ID: <73b4bddb-335b-1f25-a203-199be546e44a@grsecurity.net>
Date:   Wed, 3 Nov 2021 10:51:50 +0100
From:   Mathias Krause <minipli@...ecurity.net>
To:     Michal Koutný <mkoutny@...e.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Odin Ugedal <odin@...d.al>
Cc:     Kevin Tanguy <kevin.tanguy@...p.ovh.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: task_group unthrottling and removal race (was Re: [PATCH]
 sched/fair: Use rq->lock when checking cfs_rq list) presence

Hi!

On 02.11.21 at 17:02, Michal Koutný wrote:
> [snip]
> I have a reproducer (attached) that can hit this window quite easily
> after tuning.  I can observe consequences of it even with a recent 5.15
> kernel. (And I also have reports from real world workloads failing due
> to a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on
> unthrottle").)

Thanks for the reproducer!

To provide yet another data point, Kevin (on Cc) is running into this
bug as well, very reliably, with a production workload, so we started
looking into this too. His crashes indicate a use-after-free of a cfs_rq
in update_blocked_averages(), much like you already diagnosed in your
initial patch description -- there are live cfs_rq's (on_list=1) in an
about to be kfree()'d task group in free_fair_sched_group().

His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object which
makes it incidentally overlap with SLUB's freelist pointer. That in
combination with SLAB_FREELIST_HARDENED's freelist pointer mangling
leads to a reliable access violation in the form of a #GP, which allowed
us to make the UAF fail fast.
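
To illustrate why the overlap turns the UAF into a #GP, here is a small
user-space sketch of the freelist pointer mangling. It assumes the scheme
SLUB used around v5.14 (pointer XOR per-cache secret XOR byte-swapped slot
address). All demo_* names and the secret constant are made up for
illustration and are not kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-cache random secret. */
#define DEMO_SECRET 0x5a17c0de9e3779b9ULL

/* Portable 64-bit byte swap, playing the role of swab() on 64-bit. */
static uint64_t demo_swab64(uint64_t v)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (v & 0xff);
		v >>= 8;
	}
	return r;
}

/*
 * Mangle a freelist pointer stored at address ptr_addr. XOR is its own
 * inverse, so applying the function twice recovers the pointer.
 */
static uint64_t demo_freelist_ptr(uint64_t ptr, uint64_t ptr_addr)
{
	return ptr ^ DEMO_SECRET ^ demo_swab64(ptr_addr);
}

/* x86-64 with 48-bit virtual addresses: bits 63..47 must all be equal. */
static int demo_is_canonical(uint64_t addr)
{
	uint64_t top = addr >> 47;

	return top == 0 || top == 0x1ffff;
}
```

If a stale reader loads 'my_q' from the freed object, it gets the mangled
value rather than a valid cfs_rq pointer. With a random secret that value
is most likely non-canonical, so dereferencing it faults right away
instead of silently touching unrelated memory.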

As the real root cause cannot be seen from the crash backtrace alone, we
tested a debug patch (attached) that revealed the real offender:
tg_unthrottle_up() getting called via sched_cfs_period_timer() from the
timer interrupt at an inconvenient time, namely when
unregister_fair_sched_group() unlinks all cfs_rq's from the dying task
group. unregister_fair_sched_group() doesn't protect itself from getting
interrupted, so if the timer interrupt triggers while it iterates over
all CPUs, or after it has finished but prior to unlinking the task group
in sched_offline_group(), sched_cfs_period_timer() will execute and walk
the list of task groups, trying to unthrottle cfs_rq's and possibly
re-add them to the dying task group. These will later -- in
free_fair_sched_group() -- be kfree()'ed while still being linked,
leading to the fireworks Kevin and you are seeing.
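
The ordering problem described above can be sketched as a single-threaded
user-space toy model. It only replays the sequence of events; it does not
model locking, RCU, or in-flight list walkers, and every demo_* name is
invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_NR_CPUS 2

/* Toy cfs_rq: only list membership and throttle state matter here. */
struct demo_cfs_rq {
	bool on_list;
	bool throttled;
};

/* Toy task group with one cfs_rq per CPU. */
struct demo_tg {
	struct demo_cfs_rq cfs_rq[DEMO_NR_CPUS];
	bool linked;	/* still reachable via the task group list */
};

/* Models unregister_fair_sched_group(): unlink every per-CPU cfs_rq. */
static void demo_unregister(struct demo_tg *tg)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		tg->cfs_rq[cpu].on_list = false;
}

/* Models the period timer: unthrottle and re-add while tg is findable. */
static void demo_period_timer(struct demo_tg *tg)
{
	if (!tg->linked)
		return;
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++) {
		if (tg->cfs_rq[cpu].throttled) {
			tg->cfs_rq[cpu].throttled = false;
			tg->cfs_rq[cpu].on_list = true;
		}
	}
}

/* Counts cfs_rq's that are still linked when the group would be freed. */
static int demo_linked_at_free(const struct demo_tg *tg)
{
	int n = 0;

	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		n += tg->cfs_rq[cpu].on_list;
	return n;
}

/* Replays the offline path, with the timer before or inside the window. */
static int demo_offline(bool timer_in_window)
{
	struct demo_tg tg = { .linked = true };

	tg.cfs_rq[0].on_list = true;	/* runnable, on the list */
	tg.cfs_rq[1].throttled = true;	/* throttled, off the list */

	if (!timer_in_window)
		demo_period_timer(&tg);	/* harmless: teardown comes later */
	demo_unregister(&tg);
	if (timer_in_window)
		demo_period_timer(&tg);	/* race: tg is still linked */
	tg.linked = false;		/* the list_del_rcu() step */
	return demo_linked_at_free(&tg);
}
```

In this model demo_offline(false) leaves nothing linked, while
demo_offline(true) leaves one cfs_rq on the list at the point where the
real code would kfree() it.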

We tried the patch below which, unfortunately, doesn't fix the issue. So
there must be something else. :(

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 978460f891a1..afee07e9faf9 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
 {
 	unsigned long flags;

-	/* End participation in shares distribution: */
-	unregister_fair_sched_group(tg);
-
+	/*
+	 * Unlink first, to avoid walk_tg_tree_from() from finding us
+	 * (via sched_cfs_period_timer()).
+	 */
 	spin_lock_irqsave(&task_group_lock, flags);
 	list_del_rcu(&tg->list);
 	list_del_rcu(&tg->siblings);
 	spin_unlock_irqrestore(&task_group_lock, flags);
+
+	/* End participation in shares distribution: */
+	unregister_fair_sched_group(tg);
 }

 static void sched_change_group(struct task_struct *tsk, int type)

> [snip]
> 
> The possibilities for the current problem:
> 
> 1) Revert a7b359fc6a37 ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and its fixups.
> (Not exclusive with the other suggestions, rather a stop-gap for the
> time being.)
> 
> 2) Don't add offlined task_groups into the undecayed list
> - Your proposal with overloaded on_list=2 could serve as mark of that,
>   but it's a hack IMO.

> - Proper way (tm) would be to use css_tryget_online() and css_put() when
>   dealing with the list (my favorite at the moment).

That might work, as -- at least in Kevin's case -- it all gets triggered
by a dying cgroup.
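
As a rough illustration of the "only touch the list while the group is
online" idea, here is a user-space toy. It is not the kernel's
css_tryget_online()/css_put() implementation; the demo_* types and
functions are hypothetical, and the sketch ignores the atomicity that a
real version would need against a concurrent offline.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a cgroup_subsys_state; not the kernel structure. */
struct demo_css {
	int  refcnt;
	bool online;
};

struct demo_cfs_rq {
	struct demo_css *css;
	bool on_list;
};

/* Take a reference only if the group has not been offlined yet. */
static bool demo_tryget_online(struct demo_css *css)
{
	if (!css->online)
		return false;
	css->refcnt++;
	return true;
}

/* Drop a reference taken by demo_tryget_online(). */
static void demo_put(struct demo_css *css)
{
	assert(css->refcnt > 0);
	css->refcnt--;
}

/* Re-add the cfs_rq to the list only for an online group. */
static bool demo_list_add_leaf(struct demo_cfs_rq *cfs_rq)
{
	if (!demo_tryget_online(cfs_rq->css))
		return false;	/* dying group: leave it off the list */
	cfs_rq->on_list = true;
	demo_put(cfs_rq->css);
	return true;
}
```

Once the group is marked offline, demo_list_add_leaf() refuses to link the
cfs_rq, so an unthrottle arriving late would leave it off the list.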

> 3) Narrowing the race-window dramatically
> - that is by moving list removal from unregister_fair_sched_group() to
>   free_fair_sched_group(),

That might work, too. However, the unlinking needs protection against
the timer interrupt (and other sources?) which might try to re-add
entries. Or won't that happen any more, as at least one RCU GP has
passed? Anyhow, the kfree() calls likely would need to become
kfree_rcu() to handle concurrent traversal of cfs_rq's.

> - <del>or use list_empty(tg->list) as indicator whether we're working
>   with onlined task_group.</del> (won't work for RCU list)
> 
> 4) Rework how throttled load is handled (hand waving)
> - There is remove_entity_load_avg() that moves the load to parent upon
>   final removal. Maybe it could be generalized for temporary removals by
>   throttling (so that unthrottling could again add only non-empty
>   cfs_rqs to the list and undecayed load won't skew fairness).
> - or the way of [1].
> 
> 5) <your ideas>

It should be something that can be backported to stable kernels, as this
seems to affect v5.13, too.


Thanks,
Mathias

View attachment "5.14-sched-fair-dbg.diff" of type "text/x-patch" (2591 bytes)