Date:   Wed, 3 Nov 2021 15:16:33 +0100
From:   Mathias Krause <minipli@...ecurity.net>
To:     Michal Koutný <mkoutny@...e.com>
Cc:     Vincent Guittot <vincent.guittot@...aro.org>,
        Odin Ugedal <odin@...d.al>,
        Kevin Tanguy <kevin.tanguy@...p.ovh.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: task_group unthrottling and removal race (was Re: [PATCH]
 sched/fair: Use rq->lock when checking cfs_rq list presence)

On 03.11.21 at 12:10, Michal Koutný wrote:
> On Wed, Nov 03, 2021 at 11:51:12AM +0100, Mathias Krause <minipli@...ecurity.net> wrote:
>> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
>> will have finished what they're doing, so we can unlink safely. That
>> was, apparently, the missing piece.
> 
> What reader(s) are you referring to here? The
> list_for_each_entry_rcu(cfs_rq, &cfs_b->throttled_cfs_rq,
> throttled_list) in distribute_cfs_runtime()?

Let me quote a little bit more context here to explain my rationale:

>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 978460f891a1..afee07e9faf9 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -9506,13 +9506,17 @@ void sched_offline_group(struct task_group *tg)
>>  {
>>  	unsigned long flags;
>>
>> -	/* End participation in shares distribution: */
>> -	unregister_fair_sched_group(tg);
>> -
>> +	/*
>> +	 * Unlink first, to avoid walk_tg_tree_from() from finding us
>> +	 * (via sched_cfs_period_timer()).
>> +	 */

>>  	spin_lock_irqsave(&task_group_lock, flags);
>>  	list_del_rcu(&tg->list);
>>  	list_del_rcu(&tg->siblings);
>>  	spin_unlock_irqrestore(&task_group_lock, flags);

The above list_del_rcu() implies there are users that use RCU semantics
to walk the lists, namely tg->list and tg->siblings. And, in fact,
walk_tg_tree_from() does just that:

    [...]
    list_for_each_entry_rcu(child, &parent->children, siblings) {
        [...]
    }

Now, taking the task_group_lock in sched_offline_group() just prevents
multiple users from entering this code path. But walk_tg_tree_from()
does not take that lock, so it can walk the task group hierarchy
concurrently.
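
To make the ordering requirement explicit, here's the general RCU list
pattern at play (just a sketch using the names from above, not a patch;
the reader side is what walk_tg_tree_from() effectively does, the
updater side is sched_offline_group()):

    /* reader */
    rcu_read_lock();
    list_for_each_entry_rcu(child, &parent->children, siblings) {
        /* may still observe an entry that is being unlinked right now */
        [...]
    }
    rcu_read_unlock();

    /* updater */
    spin_lock_irqsave(&task_group_lock, flags);
    list_del_rcu(&tg->siblings);   /* pre-existing readers may still see tg */
    spin_unlock_irqrestore(&task_group_lock, flags);

    synchronize_rcu();             /* wait for those readers to finish */
    /* only now is it safe to tear tg down */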

Disabling IRQs also doesn't protect us from do_sched_cfs_period_timer()
running in parallel, as the timer interrupt might fire on a different
CPU than the one we're currently running on.

So, assuming the scenario in race.txt (sorry, my mailer is just too dumb
for bigger ASCII art), CPU2 will walk the task group list in parallel to
sched_offline_group(); specifically, it'll read the soon-to-be-unlinked
task group entry in (1). So removing it on CPU1 in (2) won't prevent
CPU2 from still passing it on to tg_unthrottle_up(). Assuming
unregister_fair_sched_group() on CPU1 now tries to unlink all cfs_rq's
via calls to list_del_leaf_cfs_rq(), CPU2 will re-add some in (3), which
is the cause of the UAF later on. Now, if we added a synchronize_rcu()
prior to calling unregister_fair_sched_group(), we would wait until the
walk on CPU2 has finished at (4). Afterwards each new walk won't see the
dying task group any more, thereby preventing the UAF from happening.
Or am I missing something?

>> +
>> +	/* End participation in shares distribution: */
> 
> Adding synchronize_rcu() here will ensure all concurrent RCU "readers"
> will have finished what they're doing, so we can unlink safely. That
> was, apparently, the missing piece.
> 
>> +	unregister_fair_sched_group(tg);
>>  }
>>
>>  static void sched_change_group(struct task_struct *tsk, int type)
>>

Btw, with the above patch (plus the warning debug patch) applied, your
reproducer has been running here for 3.5h without triggering any
warnings. It used to trigger them within the first few minutes.

> I think (not sure) that wouldn't work since the unthrottle_cfs_rq can
> still be called after this synchronize_rcu() but before
> free_fair_sched_group().

Sure, but it won't see the dying task group any more, so it won't re-add
cfs_rq's to it, preventing the UAF from happening.
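
To spell out the ordering I have in mind, here's a sketch only (the diff
from above plus the proposed synchronize_rcu(); not a tested patch in
this exact form):

void sched_offline_group(struct task_group *tg)
{
        unsigned long flags;

        /*
         * Unlink first, so new walk_tg_tree_from() walks (via
         * sched_cfs_period_timer()) can no longer find us.
         */
        spin_lock_irqsave(&task_group_lock, flags);
        list_del_rcu(&tg->list);
        list_del_rcu(&tg->siblings);
        spin_unlock_irqrestore(&task_group_lock, flags);

        /* Wait for walks that might have picked us up already. */
        synchronize_rcu();

        /* End participation in shares distribution: */
        unregister_fair_sched_group(tg);
}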

> (But if you considered update_blocked_averages() as the reader(s) and
> synchronize_rcu() within free_fair_sched_group(), that may ensure UBA
> won't step on a free'd cfs_rq (+unlinking would need to happen only in
> free_fair_sched_group() too.))

I'd prefer not linking cfs_rq's to a dead tg in the first place. Trying
to cover that up just before we're about to free the memory feels hacky.

Thanks,
Mathias

View attachment "race.txt" of type "text/plain" (1954 bytes)
