linux-kernel - Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

Message-ID: <87o906wimo.fsf@x220.int.ebiederm.org>
Date:   Fri, 30 Aug 2019 14:36:15 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Linus Torvalds <torvalds@...ux-foundation.org>
Cc:     Oleg Nesterov <oleg@...hat.com>,
        Russell King - ARM Linux admin <linux@...linux.org.uk>,
        Peter Zijlstra <peterz@...radead.org>,
        Chris Metcalf <cmetcalf@...hip.com>,
        Christoph Lameter <cl@...ux.com>,
        Kirill Tkhai <tkhai@...dex.ru>, Mike Galbraith <efault@....de>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...nel.org>,
        Linux List Kernel Mailing <linux-kernel@...r.kernel.org>
Subject: Re: [BUG] Use of probe_kernel_address() in task_rcu_dereference() without checking return value

Linus Torvalds <torvalds@...ux-foundation.org> writes:

> On Fri, Aug 30, 2019 at 9:10 AM Oleg Nesterov <oleg@...hat.com> wrote:
>>
>>
>> Yes, please see
>>
>>         [PATCH 2/3] introduce probe_slab_address()
>>         https://lore.kernel.org/lkml/20141027195425.GC11736@redhat.com/
>>
>> I sent 5 years ago ;) Do you think
>>
>>         /*
>>          * Same as probe_kernel_address(), but @addr must be the valid pointer
>>          * to a slab object, potentially freed/reused/unmapped.
>>          */
>>         #ifdef CONFIG_DEBUG_PAGEALLOC
>>         #define probe_slab_address(addr, retval)        \
>>                 probe_kernel_address(addr, retval)
>>         #else
>>         #define probe_slab_address(addr, retval)        \
>>                 ({                                                      \
>>                         (retval) = *(typeof(retval) *)(addr);           \
>>                         0;                                              \
>>                 })
>>         #endif
>>
>> can work?
>
> Ugh. I would much rather handle the general case, because honestly,
> tracing has had a lot of issues with our hacky "probe_kernel_read()"
> stuff that bases itself on user addresses.
>
> It's also one of the few remaining users of "set_fs()" in core code,
> and we really should try to get rid of those.
>
> So your code would work for this particular case, but not for other
> cases that can trap simply because the pointer isn't reliable (tracing
> being the main case for that - but if the source of the pointer itself
> might have been free'd, you would also have that situation).
>
> So I'd really prefer to have something a bit fancier. On most
> architectures, doing a good exception fixup for kernel addresses is
> really not that hard.
>
> On x86, for example, we actually have *exactly* that. The
> "__get_user_asm()" macro is basically it. It purely does a load
> instruction from an unchecked address.
>
> (It's a really odd syntax, but you could remove the __chk_user_ptr()
> from the __get_user_size() macro, and now you'd have basically a "any
> regular size kernel access with exception handlng").
>
> But yes, your hack is I guess optimal for this particular case where
> you simply can depend on "we know the pointer was valid, we just don't
> know if it was freed".
>
> Hmm. Don't we RCU-free the task struct? Because then we don't even
> need to care about CONFIG_DEBUG_PAGEALLOC. We can just always access
> the pointer as long as we have the RCU read lock. We do that in other
> cases.

Sort of.  The rcu delay starts when release_task queues
delayed_put_task_struct via call_rcu.  Which unfortunately means that
anytime after exit_notify, release_task can operate on a task.  So it is
possible that by the time do_task_dead is called the rcu grace period
is up.

Which is the problem the users of task_rcu_dereference are fighting.
They are performing an rcu walk on the set of cpus in task_numa_migrate
and in the userspace membarrier system calls.

For a short while we had the rcu delay in put_task_struct but that
required changes all over the place and was just a pain to work with.

Then I did:
> commit 8c7904a00b06d2ee51149794b619e07369fcf9d4
> Author: Eric W. Biederman <ebiederm@...ssion.com>
> Date:   Fri Mar 31 02:31:37 2006 -0800
> 
>     [PATCH] task: RCU protect task->usage
>     
>     A big problem with rcu protected data structures that are also reference
>     counted is that you must jump through several hoops to increase the reference
>     count.  I think someone finally implemented atomic_inc_not_zero(&count) to
>     automate the common case.  Unfortunately this means you must special case the
>     rcu access case.
>     
>     When data structures are only visible via rcu in a manner that is not
>     determined by the reference count on the object (i.e.  tasks are visible until
>     their zombies are reaped) there is a much simpler technique we can employ.
>     Simply delaying the decrement of the reference count until the rcu interval is
>     over.
>     
>     What that means is that the proc code that looks up a task and later
>     wants to sleep can now do:
>     
>     rcu_read_lock();
>     task = find_task_by_pid(some_pid);
>     if (task) {
>             get_task_struct(task);
>     }
>     rcu_read_unlock();
>     
>     The effect on the rest of the kernel is that put_task_struct becomes cheaper
>     and immediate, and in the case where the task has been reaped it frees the
>     task immediate instead of unnecessarily waiting an until the rcu interval is
>     over.
>     
>     Cleanup of task_struct does not happen when its reference count drops to
>     zero, instead cleanup happens when release_task is called.  Tasks can only
>     be looked up via rcu before release_task is called.  All rcu protected
>     members of task_struct are freed by release_task.
>     
>     Therefore we can move call_rcu from put_task_struct into release_task.  And
>     we can modify release_task to not immediately release the reference count
>     but instead have it call put_task_struct from the function it gives to
>     call_rcu.
>     
>     The end result:
>     
>     - get_task_struct is safe in an rcu context where we have just looked
>       up the task.
>     
>     - put_task_struct() simplifies into its old pre rcu self.
>     
>     This reorganization also makes put_task_struct uncallable from modules as
>     it is not exported but it does not appear to be called from any modules so
>     this should not be an issue, and is trivially fixed.
>     
>     Signed-off-by: Eric W. Biederman <ebiederm@...ssion.com>
>     Signed-off-by: Andrew Morton <akpm@...l.org>
>     Signed-off-by: Linus Torvalds <torvalds@...l.org>

About a decade later task_struct grew some new rcu users and Oleg
introduced task_rcu_dereference to deal with them:

> commit 150593bf869393d10a79f6bd3df2585ecc20a9bb
> Author: Oleg Nesterov <oleg@...hat.com>
> Date:   Wed May 18 19:02:18 2016 +0200
> 
>     sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
>     
>     Generally task_struct is only protected by RCU if it was found on a
>     RCU protected list (say, for_each_process() or find_task_by_vpid()).
>     
>     As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
>     drops the (potentially) last reference without RCU gp, this means
>     that we need to fix the code which uses foreign_rq->curr under
>     rcu_read_lock().
>     
>     Add a new helper which can be used to dereference rq->curr or any
>     other pointer to task_struct assuming that it should be cleared or
>     updated before the final put_task_struct(). It returns non-NULL
>     only if this task can't go away before rcu_read_unlock().
>     
>     ( Also add try_get_task_struct() to make it easier to use this API
>       correctly. )

So I think it makes a lot of sense to change how we do this.  Either
moving the rcu delay back into put_task_struct or going halfway and
creating a put_dead_task_struct that will perform the rcu delay after
a task has been taken off the run queues and has stopped being a zombie.
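
To make the "halfway" idea concrete, here is a minimal single-threaded
userspace sketch of the decision such a helper would make.  It is only a
model, not kernel code: every model_* and MODEL_* name is hypothetical, the
pi_lock is not modeled, and a counter stands in for call_rcu.  It assumes
the task holds two references, one dropped by the reaper and one by the
scheduler, and shows that whichever side runs last is the one that takes
the rcu-delayed path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for task->state and task->exit_state values. */
enum { MODEL_TASK_RUNNING, MODEL_TASK_DEAD };
enum { MODEL_EXIT_ZOMBIE, MODEL_EXIT_DEAD, MODEL_EXIT_RCU };

struct model_task {
	int state;              /* scheduler side: has the task run for the last time? */
	atomic_int exit_state;  /* reaper side: zombie, reaped, or handed to RCU */
	int usage;              /* reference count */
	int rcu_queued;         /* how many times the delayed put was queued */
};

/* Models the helper: delay only if dead AND reaped, and at most once. */
static void model_put_dead_task(struct model_task *t)
{
	int expected = MODEL_EXIT_DEAD;

	if (t->state == MODEL_TASK_DEAD &&
	    atomic_compare_exchange_strong(&t->exit_state, &expected, MODEL_EXIT_RCU))
		t->rcu_queued++;   /* stands in for call_rcu(..., delayed_put_task_struct) */
	else
		t->usage--;        /* stands in for an immediate put_task_struct() */
}

/* Runs both call sites in either order; the task starts with two references. */
static void model_run(struct model_task *t, bool reap_first)
{
	t->state = reap_first ? MODEL_TASK_RUNNING : MODEL_TASK_DEAD;
	atomic_init(&t->exit_state, reap_first ? MODEL_EXIT_DEAD : MODEL_EXIT_ZOMBIE);
	t->usage = 2;
	t->rcu_queued = 0;

	model_put_dead_task(t);            /* first caller: only one condition holds */

	if (reap_first)
		t->state = MODEL_TASK_DEAD;    /* final context switch happens */
	else
		atomic_store(&t->exit_state, MODEL_EXIT_DEAD);  /* parent reaps the zombie */

	model_put_dead_task(t);            /* second caller: both conditions hold */
}
```

Calling model_run() with reap_first set either way should leave exactly one
queued rcu callback and one remaining reference, which is the property the
compare-and-exchange on exit_state is meant to provide.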

Something like:
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 0497091e40c1..bf323418094e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -115,7 +115,7 @@ static inline void put_task_struct(struct task_struct *t)
 		__put_task_struct(t);
 }
 
-struct task_struct *task_rcu_dereference(struct task_struct **ptask);
+void put_dead_task_struct(struct task_struct *task);
 
 #ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
 extern int arch_task_struct_size __read_mostly;
diff --git a/kernel/exit.c b/kernel/exit.c
index 5b4a5dcce8f8..3a85bc2e8031 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -182,6 +182,24 @@ static void delayed_put_task_struct(struct rcu_head *rhp)
 	put_task_struct(tsk);
 }
 
+void put_dead_task_struct(struct task_struct *task)
+{
+	bool delay = false;
+	unsigned long flags;
+
+	/* Is the task both reaped and no longer being scheduled? */
+	raw_spin_lock_irqsave(&task->pi_lock, flags);
+	if ((task->state == TASK_DEAD) &&
+	    (cmpxchg(&task->exit_state, EXIT_DEAD, EXIT_RCU) == EXIT_DEAD))
+		delay = true;
+	raw_spin_unlock_irqrestore(&task->pi_lock, flags);
+
+	/* If both are true, rcu delay the put_task_struct */
+	if (delay)
+		call_rcu(&task->rcu, delayed_put_task_struct);
+	else
+		put_task_struct(task);
+}
 
 void release_task(struct task_struct *p)
 {
@@ -222,76 +240,13 @@ void release_task(struct task_struct *p)
 
 	write_unlock_irq(&tasklist_lock);
 	release_thread(p);
-	call_rcu(&p->rcu, delayed_put_task_struct);
+	put_dead_task_struct(p);
 
 	p = leader;
 	if (unlikely(zap_leader))
 		goto repeat;
 }
 
-/*
- * Note that if this function returns a valid task_struct pointer (!NULL)
- * task->usage must remain >0 for the duration of the RCU critical section.
- */
-struct task_struct *task_rcu_dereference(struct task_struct **ptask)
-{
-	struct sighand_struct *sighand;
-	struct task_struct *task;
-
-	/*
-	 * We need to verify that release_task() was not called and thus
-	 * delayed_put_task_struct() can't run and drop the last reference
-	 * before rcu_read_unlock(). We check task->sighand != NULL,
-	 * but we can read the already freed and reused memory.
-	 */
-retry:
-	task = rcu_dereference(*ptask);
-	if (!task)
-		return NULL;
-
-	probe_kernel_address(&task->sighand, sighand);
-
-	/*
-	 * Pairs with atomic_dec_and_test() in put_task_struct(). If this task
-	 * was already freed we can not miss the preceding update of this
-	 * pointer.
-	 */
-	smp_rmb();
-	if (unlikely(task != READ_ONCE(*ptask)))
-		goto retry;
-
-	/*
-	 * We've re-checked that "task == *ptask", now we have two different
-	 * cases:
-	 *
-	 * 1. This is actually the same task/task_struct. In this case
-	 *    sighand != NULL tells us it is still alive.
-	 *
-	 * 2. This is another task which got the same memory for task_struct.
-	 *    We can't know this of course, and we can not trust
-	 *    sighand != NULL.
-	 *
-	 *    In this case we actually return a random value, but this is
-	 *    correct.
-	 *
-	 *    If we return NULL - we can pretend that we actually noticed that
-	 *    *ptask was updated when the previous task has exited. Or pretend
-	 *    that probe_slab_address(&sighand) reads NULL.
-	 *
-	 *    If we return the new task (because sighand is not NULL for any
-	 *    reason) - this is fine too. This (new) task can't go away before
-	 *    another gp pass.
-	 *
-	 *    And note: We could even eliminate the false positive if re-read
-	 *    task->sighand once again to avoid the falsely NULL. But this case
-	 *    is very unlikely so we don't care.
-	 */
-	if (!sighand)
-		return NULL;
-
-	return task;
-}
-
 void rcuwait_wake_up(struct rcuwait *w)
 {
 	struct task_struct *task;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2b037f195473..5b697c0572ce 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 		/* Task is done with its stack. */
 		put_task_stack(prev);
 
-		put_task_struct(prev);
+		put_dead_task_struct(prev);
 	}
 
 	tick_nohz_task_switch();
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc9cfeaac8bd..c3e1a302211a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1644,7 +1644,7 @@ static void task_numa_compare(struct task_numa_env *env,
 		return;
 
 	rcu_read_lock();
-	cur = task_rcu_dereference(&dst_rq->curr);
+	cur = rcu_dereference(dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
 		cur = NULL;
 
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index aa8d75804108..74df8e0dfc84 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -71,7 +71,7 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm && (atomic_read(&p->mm->membarrier_state) &
 				   MEMBARRIER_STATE_GLOBAL_EXPEDITED)) {
 			if (!fallback)
@@ -150,7 +150,7 @@ static int membarrier_private_expedited(int flags)
 		if (cpu == raw_smp_processor_id())
 			continue;
 		rcu_read_lock();
-		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
+		p = rcu_dereference(cpu_rq(cpu)->curr);
 		if (p && p->mm == current->mm) {
 			if (!fallback)
 				__cpumask_set_cpu(cpu, tmpmask);



Eric