linux-kernel - Re: KASAN: use-after-free Read in task_is

Message-ID: <20181025155503.GF3725@redhat.com>
Date:   Thu, 25 Oct 2018 17:55:04 +0200
From:   Oleg Nesterov <oleg@...hat.com>
To:     Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Cc:     serge@...lyn.com,
        syzbot <syzbot+a9ac39bf55329e206219@...kaller.appspotmail.com>,
        jmorris@...ei.org, keescook@...omium.org,
        linux-kernel@...r.kernel.org,
        linux-security-module@...r.kernel.org,
        syzkaller-bugs@...glegroups.com
Subject: Re: KASAN: use-after-free Read in task_is_descendant

On 10/25, Tetsuo Handa wrote:
>
> On 2018/10/25 21:17, Oleg Nesterov wrote:
> >>> And yes, task_is_descendant() can hit the dead child, if nothing else it can
> >>> be killed. This can explain the kasan report.
> >>
> >> The kasan is reporting that child->real_parent (or maybe child->real_parent->real_parent
> >> or child->real_parent->real_parent->real_parent ...) was pointing to already freed memory,
> >> isn't it?
> >
> > Yes. and you know, I am all confused. I no longer can understand you :/
>
> Why don't we need to check every time like shown below?
> Why checking only once is sufficient?

Why do you think it is not sufficient?

Again, I can be easily wrong, rcu is not simple, but so far I think we need
a single check at the start.

> --- a/security/yama/yama_lsm.c
> +++ b/security/yama/yama_lsm.c
> @@ -285,7 +285,7 @@ static int task_is_descendant(struct task_struct *parent,
>  	rcu_read_lock();
>  	if (!thread_group_leader(parent))
>  		parent = rcu_dereference(parent->group_leader);
> -	while (walker->pid > 0) {
> +	while (pid_alive(walker) && walker->pid > 0) {

OK. To simplify, let's suppose that task_is_descendant() is called with
tasklist_lock held, and that all tasks are single-threaded.

Then we obviously need only a single check at the start: we need to ensure that
the child was not removed from its ->real_parent->children list. As long as it
is still on that list, if ->real_parent exits, the child will be re-parented and
its ->real_parent will be updated.

So we could do

	read_lock(&tasklist_lock);

	if (list_empty(&child->sibling))
		// it is dead, removed from ->children list, we can't trust
		// child->real_parent
		return -EWHATEVER;

	task_is_descendant(current, child);

But note that we can safely use pid_alive(child) instead, because detach_pid()
and list_del_init(&p->sibling) happen "at the same time" while we hold
tasklist_lock.

(And btw, I suggested several times to rename it, or add another helper with
 a better name. Note also that we could check, say, ->sighand != NULL with
 the same effect.)
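
To make the locked case concrete, here is a minimal userspace sketch, not
kernel code. struct task, exit_task() and is_descendant() are hypothetical
names used only for illustration, and exit plus release are folded into one
step. It assumes everything runs "under tasklist_lock", so each function is
atomic with respect to the others: a live child is always re-parented when its
parent goes away, and only an already dead child keeps a stale ->real_parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; not the kernel's layout. */
struct task {
	int pid;                   /* 0 marks the root of the tree */
	bool alive;                /* models pid_alive() / !list_empty(&p->sibling) */
	struct task *real_parent;
};

/*
 * Models a task going away with tasklist_lock held (oversimplified):
 * every child that is still linked is re-parented, then the task itself
 * is unlinked. A child that was unlinked earlier is no longer on the
 * ->children list, so its ->real_parent is left pointing at p.
 */
void exit_task(struct task *tasks, size_t n, struct task *p)
{
	assert(p->real_parent != NULL);  /* the root never exits in this model */

	for (size_t i = 0; i < n; i++)
		if (tasks[i].alive && tasks[i].real_parent == p)
			tasks[i].real_parent = p->real_parent;
	p->alive = false;
}

/* Returns 1 if child descends from parent, 0 if not, -1 if child is dead. */
int is_descendant(const struct task *parent, const struct task *child)
{
	const struct task *walker = child;

	if (!walker->alive)        /* the single check at the start */
		return -1;

	while (walker->pid > 0) {
		if (walker == parent)
			return 1;
		walker = walker->real_parent;
	}
	return 0;
}
```

In this model a live child never has a dead ancestor in its chain, because
exit_task() fixes up ->real_parent before unlinking. A dead child may point at
anything, which is why the check is done once, on the child, before the walk.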

Now. Why do you think rcu_read_lock() differs in that we need to check
pid_alive() at every step?

Suppose that one of the grandparents exits, and it is going to be freed. Again,
to (over)simplify things, let's suppose that release_task() does

	synchronize_rcu();
	free_task(p);

at the end. Now, can

	rcu_read_lock();
	if (pid_alive(child)) {
		while (child->pid)
			child = child->real_parent;
	}
	rcu_read_unlock();

hit the already freed ->real_parent? Say, the freed child->real_parent->real_parent.

Let's denote P1 = child->real_parent, P2 = P1->real_parent. Can P2 be already freed?

This is only possible if synchronize_rcu() above was called before rcu_read_lock(),
see the last sentence below.

If P1->real_parent is still P2 after P2 was released, then P1 was not
re-parented, so P1 has already exited too. And we still observe that
child->real_parent == P1; this too is only possible if child has exited, so we
must see pid_alive() == F.

Why must we see pid_alive() == F without tasklist_lock? It must be true, since
release_task() is serialized by tasklist_lock, but why can't we get a stale
value under rcu_read_lock()?

Because our RCU read-side critical section extends beyond the return from
synchronize_rcu(), and thus we must have a full memory barrier _between_
that synchronize_rcu() and our rcu_read_lock(). We must see all memory updates,
including thread_pid = NULL which makes pid_alive() == F.

Do you see any hole?

Oleg.