linux-kernel - Re: arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180215142239.GF16623@arm.com>
Date:   Thu, 15 Feb 2018 14:22:39 +0000
From:   Will Deacon <will.deacon@....com>
To:     Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
Cc:     Mark Rutland <mark.rutland@....com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Peter Zijlstra <peterz@...radead.org>
Subject: Re: arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch

On Wed, Feb 14, 2018 at 06:53:44PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 14, 2018, at 11:51 AM, Mark Rutland mark.rutland@....com wrote:
> > On Wed, Feb 14, 2018 at 03:07:41PM +0000, Will Deacon wrote:
> >> If the exit()ing task had recently migrated from another CPU, then that
> >> CPU could concurrently run context_switch() and take this path:
> >> 
> >> 	if (!prev->mm) {
> >> 		prev->active_mm = NULL;
> >> 		rq->prev_mm = oldmm;
> >> 	}
> > 
> > IIUC, on the prior context_switch, next->mm == NULL, so we set
> > next->active_mm to prev->mm.
> > 
> > Then, in this context_switch we set oldmm = prev->active_mm (where prev
> > is next from the prior context switch).
> > 
> > ... right?
> > 
> >> which then means finish_task_switch will call mmdrop():
> >> 
> >> 	struct mm_struct *mm = rq->prev_mm;
> >> 	[...]
> >> 	if (mm) {
> >> 		membarrier_mm_sync_core_before_usermode(mm);
> >> 		mmdrop(mm);
> >> 	}
> > 
> > ... then here we use what was prev->active_mm in the most recent context
> > switch.
> > 
> > So AFAICT, we're never concurrently accessing a task_struct::mm field
> > here, only prev::{mm,active_mm} while prev is current...
> > 
> > [...]
> > 
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 995453d9fb55..f91e8d56b03f 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -534,8 +534,9 @@ static void exit_mm(void)
> >>         }
> >>         mmgrab(mm);
> >>         BUG_ON(mm != current->active_mm);
> >> -       /* more a memory barrier than a real lock */
> >>         task_lock(current);
> >> +       /* Ensure we've grabbed the mm before setting current->mm to NULL */
> >> +       smp_mb__after_spin_lock();
> >>         current->mm = NULL;
> > 
> > ... and thus I don't follow why we would need to order these with
> > anything more than a compiler barrier (if we're preemptible here).
> > 
> > What have I completely misunderstood? ;)
> 
> The compiler barrier would not change anything, because task_lock()
> already implies a compiler barrier (provided by the arch spin lock
> inline asm memory clobber). So compiler-wise, it cannot move the
> mmgrab(mm) after the store "current->mm = NULL".
> 
> However, given the scenario involves multiples CPUs (one doing exit_mm(),
> the other doing context switch), the actual order of perceived load/store
> can be shuffled. And AFAIU nothing prevents the CPU from ordering the
> atomic_inc() done by mmgrab(mm) _after_ the store to current->mm.

Mark and I have spent most of the morning looking at this and realised I
made a mistake in my original guesswork: prev can't migrate until half way
down finish_task_switch when on_cpu = 0, so the access of prev->mm in
context_switch can't race with exit_mm() for that task.

Furthermore, although the mmgrab() could in theory be reordered with
current->mm = NULL (and the ARMv8 architecture allows this too), it's
pretty unlikely with LL/SC atomics and the backwards branch, where the
CPU would have to pull off quite a few tricks for this to happen.

Instead, we've come up with a more plausible sequence that can in theory
happen on a single CPU:

<task foo calls exit()>

do_exit
	exit_mm
		mmgrab(mm);			// foo's mm has count +1
		BUG_ON(mm != current->active_mm);
		task_lock(current);
		current->mm = NULL;
		task_unlock(current);

<irq and ctxsw to kthread>

context_switch(prev=foo, next=kthread)
	mm = next->mm;
	oldmm = prev->active_mm;

	if (!mm) {				// True for kthread
		next->active_mm = oldmm;
		mmgrab(oldmm);			// foo's mm has count +2
	}

	if (!prev->mm) {			// True for foo
		rq->prev_mm = oldmm;
	}

	finish_task_switch
		mm = rq->prev_mm;
		if (mm) {			// True (foo's mm)
			mmdrop(mm);		// foo's mm has count +1
		}

	[...]

<ctxsw to task bar>

context_switch(prev=kthread, next=bar)
	mm = next->mm;
	oldmm = prev->active_mm;		// foo's mm!

	if (!prev->mm) {			// True for kthread
		rq->prev_mm = oldmm;
	}

	finish_task_switch
		mm = rq->prev_mm;
		if (mm) {			// True (foo's mm)
			mmdrop(mm);		// foo's mm has count +0
		}

	[...]

<ctxsw back to task foo>

context_switch(prev=bar, next=foo)
	mm = next->mm;
	oldmm = prev->active_mm;

	if (!mm) {				// True for foo
		next->active_mm = oldmm;	// This is bar's mm
		mmgrab(oldmm);			// bar's mm has count +1
	}


	[return back to exit_mm]

mmdrop(mm);					// foo's mm has count -1

At this point, we've got an imbalanced count on the mm and could free it
prematurely as seen in the KASAN log. A subsequent context-switch away
from foo would therefore result in a use-after-free.

Assuming others agree with this diagnosis, I'm not sure how to fix it.
It's basically not safe to set current->mm = NULL with preemption enabled.

Will