Message-ID: <20170727065023.GB20970@dhcp22.suse.cz>
Date:   Thu, 27 Jul 2017 08:50:24 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Andrea Arcangeli <aarcange@...hat.com>
Cc:     "Kirill A. Shutemov" <kirill@...temov.name>,
        Andrew Morton <akpm@...ux-foundation.org>,
        David Rientjes <rientjes@...gle.com>,
        Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
        Oleg Nesterov <oleg@...hat.com>,
        Hugh Dickins <hughd@...gle.com>, linux-mm@...ck.org,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

On Wed 26-07-17 18:29:12, Andrea Arcangeli wrote:
> On Wed, Jul 26, 2017 at 07:45:57AM +0200, Michal Hocko wrote:
> > On Tue 25-07-17 21:19:52, Andrea Arcangeli wrote:
> > > On Tue, Jul 25, 2017 at 06:04:00PM +0200, Michal Hocko wrote:
> > > > -	down_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		down_write(&mm->mmap_sem);
> > > >  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
> > > >  	tlb_finish_mmu(&tlb, 0, -1);
> > > >  
> > > > @@ -3012,7 +3014,8 @@ void exit_mmap(struct mm_struct *mm)
> > > >  	}
> > > >  	mm->mmap = NULL;
> > > >  	vm_unacct_memory(nr_accounted);
> > > > -	up_write(&mm->mmap_sem);
> > > > +	if (tsk_is_oom_victim(current))
> > > > +		up_write(&mm->mmap_sem);
> > > 
> > > How is this possibly safe? mark_oom_victim can run while exit_mmap is
> > > running.
> > 
> > I believe it cannot. We always call mark_oom_victim (on !current) with
> > task_lock held and check that task->mm != NULL, and we call do_exit->mmput
> > only after mm is set to NULL under the same lock.
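
A simplified sketch of the two sides (not a verbatim excerpt; for a
!current victim mark_oom_victim is reached via find_lock_task_mm, so
task_lock is held and ->mm was seen non-NULL):

	/* OOM killer side (oom_kill_process, simplified) */
	p = find_lock_task_mm(victim);	/* task_lock(p) held, p->mm != NULL */
	if (p) {
		mark_oom_victim(p);	/* also pins p->signal->oom_mm */
		task_unlock(p);
	}

	/* exit side (exit_mm, simplified) */
	task_lock(current);
	current->mm = NULL;		/* OOM killer sees NULL ->mm from now on */
	task_unlock(current);
	mmput(mm);			/* -> __mmput -> exit_mmap */
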
> 
> Holding the mmap_sem for writing and setting mm->mmap to NULL to
> filter which tasks already released the mmap_sem for writing post
> free_pgtables still look unnecessary to solve this.
> 
> Using MMF_OOM_SKIP as the flag had the side effect of oom_badness()
> skipping it, but we can use the same tsk_is_oom_victim instead and rely
> on the locking in mark_oom_victim you pointed out above instead of the
> test_and_set_bit of my patch, because current->mm is already NULL at
> that point.
> 
> A remaining race, in light of the above, is this: because current->mm is
> NULL by the time mmput is called, how can you start oom_reap_task on a
> process with current->mm NULL that called the last mmput and is blocked
> in exit_aio?

Because we have that mm available. See tsk->signal->oom_mm in
oom_reap_task.
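
Roughly (a simplified excerpt of the current code, for illustration):

	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		/* oom_mm was pinned by mark_oom_victim, independent of tsk->mm */
		struct mm_struct *mm = tsk->signal->oom_mm;

		/* Retry the down_read_trylock(mmap_sem) a few times */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !__oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ/10);
		[...]
	}

so the oom reaper never goes through tsk->mm at all.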

> It looks like no false positive can get fixed until this
> is solved first because 
> 
> Isn't this enough? If it is, it avoids other modifications to the
> exit_mmap runtime that look unnecessary: mm->mmap = NULL is replaced
> by MMF_OOM_SKIP (which has to be set anyway by __mmput later), and the
> extra branch to call the up_write goes away.
> 
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f19efcf75418..bdab595ce25c 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2993,6 +2993,23 @@ void exit_mmap(struct mm_struct *mm)
>  	/* Use -1 here to ensure all VMAs in the mm are unmapped */
>  	unmap_vmas(&tlb, vma, 0, -1);
>  
> +	set_bit(MMF_OOM_SKIP, &mm->flags);
> +	if (tsk_is_oom_victim(current)) {
> +		/*
> +		 * Wait for oom_reap_task() to stop working on this
> +		 * mm. Because MMF_OOM_SKIP is already set before
> +		 * calling down_read(), oom_reap_task() will not run
> +		 * on this "mm" post up_write().
> +		 *
> +		 * tsk_is_oom_victim() cannot become true under us
> +		 * either, because current->mm is already set to NULL
> +		 * under task_lock before calling mmput, and oom_mm is
> +		 * set non-NULL by the OOM killer only if current->mm
> +		 * is found non-NULL while holding the task_lock.
> +		 */
> +		down_write(&mm->mmap_sem);
> +		up_write(&mm->mmap_sem);
> +	}
>  	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>  	tlb_finish_mmu(&tlb, 0, -1);

Yes, this will work and it won't depend on the oom_lock. But isn't it
just uglier than simply doing

	if (tsk_is_oom_victim(current)) {
		down_write(&mm->mmap_sem);
		locked = true;
	}
	free_pgtables(...)
	[...]
	if (locked)
		up_write(&mm->mmap_sem);

In general I do not like empty locked sections much, to be honest. With
the conditional locking, my patch now looks as follows. It should be
pretty much equivalent to yours. Would that be acceptable to you, or do
you think there is a strong reason to go with your version?
---
From 2198654be88d11efb1f372e8579761f65e219206 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.com>
Date: Thu, 27 Jul 2017 08:48:15 +0200
Subject: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

David has noticed that the oom killer might kill additional tasks while
the exiting oom victim hasn't terminated yet, because the oom_reaper marks
the current victim MMF_OOM_SKIP too early, when mm->mm_users has dropped
down to 0. The race is as follows:

oom_reap_task				do_exit
					  exit_mm
  __oom_reap_task_mm
					    mmput
					      __mmput
    mmget_not_zero # fails
    						exit_mmap # frees memory
  set_bit(MMF_OOM_SKIP)

The victim is still visible to the OOM killer until it is unhashed.

Currently we try to reduce the risk of this race by taking oom_lock
and waiting for the out_of_memory sleep while holding the lock, to give
the victim some time to exit. This is quite a suboptimal approach because
there is no guarantee the victim (especially a large one) will manage
to unmap its address space and free enough memory for the particular oom
domain which needs the memory (e.g. a specific NUMA node).

Fix this problem by allowing __oom_reap_task_mm and the __mmput path to
race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
to run in parallel with other unmappers (hence the mmap_sem for read).

The only tricky part is to exclude page table teardown and all
operations which modify the address space in the __mmput path. exit_mmap
doesn't expect any other users, so it doesn't use any locking. Nothing
really forbids us from using mmap_sem for write, though. In fact we are
already relying on this lock earlier in the __mmput path to synchronize
with ksm and khugepaged.

Take the exclusive mmap_sem when calling free_pgtables and destroying
VMAs, to synchronize with __oom_reap_task_mm which takes the lock for
read. All other operations can safely race with the parallel unmap.

Changes v1
- bail on null mm->mmap early as per David Rientjes
- take exclusive mmap_sem in exit_mmap only for oom victims to reduce
  the lock overhead

Reported-by: David Rientjes <rientjes@...gle.com>
Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
Signed-off-by: Michal Hocko <mhocko@...e.com>
---
 mm/mmap.c     | 16 ++++++++++++++++
 mm/oom_kill.c | 47 ++++++++---------------------------------------
 2 files changed, 24 insertions(+), 39 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 24e9261bdcc0..822e8860b9d2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -44,6 +44,7 @@
 #include <linux/userfaultfd_k.h>
 #include <linux/moduleparam.h>
 #include <linux/pkeys.h>
+#include <linux/oom.h>
 
 #include <linux/uaccess.h>
 #include <asm/cacheflush.h>
@@ -2967,6 +2968,7 @@ void exit_mmap(struct mm_struct *mm)
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
 	unsigned long nr_accounted = 0;
+	bool locked = false;
 
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
@@ -2993,6 +2995,17 @@ void exit_mmap(struct mm_struct *mm)
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
 
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables or unmap VMAs under its feet.
+	 * Please note that mark_oom_victim is always called under task_lock
+	 * with tsk->mm != NULL checked on !current tasks, which synchronizes
+	 * with exit_mm and so we cannot race here.
+	 */
+	if (tsk_is_oom_victim(current)) {
+		down_write(&mm->mmap_sem);
+		locked = true;
+	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
 
@@ -3005,7 +3018,10 @@ void exit_mmap(struct mm_struct *mm)
 			nr_accounted += vma_pages(vma);
 		vma = remove_vma(vma);
 	}
+	mm->mmap = NULL;
 	vm_unacct_memory(nr_accounted);
+	if (locked)
+		up_write(&mm->mmap_sem);
 }
 
 /* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 9e8b4f030c1c..b1c96e1910f2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -470,40 +470,15 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
 	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
-	bool ret = true;
-
-	/*
-	 * We have to make sure to not race with the victim exit path
-	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
-	 *   mmget_not_zero
-	 *				  mmput
-	 *				    atomic_dec_and_test
-	 *				  exit_oom_victim
-	 *				[...]
-	 *				out_of_memory
-	 *				  select_bad_process
-	 *				    # no TIF_MEMDIE task selects new victim
-	 *  unmap_page_range # frees some memory
-	 */
-	mutex_lock(&oom_lock);
 
 	if (!down_read_trylock(&mm->mmap_sem)) {
-		ret = false;
 		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
+		return false;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
-		up_read(&mm->mmap_sem);
-		trace_skip_task_reaping(tsk->pid);
-		goto unlock_oom;
-	}
+	/* There is nothing to reap so bail out without signs in the log */
+	if (!mm->mmap)
+		goto unlock;
 
 	trace_start_task_reaping(tsk->pid);
 
@@ -540,18 +515,12 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_ANONPAGES)),
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
-	up_read(&mm->mmap_sem);
 
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 	trace_finish_task_reaping(tsk->pid);
-unlock_oom:
-	mutex_unlock(&oom_lock);
-	return ret;
+unlock:
+	up_read(&mm->mmap_sem);
+
+	return true;
 }
 
 #define MAX_OOM_REAP_RETRIES 10
-- 
2.13.2

-- 
Michal Hocko
SUSE Labs
