[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180420124044.GA17484@dhcp22.suse.cz>
Date: Fri, 20 Apr 2018 14:40:44 +0200
From: Michal Hocko <mhocko@...nel.org>
To: David Rientjes <rientjes@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
Andrea Arcangeli <aarcange@...hat.com>,
Roman Gushchin <guro@...com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
On Fri 20-04-18 10:23:49, Michal Hocko wrote:
> On Thu 19-04-18 12:34:53, David Rientjes wrote:
[...]
> > I understand the concern, but it's the difference between the victim
> > getting stuck in exit_mmap() and actually taking a long time to free its
> > memory in exit_mmap(). I don't have evidence of the former.
>
> I do not really want to repeat myself. The primary purpose of the oom
> reaper is to provide a _guarantee_ of the forward progress. I do not
> care whether there is any evidences. All I know that lock_page has
> plethora of different dependencies and we cannot clearly state this is
> safe so we _must not_ depend on it when setting MMF_OOM_SKIP.
>
> The way how the oom path was fragile and lockup prone based on
> optimistic assumptions shouldn't be repeated.
>
> That being said, I haven't heard any actual technical argument about why
> locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP
> dependency on the page_lock really concerns me so
>
> Nacked-by: Michal Hocko <mhocko@...e.com>
>
> If you want to keep the current locking protocol then you really have to
> make sure that the oom reaper will set MMF_OOM_SKIP when racing with
> exit_mmap.
So here is my suggestion for the fix.
>From 37ab85d619f508ceaf829e57648a3d986c6d8bfc Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.com>
Date: Fri, 20 Apr 2018 13:53:08 +0200
Subject: [PATCH] oom: mm, oom: fix concurrent munlock and oom reaper unmap
David has noticed that our current assumption that the oom reaper can
race with exit_mmap is not correct. munlock_vma_pages_all() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.
This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up deferencing a NULL ptl.
Fix this by taking the exclusive mmap_sem in exit_mmap while tearing
down the address space. Once that is done MMF_OOM_SKIP is set so that
__oom_reap_task_mm can back off if it manages to take the read lock
finally.
Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable
Signed-off-by: Michal Hocko <mhocko@...e.com>
---
mm/mmap.c | 36 ++++++++++++++++++------------------
mm/oom_kill.c | 2 +-
2 files changed, 19 insertions(+), 19 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index faf85699f1a1..216efa6d9f61 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3004,10 +3004,21 @@ void exit_mmap(struct mm_struct *mm)
struct mmu_gather tlb;
struct vm_area_struct *vma;
unsigned long nr_accounted = 0;
+ bool locked = false;
/* mm's last user has gone, and its about to be pulled down */
mmu_notifier_release(mm);
+ /*
+ * The mm is not accessible for anybody except for the oom reaper
+ * which cannot race with munlocking so make sure we exclude the
+ * two.
+ */
+ if (unlikely(mm_is_oom_victim(mm))) {
+ down_write(&mm->mmap_sem);
+ locked = true;
+ }
+
if (mm->locked_vm) {
vma = mm->mmap;
while (vma) {
@@ -3021,7 +3032,7 @@ void exit_mmap(struct mm_struct *mm)
vma = mm->mmap;
if (!vma) /* Can happen if dup_mmap() received an OOM */
- return;
+ goto out_unlock;
lru_add_drain();
flush_cache_mm(mm);
@@ -3030,23 +3041,6 @@ void exit_mmap(struct mm_struct *mm)
/* Use -1 here to ensure all VMAs in the mm are unmapped */
unmap_vmas(&tlb, vma, 0, -1);
- if (unlikely(mm_is_oom_victim(mm))) {
- /*
- * Wait for oom_reap_task() to stop working on this
- * mm. Because MMF_OOM_SKIP is already set before
- * calling down_read(), oom_reap_task() will not run
- * on this "mm" post up_write().
- *
- * mm_is_oom_victim() cannot be set from under us
- * either because victim->mm is already set to NULL
- * under task_lock before calling mmput and oom_mm is
- * set not NULL by the OOM killer only if victim->mm
- * is found not NULL while holding the task_lock.
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
- down_write(&mm->mmap_sem);
- up_write(&mm->mmap_sem);
- }
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
@@ -3060,6 +3054,12 @@ void exit_mmap(struct mm_struct *mm)
vma = remove_vma(vma);
}
vm_unacct_memory(nr_accounted);
+
+out_unlock:
+ if (unlikely(locked)) {
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ up_write(&mm->mmap_sem);
+ }
}
/* Insert vm structure into process list sorted by address
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dfd370526909..94d26ef2c3c7 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,7 +524,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
* work on the mm anymore. The check for MMF_OOM_SKIP must run
* under mmap_sem for reading because it serializes against the
- * down_write();up_write() cycle in exit_mmap().
+ * exit_mmap().
*/
if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
up_read(&mm->mmap_sem);
--
2.16.3
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists