linux-kernel - Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <201706161927.EII04611.VOFFMLJOOFHQSt@I-love.SAKURA.ne.jp>
Date:   Fri, 16 Jun 2017 19:27:19 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     mhocko@...nel.org
Cc:     rientjes@...gle.com, akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: Re: [patch] mm, oom: prevent additional oom kills before memory is freed

Michal Hocko wrote:
> On Fri 16-06-17 09:54:34, Tetsuo Handa wrote:
> [...]
> > And the patch you proposed is broken.
> 
> Thanks for your testing!
>  
> > ----------
> > [  161.846202] Out of memory: Kill process 6331 (a.out) score 999 or sacrifice child
> > [  161.850327] Killed process 6331 (a.out) total-vm:4172kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
> > [  161.858503] ------------[ cut here ]------------
> > [  161.861512] kernel BUG at mm/memory.c:1381!
> 
> BUG_ON(addr >= end) suggests our vma has trimmed. I guess I see what is
> going on here.
> __oom_reap_task_mm				exit_mmap
> 						  free_pgtables
> 						  up_write(mm->mmap_sem)
>   down_read_trylock(&mm->mmap_sem)
>   						  remove_vma
>     unmap_page_range
> 
> So we need to extend the mmap_sem coverage. See the updated diff (not
> the full proper patch yet).

That diff is still wrong. We need to prevent __oom_reap_task_mm() from calling
unmap_page_range() when __mmput() already called exit_mm(), by setting/checking
MMF_OOM_SKIP like shown below.

diff --git a/kernel/fork.c b/kernel/fork.c
index e53770d..5ef715c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -902,6 +902,11 @@ static inline void __mmput(struct mm_struct *mm)
 	exit_aio(mm);
 	ksm_exit(mm);
 	khugepaged_exit(mm); /* must run before exit_mmap */
+	/*
+	 * oom reaper might race with exit_mmap so make sure we won't free
+	 * page tables under its feet
+	 */
+	down_write(&mm->mmap_sem);
 	exit_mmap(mm);
 	mm_put_huge_zero_page(mm);
 	set_mm_exe_file(mm, NULL);
@@ -913,6 +918,7 @@ static inline void __mmput(struct mm_struct *mm)
 	if (mm->binfmt)
 		module_put(mm->binfmt->module);
 	set_bit(MMF_OOM_SKIP, &mm->flags);
+	up_write(&mm->mmap_sem);
 	mmdrop(mm);
 }
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 04c9143..98cca19 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -493,12 +493,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 		goto unlock_oom;
 	}
 
-	/*
-	 * increase mm_users only after we know we will reap something so
-	 * that the mmput_async is called only when we have reaped something
-	 * and delayed __mmput doesn't matter that much
-	 */
-	if (!mmget_not_zero(mm)) {
+	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		goto unlock_oom;
 	}
@@ -537,13 +532,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			K(get_mm_counter(mm, MM_FILEPAGES)),
 			K(get_mm_counter(mm, MM_SHMEMPAGES)));
 	up_read(&mm->mmap_sem);
-
-	/*
-	 * Drop our reference but make sure the mmput slow path is called from a
-	 * different context because we shouldn't risk we get stuck there and
-	 * put the oom_reaper out of the way.
-	 */
-	mmput_async(mm);
 unlock_oom:
 	mutex_unlock(&oom_lock);
 	return ret;



> 
> > Please carefully consider the reason why there is VM_BUG_ON() in __mmput(),
> > and clarify in your patch that what are possible side effects of racing
> > uprobe_clear_state()/exit_aio()/ksm_exit()/exit_mmap() etc. with
> > __oom_reap_task_mm()
> 
> Yes that definitely needs to be checked. We basically rely on the racing
> part of the __mmput to not modify the address space. oom_reaper doesn't
> touch any vma state except it unmaps pages which can run in parallel.
> exit_aio->kill_ioctx seemingly does vm_munmap but it a) uses the
> mmap_sem for write and b) it doesn't actually unmap because exit_aio
> does ctx->mmap_size = 0. {ksm,khugepaged}_exit just do some houskeeping
> which is not modifying the address space. I hope I will find some more
> time to work on this next week. Additional test would be highly
> appreciated of course.

Since the OOM reaper does not reap hugepages, khugepaged_exit() part could be
safe. But ksm_exit() part might interfere. If it is guaranteed to be safe,
what will go wrong if we move uprobe_clear_state()/exit_aio()/ksm_exit() etc.
to just before mmdrop() (i.e. after setting MMF_OOM_SKIP) ?