linux-kernel - Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.21.1804191214130.157851@chino.kir.corp.google.com>
Date:   Thu, 19 Apr 2018 12:34:53 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Michal Hocko <mhocko@...nel.org>
cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Andrea Arcangeli <aarcange@...hat.com>,
        Roman Gushchin <guro@...com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper
 unmap

On Thu, 19 Apr 2018, Michal Hocko wrote:

> > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is 
> > entered.
> 
> Not true. munlock_vma_pages_all might take page_lock which can have
> unpredictable dependences. This is the reason why we are ruling out
> mlocked VMAs in the first place when reaping the address space.
> 

I don't find any occurrences in millions of oom kills in real-world 
scenarios where this matters.  The solution is certainly not to hold 
down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead.  If 
exit_mmap() is not making forward progress then that's a separate issue; 
that would need to be fixed in one of two ways: (1) in oom_reap_task() to 
try over a longer duration before setting MMF_OOM_SKIP itself, but that 
would have to be a long duration to allow a large unmap and page table 
free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but 
only if MMF_UNSTABLE has been set for a long period of time so we target 
another process when the oom killer has given up.

Either of those two fixes are simple to implement, I'd just like to see a 
bug report with stack traces to indicate that a victim getting stalled in 
exit_mmap() is a problem to justify the patch.

I'm trying to fix the page table corruption that is trivial to trigger on 
powerpc.  We simply cannot allow the oom reaper's unmap_page_range() to 
race with munlock_vma_pages_range(), ever.  Holding down_write on 
mm->mmap_sem otherwise needlessly over a large amount of code is riskier 
(hasn't been done or tested here), more error prone (any code change over 
this large area of code or in functions it calls are unnecessarily 
burdened by unnecessary locking), makes exit_mmap() less extensible for 
the same reason, and causes the oom reaper to give up and go set 
MMF_OOM_SKIP itself because it depends on taking down_read while the 
thread is still exiting.

> On the
> other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> mentioned above and that really worries me. The primary objective of the
> reaper is to guarantee a forward progress without relying on any
> externalities. We might kill another OOM victim but that is safer than
> lock up.
> 

I understand the concern, but it's the difference between the victim 
getting stuck in exit_mmap() and actually taking a long time to free its 
memory in exit_mmap().  I don't have evidence of the former.  If there are 
bug reports for occurrences of the oom reaper being unable to reap, it 
would be helpful to see.  The only reports about the "unable to reap" 
message was that the message itself was racy, not that a thread got stuck.  
This is more reason to not take down_write unnecessarily in the 
exit_mmap() path, because it influences an oom reaper heurstic.

> The current protocol has proven to be error prone so I really believe we
> should back off and turn it into something much simpler and build on top
> of that if needed.
> 
> So do you see any _technical_ reasons why not do [1] and have a simpler
> protocol easily backportable to stable trees?

It's not simpler per the above, and I agree with Andrea's assessment when 
this was originally implemented.  The current method is not error prone, 
it works, it just wasn't protecting enough of exit_mmap().  That's not a 
critcism of the method itself, it's a bugfix that expands its critical 
section.