linux-kernel - Re: can't oom-kill zap the victim's memory?

Message-ID: <alpine.DEB.2.10.1509281512330.13657@chino.kir.corp.google.com>
Date:	Mon, 28 Sep 2015 15:24:06 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...nel.org>
cc:	Oleg Nesterov <oleg@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Kyle Walker <kwalker@...hat.com>,
	Christoph Lameter <cl@...ux.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vladimir Davydov <vdavydov@...allels.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Stanislav Kozina <skozina@...hat.com>,
	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Subject: Re: can't oom-kill zap the victim's memory?

On Fri, 25 Sep 2015, Michal Hocko wrote:

> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very much useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic.  So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> > > 
> > 
> > Which locks are you specifically interested in?
> 
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.
> 

I agree that i_mutex seems to be one of the most common offenders.  
However, I'm not sure I understand why holding it while looping 
indefinitely on an order-0 allocation is problematic wrt the proposed 
kthread.  The kthread itself need only take mmap_sem for read.  If all 
threads sharing the mm with a victim have been SIGKILL'd, they should get 
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
deplete memory reserves without properly checking for 
fatal_signal_pending().
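
As an illustration of the last point, here is a minimal userspace sketch, 
not kernel code.  The names model_task, model_zone, model_alloc_page() and 
model_populate() are hypothetical stand-ins invented for this example; only 
the idea they model (TIF_MEMDIE granting access to reserves, and a 
fatal_signal_pending() check bounding a loop run under a lock) comes from 
the discussion above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
struct model_task {
	bool fatal_signal_pending;	/* models fatal_signal_pending(current) */
	bool memdie;			/* models TIF_MEMDIE */
};

struct model_zone {
	long free_pages;
	long min_watermark;		/* pages below this count as reserves */
};

/*
 * Order-0 allocation: an ordinary task stops at the watermark, while a
 * task with memdie set may allocate below it until nothing is left.
 */
static bool model_alloc_page(struct model_zone *z, const struct model_task *t)
{
	long floor = t->memdie ? 0 : z->min_watermark;

	assert(z->free_pages >= 0);
	if (z->free_pages <= floor)
		return false;
	z->free_pages--;
	return true;
}

/*
 * A long loop run under a lock (think of populating a mapping while
 * holding mmap_sem).  Returns the number of pages allocated.  With
 * check_signal set, a killed task stops at once so that its caller can
 * drop the lock and exit instead of draining the reserves.
 */
static long model_populate(struct model_zone *z, const struct model_task *t,
			   long nr_pages, bool check_signal)
{
	long done = 0;

	while (done < nr_pages) {
		if (check_signal && t->fatal_signal_pending)
			break;
		if (!model_alloc_page(z, t))
			break;
		done++;
	}
	return done;
}
```

In this model a killed task that skips the check consumes every reserved 
page, while one that performs the check consumes none and can release its 
lock right away.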

> > We have already discussed 
> > the usefulness of killing all threads on the system sharing the same ->mm, 
> > meaning all threads that are either holding or want to hold mm->mmap_sem 
> > will be able to allocate into memory reserves.  Any allocator holding 
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
> > (Are you concerned about MAP_POPULATE?)
> 
> I am not sure I understand. We would have to fail the request in order for
> the context which requested the memory to drop the lock. Are we
> talking about the same thing here?
> 

The request need not fail: they should be able to allocate from memory 
reserves when TIF_MEMDIE gets set.  This would require that threads in all 
gfp contexts are able to get TIF_MEMDIE set without an explicit call to 
out_of_memory() for !__GFP_FS.
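
A rough userspace sketch of that requirement follows.  It is a model of the 
proposal as described here, not the allocator's actual slow path: 
SKETCH_GFP_FS, struct sketch_task, enum sketch_outcome and 
sketch_reclaim_failed() are all hypothetical names made up for the example.

```c
#include <stdbool.h>

/* Hypothetical flag bit standing in for __GFP_FS. */
#define SKETCH_GFP_FS 0x1u

/* Hypothetical stand-in for the two pieces of task state involved. */
struct sketch_task {
	bool fatal_signal_pending;	/* models fatal_signal_pending(current) */
	bool memdie;			/* models TIF_MEMDIE */
};

enum sketch_outcome {
	SKETCH_RETRY,		/* keep looping in the allocator */
	SKETCH_USE_RESERVES,	/* allocate from memory reserves */
	SKETCH_INVOKE_OOM,	/* go through the oom killer */
};

/*
 * Decision taken once reclaim has failed.  The signal check comes before
 * the gfp check, so a killed thread gains reserve access even from a
 * context without the FS bit, where the oom killer would not be invoked.
 */
static enum sketch_outcome sketch_reclaim_failed(struct sketch_task *t,
						 unsigned int gfp_mask)
{
	if (t->fatal_signal_pending) {
		t->memdie = true;
		return SKETCH_USE_RESERVES;
	}
	if (!(gfp_mask & SKETCH_GFP_FS))
		return SKETCH_RETRY;
	return SKETCH_INVOKE_OOM;
}
```

The ordering of the two checks is the whole point of the sketch: swapping 
them would leave a killed thread in a !FS context retrying forever while 
holding its locks.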

> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
> > it's the reason the code exists.  Any optimizations to that are certainly 
> > welcome, but we definitely need to send SIGKILL to all threads sharing the 
> > mm to make forward progress, otherwise we are going back to pre-2008 
> > livelocks.
> 
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but it still might take quite some time unnecessarily.
>  

It must handle the case you describe, killing other processes that share 
the ->mm; otherwise we have an mm->mmap_sem livelock.  We are not concerned 
about iterating over all task_structs in the oom killer as a pain point: 
users who are should already be using oom_kill_allocating_task, which is 
why it was introduced.
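
For reference, the scan under discussion can be modelled in userspace as 
below.  This is a simplified sketch under stated assumptions, not the oom 
killer's code: struct toy_mm, struct toy_proc and toy_kill_mm_sharers() are 
hypothetical, and the array walk stands in for iterating over every 
task_struct on the system.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and task_struct. */
struct toy_mm {
	int id;
};

struct toy_proc {
	int pid;
	int tgid;			/* thread group id */
	const struct toy_mm *mm;	/* NULL models a kernel thread */
	bool sigkill_sent;
};

/*
 * Walk every process and mark SIGKILL as sent for those that share the
 * victim's mm but live in another thread group (CLONE_VM without
 * CLONE_THREAD).  Threads in the victim's own group are skipped because
 * the SIGKILL sent to the victim already covers them.  Returns the number
 * of additional processes killed.
 */
static size_t toy_kill_mm_sharers(struct toy_proc *procs, size_t n,
				  const struct toy_proc *victim)
{
	size_t killed = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_proc *p = &procs[i];

		if (p->mm == NULL || p->mm != victim->mm)
			continue;
		if (p->tgid == victim->tgid)
			continue;
		p->sigkill_sent = true;
		killed++;
	}
	return killed;
}
```

The cost is linear in the number of processes for each invocation, which is 
the overhead being weighed against the livelock.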