linux-kernel - Re: can't oom-kill zap the victim's memory?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.10.1509241359100.32488@chino.kir.corp.google.com>
Date:	Thu, 24 Sep 2015 14:15:34 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...nel.org>
cc:	Oleg Nesterov <oleg@...hat.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Kyle Walker <kwalker@...hat.com>,
	Christoph Lameter <cl@...ux.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Vladimir Davydov <vdavydov@...allels.com>,
	linux-mm <linux-mm@...ck.org>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	Stanislav Kozina <skozina@...hat.com>,
	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Subject: Re: can't oom-kill zap the victim's memory?

On Wed, 23 Sep 2015, Michal Hocko wrote:

> I am still not sure how you want to implement that kernel thread but I
> am quite skeptical it would be very much useful because all the current
> allocations which end up in the OOM killer path cannot simply back off
> and drop the locks with the current allocator semantic.  So they will
> be sitting on top of unknown pile of locks whether you do an additional
> reclaim (unmap the anon memory) in the direct OOM context or looping
> in the allocator and waiting for kthread/workqueue to do its work. The
> only argument that I can see is the stack usage but I haven't seen stack
> overflows in the OOM path AFAIR.
> 

Which locks are you specifically interested in?  We have already discussed 
the usefulness of killing all threads on the system sharing the same ->mm, 
meaning all threads that are either holding or want to hold mm->mmap_sem 
will be able to allocate into memory reserves.  Any allocator holding 
down_write(&mm->mmap_sem) should be able to allocate and drop its lock.  
(Are you concerned about MAP_POPULATE?)

> > Finally. Whatever we do, we need to change oom_kill_process() first,
> > and I think we should do this regardless. The "Kill all user processes
> > sharing victim->mm" logic looks wrong and suboptimal/overcomplicated.
> > I'll try to make some patches tomorrow if I have time...
> 
> That would be appreciated. I do not like that part either. At least we
> shouldn't go over the whole list when we have a good chance that the mm
> is not shared with other processes.
> 

Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem, 
it's the reason the code exists.  Any optimizations to that is certainly 
welcome, but we definitely need to send SIGKILL to all threads sharing the 
mm to make forward progress, otherwise we are going back to pre-2008 
livelocks.

> Yes I am not really sure why oom_score_adj is not per-mm and we are
> doing that per signal struct to be honest. It doesn't make much sense as
> the mm_struct is the primary source of information for the oom victim
> selection. And the fact that mm might be shared withtout sharing signals
> make it double the reason to have it in mm.
> 
> It seems David has already tried that 2ff05b2b4eac ("oom: move oom_adj
> value from task_struct to mm_struct") but it was later reverted by
> 0753ba01e126 ("mm: revert "oom: move oom_adj value""). I do not agree
> with the reasoning there because vfork is documented to have undefined
> behavior
> "
>        if the process created by vfork() either modifies any data other
>        than a variable of type pid_t used to store the return value
>        from vfork(), or returns from the function in which vfork() was
>        called, or calls any other function before successfully calling
>        _exit(2) or one of the exec(3) family of functions.
> "
> Maybe we can revisit this... It would make the whole semantic much more
> straightforward. The current situation when you kill a task which might
> share the mm with OOM unkillable task is clearly suboptimal and
> confusing.
> 

How do you reconcile this with commit 28b83c5193e7 ("oom: move oom_adj 
value from task_struct to signal_struct")?  We also must appreciate the 
real-world usecase for an oom disabled process doing fork(), setting 
/proc/child/oom_score_adj to non-disabled, and exec().
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/