Date:	Tue, 2 Feb 2016 14:51:41 -0800 (PST)
From:	David Rientjes <rientjes@...gle.com>
To:	Michal Hocko <mhocko@...nel.org>
cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Mel Gorman <mgorman@...e.de>,
	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Oleg Nesterov <oleg@...hat.com>,
	Hugh Dickins <hughd@...gle.com>,
	Andrea Arcangeli <andrea@...nel.org>,
	Rik van Riel <riel@...hat.com>, linux-mm@...ck.org,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] mm, oom: introduce oom reaper

On Tue, 2 Feb 2016, Michal Hocko wrote:

> > Not exclude them, but I would have expected untrack_pfn().
> 
> My understanding is that vm_normal_page will do the right thing for
> those mappings - especially for CoW VM_PFNMAP, which are normal pages
> AFAIU. Wrt. untrack_pfn, I was relying on the victim eventually
> entering exit_mmap and doing the remaining housekeeping. Maybe I am
> missing something, but untrack_pfn shouldn't lead to releasing a
> considerable amount of memory. So is this really necessary, or can we
> wait for exit_mmap?
> 

I think if you move the code to mm/memory.c, you may find a greater 
opportunity to share code with the implementations there, and this will 
take care of itself :)  I'm also concerned about this from a 
maintainability standpoint: a future patch might modify one 
implementation while forgetting about the other.  I think there's a great 
opportunity here for a really clean and shiny interface that doesn't 
introduce any more complexity.
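
Just to illustrate what I mean, here's a rough sketch of a reaper loop 
sitting next to the zap implementations in mm/memory.c.  The function 
name and the exact VMA filtering are my guesses, not the patch, and it 
assumes unmap_page_range() (the existing low-level helper there) is 
visible to this code:

#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <asm/tlb.h>

/* Hypothetical: reap the private anonymous memory of a dying victim by
 * reusing unmap_page_range() instead of a parallel implementation. */
static void reap_vmas_sketch(struct mm_struct *mm)
{
	struct mmu_gather tlb;
	struct vm_area_struct *vma;

	tlb_gather_mmu(&tlb, mm, 0, -1);
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/* Skip mappings the reaper cannot safely tear down. */
		if (is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_LOCKED))
			continue;
		/* Private anonymous memory is what is worth reclaiming. */
		if (vma_is_anonymous(vma))
			unmap_page_range(&tlb, vma, vma->vm_start,
					 vma->vm_end, NULL);
	}
	tlb_finish_mmu(&tlb, 0, -1);
}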

> > The problem is that this is racy and quite easy to trigger: imagine if 
> > __oom_reap_vmas() finds mm->mm_users == 0, because the memory of the 
> > victim has been freed, and then another system-wide oom condition occurs 
> > before the oom reaper's mm_to_reap has been set to NULL.
> 
> Yes, I realize this is potentially racy. I just didn't consider the
> race important enough to justify task queueing, and Tetsuo was already
> pushing for it while I tried to push back for simplicity in the first
> submission. But oh well... I will queue up a patch to do this on top.
> I plan to repost the full patchset shortly.
> 

Ok, thanks!  It should probably be dropped from -mm in the interim until 
it has some acked-by's, but I think those will come pretty quickly once 
it's refreshed, provided all of this is handled.
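
For reference, the queueing could be as simple as a list the reaper 
kthread drains.  This is only a sketch: the reap_node member is invented 
here for illustration and is not an existing mm_struct field.

#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/mm_types.h>

static LIST_HEAD(oom_reap_list);
static DEFINE_SPINLOCK(oom_reap_lock);
static DECLARE_WAIT_QUEUE_HEAD(oom_reap_wait);

/* Hypothetical sketch: chain victim mm's instead of keeping a single
 * mm_to_reap slot, so a second oom kill cannot be ignored.  The caller
 * is assumed to hold a reference on mm (taken while the victim was
 * still alive) that the reaper drops once it is done. */
static void queue_mm_for_reaping(struct mm_struct *mm)
{
	spin_lock(&oom_reap_lock);
	list_add_tail(&mm->reap_node, &oom_reap_list);
	spin_unlock(&oom_reap_lock);
	wake_up(&oom_reap_wait);
}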

> > In this case, the oom reaper ignores the next victim and doesn't do 
> > anything; the simple race has prevented it from zapping memory and 
> > has not reduced the livelock probability.
> > 
> > This can be solved either by queueing mm's to reap or by involving 
> > the oom reaper in the oom killer synchronization itself.
> 
> As we have already discussed, the oom reaper is really tricky to call
> from the direct OOM context. I will go with queueing.
>  

Hmm, I wasn't referring to the oom context: this would be possible 
without queueing by taking an mm_to_reap_lock (or using cmpxchg) both in 
the oom reaper and where the final mmput() is done.  Set mm_to_reap when 
the mm is ready for reaping, clear it when the mm is being destroyed, and 
test it before calling the oom killer.  We'd want to defer the oom killer 
until potential reaping could be done anyway, and I don't anticipate an 
issue where oom_reaper fails to schedule.
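
Spelled out, the cmpxchg variant would look something like this (a 
sketch only; the helper names are illustrative, built around the 
mm_to_reap pointer from the patch):

#include <linux/atomic.h>
#include <linux/compiler.h>
#include <linux/mm_types.h>

static struct mm_struct *mm_to_reap;

/* OOM killer: publish the victim's mm once it is ready for reaping. */
static bool set_mm_to_reap(struct mm_struct *mm)
{
	return cmpxchg(&mm_to_reap, NULL, mm) == NULL;
}

/* Final mmput() path: withdraw the mm as it is being destroyed. */
static void clear_mm_to_reap(struct mm_struct *mm)
{
	cmpxchg(&mm_to_reap, mm, NULL);
}

/* Before selecting a new victim: defer while reaping is pending. */
static bool oom_reaping_pending(void)
{
	return READ_ONCE(mm_to_reap) != NULL;
}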

> > I'm baffled by any reference to "memcg oom heavy loads"; I don't 
> > understand this paragraph, sorry.  If a memcg is oom, we shouldn't be 
> > disrupting the global runqueue by running oom_reaper at a high 
> > priority.  The disruption is not only in the first wakeup but also in 
> > how long the reaper can run and when it is rescheduled: for a lot of 
> > memory this is potentially long.  The reaper is best-effort, as the 
> > changelog indicates, and we shouldn't rely on this high priority: an 
> > oom-killed process exiting can't possibly be expected to be immediate.  
> > This high priority should be removed so memcg oom conditions are 
> > isolated and don't affect other loads.
> 
> If this is a concern then I would be tempted to simply disable the oom
> reaper for memcg oom altogether. For me it is much more important that
> the reaper, even though it is best effort, is guaranteed to schedule if
> something goes terribly wrong on the machine.
> 

I don't believe the higher priority guarantees it is able to schedule any 
more than it was guaranteed to schedule before.  It will run, but it won't 
preempt other innocent processes in disjoint memcgs or cpusets.  This is 
not only a memcg issue: it also impacts disjoint cpuset mems and mempolicy 
nodemasks, and it would be disappointing to leave those out.  In the 
interest of fairness, the higher priority should simply be removed.
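
Concretely, removing it could mean nothing more than starting the reaper 
as an ordinary kthread and dropping any sched_setscheduler() call on it.  
A sketch, where oom_reaper() stands in for the kthread main loop from 
the patch:

#include <linux/kthread.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/printk.h>

static struct task_struct *oom_reaper_th;

static int __init oom_reaper_init(void)
{
	/* Default scheduling policy and priority: the reaper still runs,
	 * but never preempts tasks in disjoint memcgs, cpusets, or
	 * mempolicy nodemasks. */
	oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
	if (IS_ERR(oom_reaper_th)) {
		pr_err("Unable to start OOM reaper\n");
		oom_reaper_th = NULL;
	}
	return 0;
}
subsys_initcall(oom_reaper_init);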

Other than these issues, I don't see any reason why a refreshed series 
wouldn't be immediately acked.  Thanks very much for continuing to work on 
this!
