linux-kernel - Re: [PATCH] copy over oom

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.00.0907171559290.30600@chino.kir.corp.google.com>
Date:	Fri, 17 Jul 2009 16:12:48 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Paul Menage <menage@...gle.com>
cc:	Rik van Riel <riel@...hat.com>, linux-kernel@...r.kernel.org,
	akpm@...ux-foundation.org, mel@....ul.ie, npiggin@...e.de
Subject: Re: [PATCH] copy over oom_adj value at fork time

On Fri, 17 Jul 2009, Paul Menage wrote:

> > The only way to workaround that is by using the highest oom_adj user for
> > the mm_struct from the array in reporting /proc/pid/oom_score, as well.
> 
> That sounds fine to me.
> 

It's better than what we did before my patchset, which would be report an 
erroneous oom_score if another thread attached to the same mm_struct had a 
less immune (more positive) oom_adj score.

With my patches in 2.6.31-rc3, there is no ambiguity and /proc/pid/oom_adj 
and /proc/pid/oom_score are consistent for all threads.  Because of that, 
they now accurately reflect in all circumstances how the oom killer 
chooses a victim, which should be very important to users who are tuning 
oom_adj values in the first place.

> > But that would lead to /proc/pid/oom_adj not affecting oom_score at all,
> > which isn't consistent.
> 
> Isn't consistent with what? It's perfectly consistent with saying "the
> oom_score of a task is based on the highest oom_adj value of any task
> sharing the same mm". Admittedly it's not 100% consistent with the old
> semantics, but I'm having trouble imagining a scenario where someone
> was relying on the changed semantics.
> 

I'm sure that most users who tune /proc/pid/oom_adj for a group of tasks, 
especially when tuning them in terms of killing priority in oom conditions 
as opposed to simply using OOM_DISABLE, rely on /proc/pid/oom_score to 
determine when a task should be killed because it is using an egregious 
amount of memory.  That's why oom_score is exported to userspace; 
otherwise, tuning oom_adj for anything other than OOM_DISABLE would be 
difficult.

> But taking a completely different approach, is there a reason that we
> couldn't have just moved the do_each_thread() check for OOM_DISABLED
> out of oom_kill_task() and into select_bad_process() at the point
> where we've decided that the thread in question is a better victim
> than the current victim? That would fix the OOM livelock while
> allowing us to keep exactly the same oom_adj/oom_score semantics as in
> previous kernels.
> 

My patches don't merely address threads that have an oom_adj value of 
OOM_DISABLE while others sharing the same memory do not, they address a 
consistency issue whereas those threads may all share memory but have 
radically different oom_adj values: that means that only the highest 
oom_adj value amongst them is actually used by the oom killer and all 
other oom_adj values set by the user for those threads are wrong.

> > The inheritance issue should be fixed with Rik's patch with the exception
> > of vfork -> change /proc/pid-of-child/oom_adj -> execve.  If scripts were
> > written to do that with the old behavior, they'll have to adjust to change
> > oom_adj _after_ the execve
> 
> Think about what you're suggesting here - execve() replaces your code
> with whatever you're execing, so unless that code is also written to
> handle oom_adj (which for something like a generic job scheduler, the
> exec'd code is unlikely to do) you're stuck.
> 

The burden to adjust /proc/pid/oom_adj has always been on the admin, the 
only thing the kernel needs to provide is an inheritance property (as 
added with Rik's patch) to have any consistency with a priority based 
killing approach, i.e. forked children should have the same oom_adj value 
to begin with to not be more immune or a higher priority to kill than the 
parent unless explicitly changed by the admin.