linux-kernel - Re: [v8 0/4] cgroup-aware OOM killer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.10.1709221340280.68140@chino.kir.corp.google.com>
Date:   Fri, 22 Sep 2017 13:53:47 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Johannes Weiner <hannes@...xchg.org>
cc:     Roman Gushchin <guro@...com>, linux-mm@...ck.org,
        Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tejun Heo <tj@...nel.org>, kernel-team@...com,
        cgroups@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [v8 0/4] cgroup-aware OOM killer

On Thu, 21 Sep 2017, Johannes Weiner wrote:

> > The issue is that if you opt-in to the new feature, then you are forced to 
> > change /proc/pid/oom_score_adj of all processes attached to a cgroup that 
> > you do not want oom killed based on size to be oom disabled.
> 
> You're assuming that most people would want to influence the oom
> behavior in the first place. I think the opposite is the case: most
> people don't care as long as the OOM killer takes the intent the user
> has expressed wrt runtime containerization/grouping into account.
> 

If you do not want to influence the oom behavior, do not change 
memory.oom_priority from its default.  It's that simple.

> > The kernel provides no other remedy without oom priorities since the
> > new feature would otherwise disregard oom_score_adj.
> 
> As of v8, it respects this setting and doesn't kill min score tasks.
> 

That's the issue.  To protect a memory cgroup from being oom killed in a 
system oom condition, you need to change oom_score_adj of *all* processes 
attached to be oom disabled.  Then, you have a huge problem in memory 
cgroup oom conditions because nothing can be killed in that hierarchy 
itself.

> > The patchset compares memory cgroup size relative to sibling cgroups only, 
> > the same comparison for memory.oom_priority.  There is a guarantee 
> > provided on how cgroup size is compared in select_victim_memcg(), it 
> > hierarchically accumulates the "size" from leaf nodes up to the root memcg 
> > and then iterates the tree comparing sizes between sibling cgroups to 
> > choose a victim memcg.  That algorithm could be more elaborately described 
> > in the documentation, but we simply cannot change the implementation of 
> > select_victim_memcg() later even without oom priorities since users cannot 
> > get inconsistent results after opting into a feature between kernel 
> > versions.  I believe the selection criteria should be implemented to be 
> > deterministic, as select_victim_memcg() does, and the documentation should 
> > fully describe what the selection criteria is, and then allow the user to 
> > decide.
> 
> I wholeheartedly disagree. We have changed the behavior multiple times
> in the past. In fact, you have arguably done the most drastic changes
> to the algorithm since the OOM killer was first introduced. E.g.
> 
> 	a63d83f427fb oom: badness heuristic rewrite
> 
> And that's completely fine. Because this thing is not a resource
> management tool for userspace, it's the kernel saving itself. At best
> in a manner that's not too surprising to userspace.
> 

When I did that, I had to add /proc/pid/oom_score_adj to allow userspace 
to influence selection.  We came up with /proc/pid/oom_score_adj when 
working with kde, openssh, chromium, and udev because they cared about the 
ability to influence the decisionmaking.  I'm perfectly happy with the new 
heuristic presented in this patchset, I simply want userspace to be able 
to influence it, if it desires.  Requiring userspace to set all processes 
to be oom disabled to protect a hierarchy is totally and completely 
broken.  It livelocks the memory cgroup if it is oom itself.

> To me, your argument behind the NAK still boils down to "this doesn't
> support my highly specialized usecase." But since it doesn't prohibit
> your usecase - which isn't even supported upstream, btw - this really
> doesn't carry much weight.
> 
> I'd say if you want configurability on top of Roman's code, please
> submit patches and push the case for these in a separate effort.
> 

Roman implemented memory.oom_priority himself, it has my Tested-by, and it 
allows users who want to protect high priority memory cgroups from using 
the size based comparison for all other cgroups that we very much desire.