Message-ID: <20171002124712.GA17638@castle.DHCP.thefacebook.com>
Date:   Mon, 2 Oct 2017 13:47:12 +0100
From:   Roman Gushchin <guro@...com>
To:     Michal Hocko <mhocko@...nel.org>
CC:     Shakeel Butt <shakeelb@...gle.com>,
        Tim Hockin <thockin@...kin.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Tejun Heo <tj@...nel.org>, <kernel-team@...com>,
        David Rientjes <rientjes@...gle.com>,
        Linux MM <linux-mm@...ck.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Cgroups <cgroups@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [v8 0/4] cgroup-aware OOM killer

On Mon, Oct 02, 2017 at 02:24:34PM +0200, Michal Hocko wrote:
> On Sun 01-10-17 16:29:48, Shakeel Butt wrote:
> > >
> > > Going back to Michal's example, say the user configured the following:
> > >
> > >        root
> > >       /    \
> > >      A      D
> > >     / \
> > >    B   C
> > >
> > > A global OOM event happens and we find this:
> > > - A > D
> > > - B, C, D are oomgroups
> > >
> > > What the user is telling us is that B, C, and D are compound memory
> > > consumers. They cannot be divided into their task parts from a memory
> > > point of view.
> > >
> > > However, the user doesn't say the same for A: the A subtree summarizes
> > > and controls aggregate consumption of B and C, but without groupoom
> > > set on A, the user says that A is in fact divisible into independent
> > > memory consumers B and C.
> > >
> > > If we don't have to kill all of A, but we'd have to kill all of D,
> > > does it make sense to compare the two?
> > >
> > 
> > I think Tim has given a very clear explanation of why comparing A & D
> > makes perfect sense. However, I think the above example, a single-user
> > system where a user has designed and created the whole hierarchy and
> > then attaches different jobs/applications to different nodes in this
> > hierarchy, is also a valid scenario.
> 
> Yes and nobody is disputing that, really. I guess the main disconnect
> here is that different people want to have more detailed control over
> the victim selection, while the patchset tries to handle the simplest
> scenario, where no userspace control over the selection is required.
> And I would claim that this covers the vast majority of setups, and we
> should address it first.
> 
> A more fine-grained control needs some more thinking to come up with a
> sensible and long-term sustainable API. Just look back at the
> oom_score_adj story and how it ended up unusable in the end (well,
> apart from the never/always-kill corner cases). Let's not repeat that
> now.
> 
> I strongly believe that we can come up with something - be it priority
> based, BPF based, or module based selection. But let's start with the
> most basic scenario first, with the most sensible semantics implemented.

Totally agree.
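
In cgroup v2 terms, the configuration discussed above would look
something like this (a sketch; the names come from the quoted example,
the paths are made up):

    mkdir -p /sys/fs/cgroup/A/B /sys/fs/cgroup/A/C /sys/fs/cgroup/D
    echo 1 > /sys/fs/cgroup/A/B/memory.oom_group
    echo 1 > /sys/fs/cgroup/A/C/memory.oom_group
    echo 1 > /sys/fs/cgroup/D/memory.oom_group
    # A itself keeps memory.oom_group == 0: the user declares that A
    # is divisible into independent consumers B and C, while B, C,
    # and D are each indivisible units.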

> I believe the latest version (v9) looks sensible from the semantic point
> of view and we should focus on making it into a mergeable shape.

The only thing is that, after some additional thinking, I no longer
believe that implicit propagation of oom_group is a good idea.

Let me explain: assume we have memcg A with memory.max and memory.oom_group
set, and a nested memcg A/B with memory.max set. Now imagine we have an
OOM event in A/B. What is the expected system behavior?
The OOM is scoped to A/B, so any action should also be scoped to A/B:
we really shouldn't touch processes which do not belong to A/B.
That means we should either kill the biggest process in A/B, or kill
all processes in A/B. It's natural to make A/B/memory.oom_group
responsible for this decision; it's strange to make it depend on
A/memory.oom_group, IMO. That really makes no sense, and it makes the
oom_group knob really hard to describe.
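
For illustration, here is that scenario as a minimal cgroup v2 sketch
(the paths and limit values are made up; memory.oom_group is the knob
this patchset introduces):

    # memcg A: limit set, oom_group set
    mkdir /sys/fs/cgroup/A
    echo 1G > /sys/fs/cgroup/A/memory.max
    echo 1 > /sys/fs/cgroup/A/memory.oom_group

    # nested memcg A/B: its own, smaller limit, oom_group left at 0
    mkdir /sys/fs/cgroup/A/B
    echo 512M > /sys/fs/cgroup/A/B/memory.max

    # If A/B hits its 512M limit, the OOM is scoped to A/B:
    #   A/B/memory.oom_group == 0 -> kill the biggest task in A/B;
    #   A/B/memory.oom_group == 1 -> kill every task in A/B.
    # A/memory.oom_group should play no role in this A/B-scoped event.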

Also, after some off-list discussion, we've realized that the
memory.oom_group knob should be delegatable. The workload should have
control over it to express the dependency between its processes.
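
Concretely, delegation could follow the usual ownership-based cgroup v2
convention; a sketch (the user name and paths are hypothetical):

    # hand the A/B subtree to the workload's user, including the
    # oom_group knob, so the job itself can declare that its
    # processes form a single indivisible memory consumer
    chown app-user /sys/fs/cgroup/A/B
    chown app-user /sys/fs/cgroup/A/B/cgroup.procs
    chown app-user /sys/fs/cgroup/A/B/memory.oom_group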

Thanks!
