linux-kernel - Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20171031184411.GA641@cmpxchg.org>
Date:   Tue, 31 Oct 2017 14:44:11 -0400
From:   Johannes Weiner <hannes@...xchg.org>
To:     Shakeel Butt <shakeelb@...gle.com>
Cc:     Roman Gushchin <guro@...com>, Linux MM <linux-mm@...ck.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        David Rientjes <rientjes@...gle.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Tejun Heo <tj@...nel.org>, kernel-team@...com,
        Cgroups <cgroups@...r.kernel.org>, linux-doc@...r.kernel.org,
        LKML <linux-kernel@...r.kernel.org>
Subject: Re: [RESEND v12 3/6] mm, oom: cgroup-aware OOM killer

On Tue, Oct 31, 2017 at 10:50:43AM -0700, Shakeel Butt wrote:
> On Tue, Oct 31, 2017 at 9:40 AM, Johannes Weiner <hannes@...xchg.org> wrote:
> > On Tue, Oct 31, 2017 at 08:04:19AM -0700, Shakeel Butt wrote:
> >> > +
> >> > +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control *oc)
> >> > +{
> >> > +       struct mem_cgroup *iter;
> >> > +
> >> > +       oc->chosen_memcg = NULL;
> >> > +       oc->chosen_points = 0;
> >> > +
> >> > +       /*
> >> > +        * The oom_score is calculated for leaf memory cgroups (including
> >> > +        * the root memcg).
> >> > +        */
> >> > +       rcu_read_lock();
> >> > +       for_each_mem_cgroup_tree(iter, root) {
> >> > +               long score;
> >> > +
> >> > +               if (memcg_has_children(iter) && iter != root_mem_cgroup)
> >> > +                       continue;
> >> > +
> >>
> >> Cgroup v2 does not support charge migration between memcgs. So, there
> >> can be intermediate nodes which may contain the major charge of the
> >> processes in their leave descendents. Skipping such intermediate nodes
> >> will kind of protect such processes from oom-killer (lower on the list
> >> to be killed). Is it ok to not handle such scenario? If yes, shouldn't
> >> we document it?
> >
> > Tasks cannot be in intermediate nodes, so the only way you can end up
> > in a situation like this is to start tasks fully, let them fault in
> > their full workingset, then create child groups and move them there.
> >
> > That has attribution problems much wider than the OOM killer: any
> > local limits you would set on a leaf cgroup like this ALSO won't
> > control the memory of its tasks - as it's all sitting in the parent.
> >
> > We created the "no internal competition" rule exactly to prevent this
> > situation.
> 
> Rather than the "no internal competition" restriction I think "charge
> migration" would have resolved that situation? Also "no internal
> competition" restriction (I am assuming 'no internal competition' is
> no tasks in internal nodes, please correct me if I am wrong) has made
> "charge migration" hard to implement and thus not added in cgroup v2.
> 
> I know this is parallel discussion and excuse my ignorance, what are
> other reasons behind "no internal competition" specifically for memory
> controller?

Sorry, but this is completely off-topic.

The rationale for this decisions is in Documentation/cgroup-v2.txt.