linux-kernel - Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer mount option with tunable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Fri, 26 Jan 2018 16:17:35 -0800
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     David Rientjes <rientjes@...gle.com>
Cc:     Roman Gushchin <guro@...com>, Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Tejun Heo <tj@...nel.org>, kernel-team@...com,
        cgroups@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [patch -mm v2 2/3] mm, memcg: replace cgroup aware oom killer
 mount option with tunable

On Fri, 26 Jan 2018 14:52:59 -0800 (PST) David Rientjes <rientjes@...gle.com> wrote:

> On Fri, 26 Jan 2018, Andrew Morton wrote:
> 
> > > -ECONFUSED.  We want to have a mount option that has the sole purpose of 
> > > doing echo cgroup > /mnt/cgroup/memory.oom_policy?
> > 
> > Approximately.  Let me put it another way: can we modify your patchset
> > so that the mount option remains, and continues to have a sufficiently
> > same effect?  For backward compatibility.
> > 
> 
> The mount option would exist solely to set the oom policy of the root mem 
> cgroup, it would lose its effect of mandating that policy for any subtree 
> since it would become configurable by the user if delegated.

Why can't we propagate the mount option into the subtrees?

If the user then alters that behaviour with new added-by-David tunables
then fine, that's still backward compatible.

> Let me put it another way: if the cgroup aware oom killer is merged for 
> 4.16 without this patchset, userspace can reasonably infer the oom policy 
> from checking how cgroups were mounted.  If my followup patchset were 
> merged for 4.17, that's invalid and it becomes dependent on kernel 
> version: it could have the "groupoom" mount option but configured through 
> the root mem cgroup's memory.oom_policy to not be cgroup aware at all.

That concern seems unreasonable to me.  Is an application *really*
going to peek at the mount options to figure out what its present oom
policy is?  Well, maybe.  But that's a pretty dopey thing to do and I
wouldn't lose much sleep over breaking any such application in the very
unlikely case that such a thing was developed in that two-month window.

If that's really a concern then let's add (to Roman's patchset) a
proper interface for an application to query its own oom policy.

> That inconsistency, to me, is unfair to burden userspace with.
> 
> > > This, and fixes to fairly compare the root mem cgroup with leaf mem 
> > > cgroups, are essential before the feature is merged otherwise it yields 
> > > wildly unpredictable (and unexpected, since its interaction with 
> > > oom_score_adj isn't documented) results as I already demonstrated where 
> > > cgroups with 1GB of usage are killed instead of 6GB workers outside of 
> > > that subtree.
> > 
> > OK, so Roman's new feature is incomplete: it satisfies some use cases
> > but not others.  And we kinda have a plan to address the other use
> > cases in the future.
> > 
> 
> Those use cases are also undocumented such that the user doesn't know the 
> behavior they are opting into.  Nowhere in the patchset does it mention 
> anything about oom_score_adj other than being oom disabled.  It doesn't 
> mention that a per-process tunable now depends strictly on whether it is 
> attached to root or not.  It specifies a fair comparison between the root 
> mem cgroup and leaf mem cgroups, which is obviously incorrect by the 
> implementation itself.  So I'm not sure the user would know which use 
> cases it is valid for, which is why I've been trying to make it generally 
> purposeful and documented.

Documentation patches are nice.  We can cc:stable them too, so no huge
hurry.

> > There's nothing wrong with that!  As long as we don't break existing
> > setups while evolving the feature.  How do we do that?
> > 
> 
> We'd break the setups that actually configure their cgroups and processes 
> to abide by the current implementation since we'd need to discount 
> oom_score_adj from the the root mem cgroup usage to fix it.

Am having trouble understanding that.  Expand, please?

Can we address this (and other such) issues in the (interim)
documentation?