linux-kernel - Re: [patch -mm 3/4] mm, memcg: replace memory.oom

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.10.1801221420120.16871@chino.kir.corp.google.com>
Date:   Mon, 22 Jan 2018 14:34:39 -0800 (PST)
From:   David Rientjes <rientjes@...gle.com>
To:     Tejun Heo <tj@...nel.org>
cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <guro@...com>, Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        kernel-team@...com, cgroups@...r.kernel.org,
        linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org
Subject: Re: [patch -mm 3/4] mm, memcg: replace memory.oom_group with policy
 tunable

On Sat, 20 Jan 2018, Tejun Heo wrote:

> > Hearing no response, I'll implement this as a separate tunable in a v2 
> > series assuming there are no better ideas proposed before next week.  One 
> > of the nice things about a separate tunable is that an admin can control 
> > the overall policy and they can delegate the mechanism (killall vs one 
> > process) to a user subtree.  I agree with your earlier point that killall 
> > vs one process is a property of the workload and is better defined 
> > separately.
> 
> If I understood your arguments correctly, the reasons that you thought
> your selectdion policy changes must go together with Roman's victim
> action were two-fold.
> 
> 1. You didn't want a separate knob for group oom behavior and wanted
>    it to be combined with selection policy.  I'm glad that you now
>    recognize that this would be the wrong design choice.
> 

The memory.oom_action (or mechanism) file that I've proposed is different 
than memory.oom_group: we want to provide a non-binary tunable to specify 
what action that oom killer should effect.  That could be to kill all 
processes in the subtree, similar to memory.oom_group, the local cgroup, 
or a different mechanism.  I could propose the patchset backwards, if 
necessary, because memory.oom_group is currently built upon a broken 
selection heuristic.  In other words, I could propose a memory.oom_action 
that can specify two different mechanisms that would be useful outside of 
any different selection function.  However, since the mechanism is built 
on top of the cgroup aware oom killer's policy, we can't merge it 
currently without the broken logic.

> 2. The current selection policy may be exploited by delegatee and
>    strictly hierarchical seleciton should be available.  We can debate
>    the pros and cons of different heuristics; however, to me, the
>    followings are clear.
> 
>    * Strictly hierarchical approach can't replace the current policy.
>      It doesn't work well for a lot of use cases.
> 

-ECONFUSED.  I haven't proposed any strict hierarchical approach here, 
it's configurable by the user.

>    * OOM victim selection policy has always been subject to changes
>      and improvements.
> 

That's fine, but the selection policy introduced by any cgroup aware oom 
killer is being specified now and is pretty clear cut: compare the usage 
of cgroups equally based on certain criteria and choose the largest.  I 
don't see how that could be changed or improved once the end user starts 
using it, the heuristic has become a policy.  A single cgroup aware policy 
doesn't work, if you don't have localized, per-cgroup control you have 
Michal's /admins and /students example; if you don't have hierarchical, 
subtree control you have my example of users intentionally/unintentionally 
evading the selection logic based on using cgroups.  The policy needs to 
be defined for subtrees.  There hasn't been any objection to that, so 
introducing functionality that adds a completely unnecessary and broken 
mount option and forces users to configure cgroups in a certain manner 
that no longer exists with subtree control doesn't seem helpful.

> I don't see any blocker here.  The issue you're raising can and should
> be handled separately.
> 

It can't, because the current patchset locks the system into a single 
selection criteria that is unnecessary and the mount option would become a 
no-op after the policy per subtree becomes configurable by the user as 
part of the hierarchy itself.

> Here, whether a workload can survive being killed piece-wise or not is
> an inherent property of the workload and a pretty binary one at that.
> I'm not necessarily against changing it to take string inputs but
> don't see rationales for doing so yet.
> 

We don't need the unnecessary level in the cgroup hierarchy that enables 
memory.oom_group, as proposed, and serves no other purpose.  It's 
perfectly valid for subtrees to run user executors whereas a "killall" 
mechanism is valid for some workloads and not others.  We do not need 
ancestor cgroups locking that decision into place.