linux-kernel - Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170816154325.GB29131@castle.DHCP.thefacebook.com>
Date:   Wed, 16 Aug 2017 16:43:25 +0100
From:   Roman Gushchin <guro@...com>
To:     David Rientjes <rientjes@...gle.com>
CC:     <linux-mm@...ck.org>, Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Tejun Heo <tj@...nel.org>, <kernel-team@...com>,
        <cgroups@...r.kernel.org>, <linux-doc@...r.kernel.org>,
        <linux-kernel@...r.kernel.org>
Subject: Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

On Tue, Aug 15, 2017 at 02:47:10PM -0700, David Rientjes wrote:
> On Tue, 15 Aug 2017, Roman Gushchin wrote:
> 
> > > I'm curious about the decision made in this conditional and how 
> > > oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
> > > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
> > > should otherwise be disabled.
> > > 
> > > It's undocumented in the changelog, but I'm questioning whether it's the 
> > > right decision.  Doesn't it make sense to kill all tasks that are not oom 
> > > disabled, and allow the user to still protect certain processes by their 
> > > /proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
> > > protection without a sibling memcg and its own reservation of memory.  I'm 
> > > thinking about a process that governs jobs inside the memcg and if there 
> > > is an oom kill, it wants to do logging and any cleanup necessary before 
> > > exiting itself.  It seems like a powerful combination if coupled with oom 
> > > notification.
> > 
> > Good question!
> > I think, that an ability to override any oom_score_adj value and get all tasks
> > killed is more important, than an ability to kill all processes with some
> > exceptions.
> > 
> 
> I'm disagreeing because getting all tasks killed is not necessarily 
> something that only the kernel can do.  If all processes are oom disabled, 
> that's a configuration issue done by sysadmin and the kernel should decide 
> to kill the next largest memory cgroup or lower priority memory cgroup.  
> It's not killing things like sshd that intentionally oom disable 
> themselves.
> 
> You could argue that having an oom disabled process attached to these 
> memcgs in the first place is also a configuration issue, but the problem 
> is that in cgroup v2 with a restriction on processes only being attached 
> at the leaf cgroups that there is no competition for memory in this case.  
> I must assign memory resources to that sshd, or "Activity Manager" 
> described by the cgroup v1 documentation, just to prevent it from being 
> killed.
> 
> I think the latter of what you describe, killing all processes with some 
> exceptions, is actually quite powerful.  I can guarantee that processes 
> that set themselves to oom disabled are really oom disabled and I don't 
> need to work around that in the cgroup hierarchy only because of this 
> caveat.  I can also oom disable my Activity Manger that wants to wait on 
> oom notification and collect the oom kill logs, raise notifications, and 
> perhaps restart the process that it manage.
> 
> > In your example someone still needs to look after the remaining process,
> > and kill it after some timeout, if it will not quit by itself, right?
> > 
> 
> No, it can restart the process that was oom killed; or it can be sshd and 
> I can still ssh into my machine.
> 
> > The special treatment of the -1000 value (without oom_kill_all_tasks)
> > is required only to not to break the existing setups.
> > 
> 
> I think as a general principle that allowing an oom disabled process to be 
> oom killed is incorrect and if you really do want these to be killed, then 
> (1) either your oom_score_adj is already wrong or (2) you can wait on oom 
> notification and exit.

It's natural to expect that inside a container there are their own sshd,
"activity manager" or some other stuff, which can play with oom_score_adj.
If it can override the upper cgroup-level settings, the whole delegation model
is broken.

You can think about the oom_kill_all_tasks like the panic_on_oom,
but on a cgroup level. It should _guarantee_, that in case of oom
the whole cgroup will be destroyed completely, and will not remain
in a non-consistent state.

The model you're describing is based on a trust given to these oom-unkillable
processes on system level. But we can't really trust some unknown processes
inside a cgroup that they will be able to do some useful work and finish
in a reasonable time; especially in case of a global memory shortage.
That means someone needs to look at cgroup after each OOM and check if there
are remaining tasks. If so, the whole functionality is useless.

If some complex post-oom handling is required, it should be performed
from another cgroup (protected by the lower oom_priority value).

So, for example:

root
  |
  A
 / \
B   C

B: oom_priority=10, oom_kill_all_tasks=1, contains workload
C: oom_priority=0, oom_kill_all_tasks=0, contains control stuff

If B is killed by OOM, an "activity manager" in C can be notified
and perform some actions.

Thanks!

Roman