linux-kernel - Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

Message-ID: <alpine.DEB.2.10.1707101547010.116811@chino.kir.corp.google.com>
Date:   Mon, 10 Jul 2017 16:05:49 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Roman Gushchin <guro@...com>
cc:     linux-mm@...ck.org, Michal Hocko <mhocko@...nel.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Tejun Heo <tj@...nel.org>, kernel-team@...com,
        cgroups@...r.kernel.org, linux-doc@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

On Wed, 21 Jun 2017, Roman Gushchin wrote:

> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit systems with many running
> containers well. There are three main issues:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 

Yes, the original motivation was to limit killing to a single process, if 
possible.  To do that, we kill the process with the largest rss to free 
the most memory and rely on the user to configure /proc/pid/oom_score_adj 
if something else should be prioritized.
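
As a rough illustration of that heuristic, here is a simplified userspace model, not the kernel's actual scoring code. It assumes the score is the task's rss in pages plus oom_score_adj scaled by one thousandth of total memory, and that an adjustment of -1000 exempts the task. The names task_model, model_badness and pick_victim are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a task; not the kernel's task_struct. */
struct task_model {
	const char *comm;
	unsigned long rss_pages;
	int oom_score_adj;		/* -1000..1000, as in /proc/pid/oom_score_adj */
};

/* Simplified score: rss plus the adjustment in thousandths of total memory. */
static long model_badness(const struct task_model *t, unsigned long total_pages)
{
	long points;

	if (t->oom_score_adj == -1000)
		return 0;		/* assumed: such a task is never selected */
	points = (long)t->rss_pages +
		 (long)t->oom_score_adj * (long)(total_pages / 1000);
	return points > 0 ? points : 1;
}

/* Return the task with the highest score, or NULL if none is eligible. */
static const struct task_model *pick_victim(const struct task_model *tasks,
					    size_t n, unsigned long total_pages)
{
	const struct task_model *best = NULL;
	long best_points = 0;
	size_t i;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		long points = model_badness(&tasks[i], total_pages);

		if (points > best_points) {
			best = &tasks[i];
			best_points = points;
		}
	}
	return best;
}
```

With all adjustments at zero the model picks the largest rss; a positive oom_score_adj on a smaller task can move it ahead of a larger one.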

With containerization and overcommit of system memory, we concur that 
killing the single largest process isn't always preferable and neglects 
the priority of its memcg.  Your motivation seems to be to provide 
fairness between one memcg with a large process and one memcg with a large 
number of small processes; I'm curious if you are concerned about the 
priority of a memcg hierarchy (how important that "job" is) or whether you 
are strictly concerned with "largeness" of memcgs relative to each other.

> 2) Containers often do not expect that some random process inside
> will be killed. In many cases a much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 

We agree.

> 3) Per-process oom_score_adj affects global OOM, so it's a breach
> in the isolation.
> 

This should only be a consequence of overcommitting memcgs at the top level 
such that the system oom killer is ever actually invoked; otherwise, 
per-process oom_score_adj works well for memcg oom killing.

> To address these issues, cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it tries to find the biggest memory consumer,
> and free memory by killing corresponding task(s). The difference
> from the "traditional" OOM killer is that it can treat memory cgroups
> as memory consumers as well as single processes.
> 
> By default, it will look for the biggest leaf cgroup, and kill
> the largest task inside.
> 
> But a user can change this behavior by enabling the per-cgroup
> oom_kill_all_tasks option. If set, it causes the OOM killer to treat
> the whole cgroup as an indivisible memory consumer. If it's
> selected as an OOM victim, all tasks belonging to it will be killed.
> 

These are two different things, right?  We can change how the system oom 
killer chooses victims when memcg hierarchies overcommit the system, so 
that it does not strictly prefer the single process with the largest rss, 
without also killing everything attached to the memcg.
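
To make the distinction concrete, here is a minimal userspace sketch of the two separate decisions in the quoted proposal: which leaf cgroup is chosen, and how many of its tasks are killed. It is only a model under stated assumptions: a cgroup's usage is taken as the sum of its tasks' rss (real memcg accounting covers more than that), and task_model, leaf_cgroup, biggest_leaf and collect_victims are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not kernel structures. */
struct task_model {
	int pid;
	unsigned long rss_pages;
};

struct leaf_cgroup {
	const char *name;
	const struct task_model *tasks;
	size_t nr_tasks;
	int oom_kill_all_tasks;		/* models the proposed per-cgroup knob */
};

/* Assumed usage metric: sum of task rss in the cgroup. */
static unsigned long cgroup_usage(const struct leaf_cgroup *cg)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < cg->nr_tasks; i++)
		sum += cg->tasks[i].rss_pages;
	return sum;
}

/* Decision 1: pick the leaf cgroup with the largest aggregate usage. */
static const struct leaf_cgroup *biggest_leaf(const struct leaf_cgroup *cgs,
					      size_t n)
{
	const struct leaf_cgroup *best = NULL;
	unsigned long best_usage = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		unsigned long usage = cgroup_usage(&cgs[i]);

		if (usage > best_usage) {
			best = &cgs[i];
			best_usage = usage;
		}
	}
	return best;
}

/* Decision 2: the largest task only, or every task if the knob is set. */
static size_t collect_victims(const struct leaf_cgroup *cg, int *pids,
			      size_t max)
{
	size_t i, n = 0, largest = 0;

	assert(cg != NULL && pids != NULL);
	if (cg->nr_tasks == 0 || max == 0)
		return 0;
	if (cg->oom_kill_all_tasks) {
		for (i = 0; i < cg->nr_tasks && n < max; i++)
			pids[n++] = cg->tasks[i].pid;
		return n;
	}
	for (i = 1; i < cg->nr_tasks; i++)
		if (cg->tasks[i].rss_pages > cg->tasks[largest].rss_pages)
			largest = i;
	pids[0] = cg->tasks[largest].pid;
	return 1;
}
```

In this model the selection in biggest_leaf is independent of the kill-all knob, so either one could be changed without the other.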

Separately: do you not intend to support memcg priorities at all, but 
rather strictly consider the "largeness" of a memcg versus other memcgs?

In our methodology, each memcg is assigned a priority value and the 
iteration of the hierarchy simply compares and visits the memcg with the 
lowest priority at each level and then selects the largest process to 
kill.  This could also support a "kill-all" knob.

	struct mem_cgroup *memcg = root_mem_cgroup;
	struct mem_cgroup *child;
	struct mem_cgroup *low_memcg;
	unsigned long low_priority;

next:
	low_memcg = NULL;
	low_priority = ULONG_MAX;
	/* find the lowest priority child at this level */
	for_each_child_of_memcg(child, memcg) {
		unsigned long prio = memcg_oom_priority(child);

		if (prio < low_priority) {
			low_memcg = child;
			low_priority = prio;
		}
	}
	if (low_memcg) {
		/* descend into it and repeat until we reach a leaf */
		memcg = low_memcg;
		goto next;
	}
	oom_kill_process_from_memcg(memcg);

So this is a priority-based model that is different from your aggregate 
usage model, but I think it allows userspace to define a more powerful 
policy.  We certainly may want to kill from a memcg with a single large 
process, or we may want to kill from a memcg with several small processes; 
it depends on the importance of that job.
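
For anyone who wants to experiment with the policy outside the kernel, the descent described above can be modeled in userspace roughly as follows. This is only a sketch: memcg_node and select_victim_memcg are hypothetical names, the tree is a plain array of child pointers, and "lower value is killed first" is the assumed convention.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for a memcg; not the kernel's struct mem_cgroup. */
struct memcg_node {
	const char *name;
	unsigned long oom_priority;	/* assumed: lower value is killed first */
	struct memcg_node **children;
	size_t nr_children;
};

/*
 * Starting at root, follow the lowest priority child at each level until
 * a node with no eligible children is reached.  A child whose priority is
 * ULONG_MAX is never followed, because the comparison is strict.
 */
static struct memcg_node *select_victim_memcg(struct memcg_node *root)
{
	struct memcg_node *memcg = root;

	assert(root != NULL);
	for (;;) {
		struct memcg_node *low = NULL;
		unsigned long low_priority = ULONG_MAX;
		size_t i;

		for (i = 0; i < memcg->nr_children; i++) {
			struct memcg_node *child = memcg->children[i];

			if (child->oom_priority < low_priority) {
				low = child;
				low_priority = child->oom_priority;
			}
		}
		if (!low)
			return memcg;	/* kill the largest process from here */
		memcg = low;
	}
}
```

Note that the size of a memcg plays no role in this model; only the priorities that userspace assigned decide which subtree is visited.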