linux-kernel - [patch] mm, oom: make a last minute check to prevent unnecessary memcg oom kills

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.21.2003101454580.142656@chino.kir.corp.google.com>
Date:   Tue, 10 Mar 2020 14:55:50 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
cc:     Vlastimil Babka <vbabka@...e.cz>, Michal Hocko <mhocko@...nel.org>,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: [patch] mm, oom: make a last minute check to prevent unnecessary
 memcg oom kills

Killing a user process as a result of hitting memcg limits is a serious
decision that is unfortunately needed only when no forward progress in
reclaiming memory can be made.

Deciding the appropriate oom victim can take a sufficient amount of time
that allows another process that is exiting to actually uncharge to the
same memcg hierarchy and prevent unnecessarily killing user processes.

An example is to prevent *multiple* unnecessary oom kills on a system
with two cores where the oom kill occurs when there is an abundance of
free memory available:

Memory cgroup out of memory: Killed process 628 (repro) total-vm:41944kB, anon-rss:40888kB, file-rss:496kB, shmem-rss:0kB, UID:0 pgtables:116kB oom_score_adj:0
<immediately after>
repro invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
CPU: 1 PID: 629 Comm: repro Not tainted 5.6.0-rc5+ #130
Call Trace:
 dump_stack+0x78/0xb6
 dump_header+0x55/0x240
 oom_kill_process+0xc5/0x170
 out_of_memory+0x305/0x4a0
 try_charge+0x77b/0xac0
 mem_cgroup_try_charge+0x10a/0x220
 mem_cgroup_try_charge_delay+0x1e/0x40
 handle_mm_fault+0xdf2/0x15f0
 do_user_addr_fault+0x21f/0x420
 async_page_fault+0x2f/0x40
memory: usage 61336kB, limit 102400kB, failcnt 74

Notice the second memcg oom kill shows usage is >40MB below its limit of
100MB but a process is still unnecessarily killed because the decision has
already been made to oom kill by calling out_of_memory() before the
initial victim had a chance to uncharge its memory.

Make a last minute check to determine if an oom kill is really needed to
prevent unnecessary oom killing.

Cc: Vlastimil Babka <vbabka@...e.cz>
Cc: Michal Hocko <mhocko@...nel.org>
Cc: stable@...r.kernel.org
Signed-off-by: David Rientjes <rientjes@...gle.com>
---
 include/linux/memcontrol.h |  7 +++++++
 mm/memcontrol.c            |  2 +-
 mm/oom_kill.c              | 16 +++++++++++++---
 3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -445,6 +445,8 @@ void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
 
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg);
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	if (mem_cgroup_disabled())
@@ -945,6 +947,11 @@ static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1286,7 +1286,7 @@ void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
  */
-static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
+unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 {
 	unsigned long margin = 0;
 	unsigned long count;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -972,9 +972,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	}
 	task_unlock(victim);
 
-	if (__ratelimit(&oom_rs))
-		dump_header(oc, victim);
-
 	/*
 	 * Do we need to kill the entire memory cgroup?
 	 * Or even one of the ancestor memory cgroups?
@@ -982,6 +979,19 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
+	if (is_memcg_oom(oc)) {
+		cond_resched();
+
+		/* One last check: do we *really* need to kill? */
+		if (mem_cgroup_margin(oc->memcg) >= (1 << oc->order)) {
+			put_task_struct(victim);
+			return;
+		}
+	}
+
+	if (__ratelimit(&oom_rs))
+		dump_header(oc, victim);
+
 	__oom_kill_process(victim, message);
 
 	/*