Message-ID: <20150617121104.GD25056@dhcp22.suse.cz>
Date:	Wed, 17 Jun 2015 14:11:04 +0200
From:	Michal Hocko <mhocko@...e.cz>
To:	linux-mm@...ck.org
Cc:	David Rientjes <rientjes@...gle.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Tejun Heo <tj@...nel.org>,
	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
	Andrew Morton <akpm@...ux-foundation.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: [RFC -v2] panic_on_oom_timeout

Hi,
I was thinking about this some more and I am increasingly convinced
that we shouldn't care about the panic_on_oom=2 configuration for now
and should go with the simplest solution first. I have revisited my
original patch and, based on Tetsuo's feedback, replaced the delayed
work with a timer.

I think we can rely on timers. A downside is that we cannot dump the
full OOM report from IRQ context because we rely on task_lock, which
is not IRQ safe. But I do not think we really need it: an OOM report
will already be in the log most of the time, and show_mem will tell us
the current memory situation.

What do you think?
---
From cec53a806cc19b691d6a63544a1283aa90cf92f1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.cz>
Date: Tue, 9 Jun 2015 16:15:42 +0200
Subject: [PATCH] oom: implement panic_on_oom_timeout for panic_on_oom=1

The OOM killer is a desperate last-resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% reliable
and may result in an unusable or locked-up system.

The panic_on_oom sysctl knob allows setting the OOM policy to panic
the system instead of trying to resolve the OOM condition. This might
be useful for several reasons, e.g. to reduce the downtime to a
predictable amount of time, or to get a crash dump of the system and
debug the issue post-mortem.

panic_on_oom is, however, a big hammer in the many situations where
the OOM condition could be resolved in a reasonable time. So it would
be good to have some middle ground: allow the OOM killer to do its
job, but have a failover for when things go wrong and it is not able
to make any further progress for a considerable amount of time.

This patch implements the panic_on_oom_timeout sysctl, which is active
only when panic_on_oom=1 and configures a maximum time for the OOM
killer to resolve the OOM situation. If the system is still under OOM
after the timeout expires, the system will panic. A reasonably chosen
timeout can both ride out transient OOM conditions and provide a
predictable time frame for the OOM condition.
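
For illustration, a minimal userspace sketch of how the policy might
be enabled once the knob is exported under /proc/sys/vm (as the
vm_table hunk below does); the 60 second timeout is just an example
value:

#include <stdio.h>
#include <stdlib.h>

/* Write a single integer to a sysctl file under /proc/sys. */
static void write_sysctl(const char *path, int value)
{
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		exit(EXIT_FAILURE);
	}
	fprintf(f, "%d\n", value);
	fclose(f);
}

int main(void)
{
	/* Panic on a system-wide OOM ... */
	write_sysctl("/proc/sys/vm/panic_on_oom", 1);
	/* ... but give the OOM killer 60 seconds (example value) to
	 * resolve the situation before actually panicking. */
	write_sysctl("/proc/sys/vm/panic_on_oom_timeout", 60);
	return 0;
}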

The feature is implemented by a timer which is scheduled in
out_of_memory when the OOM condition is declared for the first time
(oom_victims is still zero) and canceled in exit_oom_victim once the
oom_victims count drops back to zero.
During this period the OOM killer cannot kill new tasks; it only
allows exiting or killed tasks to access memory reserves (and to
increase the oom_victims counter via mark_oom_victim) in order to make
progress, so it is reasonable to treat an elevated oom_victims count
as an ongoing OOM condition.
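
To make the flow easier to follow, here are the relevant timer calls
pulled together from the hunks below (a condensed view of the patch,
not a separate compilation unit; the 4.x timer API with an unsigned
long callback argument is assumed):

/* mm/oom_kill.c: runs in timer (IRQ) context, hence no dump_header(),
 * only show_mem(), before panicking. */
static void delayed_panic_on_oom(unsigned long unused)
{
	pr_info("panic_on_oom timeout %ds has expired\n",
		sysctl_panic_on_oom_timeout);
	show_mem(SHOW_MEM_FILTER_NODES);
	panic("Out of memory: system-wide panic_on_oom is enabled\n");
}

static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0);

/* check_panic_on_oom(): arm only for the first OOM (oom_victims == 0),
 * serialized by oom_lock. */
mod_timer(&panic_on_oom_timer,
	  jiffies + (sysctl_panic_on_oom_timeout * HZ));

/* exit_oom_victim(): disarm once oom_victims drops back to zero. */
del_timer(&panic_on_oom_timer);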

A timeout for panic_on_oom=2 is not currently supported because it
would make the code more complicated and it is not even clear whether
this combination is useful. panic_on_oom=2 is disputable on its own
because the system is still usable at that point, so the administrator
can intervene to resolve the OOM situation (e.g. relax NUMA
restrictions, increase the memory limit of the OOM memcg, reboot/panic
the system, or simply kill a selected task which might be blocking the
OOM killer from making progress).

The log will then contain something like:
[   32.071128] Out of memory: Kill process 2998 (mem_eater) score 42 or sacrifice child
[   32.074022] Killed process 2998 (mem_eater) total-vm:14172kB, anon-rss:628kB, file-rss:4kB
[   32.091849] mem_eater invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[   32.103139] mem_eater cpuset=/ mems_allowed=0
[...]
[   32.354388] Killed process 3001 (mem_eater) total-vm:14172kB, anon-rss:580kB, file-rss:4kB
[   33.088078] panic_on_oom timeout 1s has expired
[   33.089062] Mem-Info:
[   33.089557] active_anon:7991 inactive_anon:8106 isolated_anon:665
[   33.089557]  active_file:183 inactive_file:125 isolated_file:0
[   33.089557]  unevictable:0 dirty:0 writeback:1037 unstable:0
[   33.089557]  slab_reclaimable:2492 slab_unreclaimable:3253
[   33.089557]  mapped:275 shmem:0 pagetables:631 bounce:0
[   33.089557]  free:400 free_pcp:20 free_cma:0
[...]
[   33.092023] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[   33.092023]
[   33.092023] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0-oomtimeout5-00001-g667aff689092 #582
[   33.092023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[   33.092023]  0000000000000101 ffff880007803da8 ffffffff81538b24 ffffffff8108a645
[   33.092023]  ffffffff817ee27d ffff880007803e28 ffffffff81537ac6 ffff880007803e08
[   33.092023]  ffff880000000008 ffff880007803e38 ffff880007803dd8 ffffffff8108a645
[   33.092023] Call Trace:
[   33.092023]  <IRQ>  [<ffffffff81538b24>] dump_stack+0x4f/0x7b
[   33.092023]  [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[   33.092023]  [<ffffffff81537ac6>] panic+0xbe/0x1eb
[   33.092023]  [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[   33.092023]  [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[   33.092023]  [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[   33.092023]  [<ffffffff810fc596>] delayed_panic_on_oom+0x35/0x35
[   33.092023]  [<ffffffff8109866f>] call_timer_fn+0xa7/0x1d0
[   33.092023]  [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[   33.092023]  [<ffffffff81098eb6>] run_timer_softirq+0x22d/0x2a1
[   33.092023]  [<ffffffff810463f3>] __do_softirq+0x141/0x322
[   33.092023]  [<ffffffff810467af>] irq_exit+0x6f/0xb6

TODO: Documentation update
Signed-off-by: Michal Hocko <mhocko@...e.cz>
---
 include/linux/oom.h |  1 +
 kernel/sysctl.c     |  8 ++++++++
 mm/oom_kill.c       | 44 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 061e0ffd3493..6884c8dc37a0 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d6fff89b78db..3ac2e5d0b1e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &two,
 	},
 	{
+		.procname	= "panic_on_oom_timeout",
+		.data		= &sysctl_panic_on_oom_timeout,
+		.maxlen		= sizeof(sysctl_panic_on_oom_timeout),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "oom_kill_allocating_task",
 		.data		= &sysctl_oom_kill_allocating_task,
 		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fb1275e200..d48e02627ad0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
 #include <trace/events/oom.h>
 
 int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
 
@@ -430,6 +431,16 @@ void mark_oom_victim(struct task_struct *tsk)
 	atomic_inc(&oom_victims);
 }
 
+static void delayed_panic_on_oom(unsigned long unused)
+{
+	/* We are in the irq context so we cannot call dump_header */
+	pr_info("panic_on_oom timeout %ds has expired\n", sysctl_panic_on_oom_timeout);
+	show_mem(SHOW_MEM_FILTER_NODES);
+	panic("Out of memory: system-wide panic_on_oom is enabled\n");
+}
+
+static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0);
+
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
@@ -437,8 +448,10 @@ void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);
 
-	if (!atomic_dec_return(&oom_victims))
+	if (!atomic_dec_return(&oom_victims)) {
+		del_timer(&panic_on_oom_timer);
 		wake_up_all(&oom_victims_wait);
+	}
 }
 
 /**
@@ -539,7 +552,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
-		return;
+		goto no_oom_victim;
 	} else if (victim != p) {
 		get_task_struct(p);
 		put_task_struct(victim);
@@ -581,6 +594,15 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
+	return;
+no_oom_victim:
+	/*
+	 * If there is no oom victim selected (e.g. we might have raced with
+	 * an exiting task) then cancel delayed oom panic handler.
+	 */
+	if (!atomic_read(&oom_victims))
+		del_timer(&panic_on_oom_timer);
+
 }
 #undef K
 
@@ -624,6 +646,24 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
+
+	if (sysctl_panic_on_oom_timeout) {
+		if (sysctl_panic_on_oom > 1) {
+			pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
+		} else {
+			/*
+			 * Only schedule the delayed panic_on_oom when this is
+			 * the first OOM triggered. oom_lock will protect us
+			 * from races
+			 */
+			if (atomic_read(&oom_victims))
+				return;
+
+			mod_timer(&panic_on_oom_timer,
+					jiffies + (sysctl_panic_on_oom_timeout * HZ));
+			return;
+		}
+	}
 	dump_header(NULL, gfp_mask, order, memcg, nodemask);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
-- 
2.1.4


-- 
Michal Hocko
SUSE Labs
