linux-kernel - [RFC PATCH] panic: fix deadlock in panic()

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200603141915.38739-1-cj.chengjian@huawei.com>
Date:   Wed, 3 Jun 2020 14:19:15 +0000
From:   Cheng Jian <cj.chengjian@...wei.com>
To:     <linux-kernel@...r.kernel.org>
CC:     <cj.chengjian@...wei.com>, <chenwandun@...wei.com>,
        <xiexiuqi@...wei.com>, <bobo.shaobowang@...wei.com>,
        <huawei.libin@...wei.com>, <pmladek@...e.com>,
        <sergey.senozhatsky@...il.com>, <rostedt@...dmis.org>
Subject: [RFC PATCH] panic: fix deadlock in panic()

 A deadlock caused by logbuf_lock occurs when panic:

	a) Panic CPU is running in non-NMI context
	b) Panic CPU sends out shutdown IPI via NMI vector
	c) One of the CPUs that we bring down via NMI vector holded logbuf_lock
	d) Panic CPU try to hold logbuf_lock, then deadlock occurs.

we try to re-init the logbuf_lock in printk_safe_flush_on_panic()
to avoid deadlock, but it does not work here, because :

Firstly, it is inappropriate to check num_online_cpus() here.
When the CPU bring down via NMI vector, the panic CPU willn't
wait too long for other cores to stop, so when this problem
occurs, num_online_cpus() may be greater than 1.

Secondly, printk_safe_flush_on_panic() is called after panic
notifier callback, so if printk() is called in panic notifier
callback, deadlock will still occurs. Eg, if ftrace_dump_on_oops
is set, we print some debug information, it will try to hold the
logbuf_lock.

To avoid this deadlock, drop the num_online_cpus() check and call
the printk_safe_flush_on_panic() before panic_notifier_list callback,
attempt to re-init logbuf_lock from panic CPU.

Signed-off-by: Cheng Jian <cj.chengjian@...wei.com>
---
 kernel/panic.c              | 3 +++
 kernel/printk/printk_safe.c | 3 ---
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/panic.c b/kernel/panic.c
index b69ee9e76cb2..8dbcb2227b60 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -255,6 +255,9 @@ void panic(const char *fmt, ...)
 		crash_smp_send_stop();
 	}
 
+	/* Call flush even twice. It tries harder with a single online CPU */
+	printk_safe_flush_on_panic();
+
 	/*
 	 * Run any panic handlers, including those that might need to
 	 * add information to the kmsg dump output.
diff --git a/kernel/printk/printk_safe.c b/kernel/printk/printk_safe.c
index d9a659a686f3..9ebc1723e1a4 100644
--- a/kernel/printk/printk_safe.c
+++ b/kernel/printk/printk_safe.c
@@ -269,9 +269,6 @@ void printk_safe_flush_on_panic(void)
 	 * Do not risk a double release when more CPUs are up.
 	 */
 	if (raw_spin_is_locked(&logbuf_lock)) {
-		if (num_online_cpus() > 1)
-			return;
-
 		debug_locks_off();
 		raw_spin_lock_init(&logbuf_lock);
 	}
-- 
2.17.1