Date:	Wed, 16 Mar 2016 16:06:32 +0800
From:	lizf@...nel.org
To:	stable@...r.kernel.org
Cc:	linux-kernel@...r.kernel.org, Ben Zhang <benzh@...omium.org>,
	Don Zickus <dzickus@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Zefan Li <lizefan@...wei.com>
Subject: [PATCH 3.4 098/107] kernel/watchdog.c: touch_nmi_watchdog should only touch local cpu not every one

From: Ben Zhang <benzh@...omium.org>

3.4.111-rc1 review patch.  If anyone has any objections, please let me know.

------------------


commit 62572e29bc530b38921ef6059088b4788a9832a5 upstream.

I ran into a scenario where one cpu was stuck and should have panic'd
because of the NMI watchdog, but it didn't.  The reason was that another
cpu was spewing stack dumps onto the console.  Upon investigation, I
noticed that the watchdog is touched both when writing to the console
and when dumping the stack.

This causes all the cpus to reset their NMI watchdog flags and the
'stuck' cpu just spins forever.
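
To make the failure mode concrete, here is a small illustrative
userspace sketch (not kernel code; the cpu count, flag array and helper
names are invented for the example) of why an all-CPU touch lets the
stuck cpu escape its hard-lockup check while a local-only touch does
not:

/*
 * Illustrative model only: watchdog_nmi_touch[] stands in for the
 * per-cpu flag, and the two touch variants model the pre- and
 * post-patch semantics.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

static bool watchdog_nmi_touch[NR_CPUS];

/* pre-patch semantics: any caller clears the "check me" state on every cpu */
static void touch_nmi_watchdog_all(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		watchdog_nmi_touch[cpu] = true;
}

/* post-patch semantics: only the calling cpu is touched */
static void touch_nmi_watchdog_local(int this_cpu)
{
	watchdog_nmi_touch[this_cpu] = true;
}

/* per-cpu NMI handler side: a set flag means "skip this period's check" */
static bool hardlockup_check_skipped(int cpu)
{
	if (watchdog_nmi_touch[cpu]) {
		watchdog_nmi_touch[cpu] = false;
		return true;
	}
	return false;
}

int main(void)
{
	int stuck_cpu = 2, printing_cpu = 0;

	/* cpu 0 dumps stacks to the console, which touches the watchdog */
	touch_nmi_watchdog_all();
	printf("all-cpu touch: stuck cpu %d check skipped: %d\n",
	       stuck_cpu, hardlockup_check_skipped(stuck_cpu));

	/* with the local-only variant the stuck cpu's flag stays clear */
	touch_nmi_watchdog_local(printing_cpu);
	printf("local touch:   stuck cpu %d check skipped: %d\n",
	       stuck_cpu, hardlockup_check_skipped(stuck_cpu));
	return 0;
}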

This change alters the semantics of touch_nmi_watchdog() slightly.
Previously, I accidentally changed the semantics and we noticed there
was a codepath in which touch_nmi_watchdog() could be called from a
preemptible area.  That caused a BUG() to happen when
CONFIG_DEBUG_PREEMPT was enabled.  I believe it was the acpi code.

My attempt here re-introduces the change to have the
touch_nmi_watchdog() code only touch the local cpu instead of all of the
cpus.  But instead of using __get_cpu_var(), I use the
__raw_get_cpu_var() version.

This avoids the preemption problem.  However, my reasoning wasn't that
I was trying to be lazy.  Instead I rationalized it as: if preemption
is enabled then interrupts should be enabled too, and the NMI watchdog
will have no reason to trigger.  So it won't matter if the wrong cpu is
touched, because the per-cpu interrupt counters the NMI watchdog uses
should still be incrementing.
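
As a rough illustration of the accessor difference (again a simplified
userspace model, not the real __get_cpu_var()/__raw_get_cpu_var()
macros; the function names and the BUG message below are stand-ins),
the checked per-cpu lookup complains when reached from preemptible
context while the raw one silently uses whatever cpu it happens to be
on:

#include <stdbool.h>
#include <stdio.h>

static int preempt_count;            /* >0 models "preemption disabled" */
static bool debug_preempt = true;    /* stand-in for CONFIG_DEBUG_PREEMPT */

static int raw_smp_processor_id(void)
{
	return 0;   /* pretend we always run on cpu 0 in this model */
}

/* checked lookup: complains if the caller could be migrated mid-access */
static int smp_processor_id_checked(const char *caller)
{
	if (debug_preempt && preempt_count == 0)
		fprintf(stderr,
			"BUG: using smp_processor_id() in preemptible code: %s\n",
			caller);
	return raw_smp_processor_id();
}

int main(void)
{
	/* touch_nmi_watchdog() can be reached with preemption enabled... */
	preempt_count = 0;

	/* ...so the checked accessor trips the debug check... */
	smp_processor_id_checked("touch_nmi_watchdog");

	/*
	 * ...while the raw accessor just resolves a cpu id.  If the caller
	 * then migrates, the "wrong" cpu's flag gets set, but with
	 * preemption (and thus interrupts) enabled the hard-lockup
	 * detector has nothing to report anyway.
	 */
	printf("raw accessor resolved to cpu %d with no warning\n",
	       raw_smp_processor_id());
	return 0;
}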

Don said:

: I'm ok with this patch, though it does alter the behaviour of how
: touch_nmi_watchdog works.  For the most part I don't think most callers
: need to touch all of the watchdogs (on each cpu).  Perhaps a corner case
: will pop up (the scheduler??  to mimic touch_all_softlockup_watchdogs() ).
:
: But this does address an issue where if a system is locked up and one cpu
: is spewing out useful debug messages (or error messages), the hard lockup
: will fail to go off.  We have seen this on RHEL also.

Signed-off-by: Don Zickus <dzickus@...hat.com>
Signed-off-by: Ben Zhang <benzh@...omium.org>
Signed-off-by: Andrew Morton <akpm@...ux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li <lizefan@...wei.com>
---
 kernel/watchdog.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 991aa93..7527c8c 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -162,6 +162,14 @@ void touch_nmi_watchdog(void)
 				per_cpu(watchdog_nmi_touch, cpu) = true;
 		}
 	}
+	/*
+	 * Using __raw here because some code paths have
+	 * preemption enabled.  If preemption is enabled
+	 * then interrupts should be enabled too, in which
+	 * case we shouldn't have to worry about the watchdog
+	 * going off.
+	 */
+	__raw_get_cpu_var(watchdog_nmi_touch) = true;
 	touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
-- 
1.9.1
