lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20131216155522.GO35219@redhat.com>
Date:	Mon, 16 Dec 2013 10:55:22 -0500
From:	Don Zickus <dzickus@...hat.com>
To:	Ben Zhang <benzh@...omium.org>
Cc:	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Ingo Molnar <mingo@...hat.com>,
	Frederic Weisbecker <fweisbec@...il.com>
Subject: Re: [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu
 not every one

On Thu, Dec 05, 2013 at 12:42:24PM -0800, Ben Zhang wrote:
> I ran into a scenario where while one cpu was stuck and should have panic'd
> because of the NMI watchdog, it didn't.  The reason was another cpu was spewing
> stack dumps on to the console.  Upon investigation, I noticed that when writing
> to the console and also when dumping the stack, the watchdog is touched.
> 
> This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu
> just spins forever.
> 
> This change causes the semantics of touch_nmi_watchdog to be changed slightly.
> Previously, I accidentally changed the semantics and we noticed there was a
> codepath in which touch_nmi_watchdog could be touched from a preemtible area.
> That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled.  I believe
> it was the acpi code.
> 
> My attempt here re-introduces the change to have the touch_nmi_watchdog() code
> only touch the local cpu instead of all of the cpus.  But instead of using
> __get_cpu_var(), I use the __raw_get_cpu_var() version.
> 
> This avoids the preemption problem.  However my reasoning wasn't because I was
> trying to be lazy.  Instead I rationalized it as, well if preemption is enabled
> then interrupts should be enabled to and the NMI watchdog will have no reason
> to trigger.  So it won't matter if the wrong cpu is touched because the percpu
> interrupt counters the NMI watchdog uses should still be incrementing.

Hi Andrew,

I'm ok with this patch, though it does alter the behaviour of how
touch_nmi_watchdog works.  For the most part I don't think most callers
need to touch all of the watchdogs (on each cpu).  Perhaps a corner case
will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ).

But this does address an issue where if a system is locked up and one cpu
is spewing out useful debug messages (or error messages), the hard lockup
will fail to go off.  We have seen this on RHEL also.

I wouldn't mind see this get soaked in linux-next if possible.  And see
what falls out.

Cheers,
Don

> 
> Signed-off-by: Don Zickus <dzickus@...hat.com>
> Signed-off-by: Ben Zhang <benzh@...omium.org>
> ---
>  kernel/watchdog.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 4431610..613f611 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -158,14 +158,14 @@ void touch_all_softlockup_watchdogs(void)
>  #ifdef CONFIG_HARDLOCKUP_DETECTOR
>  void touch_nmi_watchdog(void)
>  {
> -	if (watchdog_user_enabled) {
> -		unsigned cpu;
> -
> -		for_each_present_cpu(cpu) {
> -			if (per_cpu(watchdog_nmi_touch, cpu) != true)
> -				per_cpu(watchdog_nmi_touch, cpu) = true;
> -		}
> -	}
> +	/*
> +	 * Using __raw here because some code paths have
> +	 * preemption enabled.  If preemption is enabled
> +	 * then interrupts should be enabled too, in which
> +	 * case we shouldn't have to worry about the watchdog
> +	 * going off.
> +	 */
> +	__raw_get_cpu_var(watchdog_nmi_touch) = true;
>  	touch_softlockup_watchdog();
>  }
>  EXPORT_SYMBOL(touch_nmi_watchdog);
> -- 
> 1.8.5.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ