linux-kernel - Re: [PATCH] kernel/watchdog: fix spurious hard lockups

Message-ID: <20170621134055.4skam7ysw5ffvjr6@redhat.com>
Date:   Wed, 21 Jun 2017 09:40:55 -0400
From:   Don Zickus <dzickus@...hat.com>
To:     kan.liang@...el.com
Cc:     linux-kernel@...r.kernel.org, mingo@...nel.org,
        akpm@...ux-foundation.org, babu.moger@...cle.com,
        atomlin@...hat.com, prarit@...hat.com,
        torvalds@...ux-foundation.org, peterz@...radead.org,
        tglx@...utronix.de, eranian@...gle.com, acme@...hat.com,
        ak@...ux.intel.com, stable@...r.kernel.org
Subject: Re: [PATCH] kernel/watchdog: fix spurious hard lockups

On Tue, Jun 20, 2017 at 02:33:09PM -0700, kan.liang@...el.com wrote:
> From: Kan Liang <Kan.liang@...el.com>
> 
> Some users reported spurious NMI watchdog timeouts.
> 
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
> 
> This problem was originally added by commit 58687acba592
> ("lockup_detector: Combine nmi_watchdog and softlockup detector").
> Previously the NMI watchdog would always check jiffies, which were
> ticking fast enough. But now the backing is quite slow so the expire
> time becomes more sensitive.
> 
> For mainline the right fix is to switch the NMI watchdog to reference
> cycles, which tick always at the same rate independent of turbo mode.
> But this requires some complicated changes in perf, which are too
> difficult to backport. Since we need a stable fix too just increase the
> NMI watchdog rate here to avoid the spurious timeouts. This is not an
> ideal fix because a 3x as large Turbo range could still fail, but for
> now that's not likely.
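
The timing argument in the quoted description can be checked with a little arithmetic. The following is a minimal userspace sketch, not kernel code. It assumes the x86 sample period is cpu_khz * 1000 * watchdog_thresh cycles, that the soft watchdog hrtimer fires every (2 * watchdog_thresh) / 5 seconds, and that a spurious report becomes possible once the NMI period in wall-clock time drops below the hrtimer period. The helper names and the 1 GHz / 3 GHz / 8 GHz frequencies are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles the perf counter counts before raising the NMI.
 * Assumed to mirror the x86 formula: cpu_khz * 1000 * watchdog_thresh. */
static uint64_t nmi_sample_cycles(uint64_t base_khz, unsigned int thresh_s)
{
	return base_khz * 1000ULL * thresh_s;
}

/* Wall-clock milliseconds until the NMI fires when the core really runs
 * at actual_khz. 1 kHz is one cycle per millisecond, so cycles / kHz = ms. */
static uint64_t nmi_wall_ms(uint64_t cycles, uint64_t actual_khz)
{
	assert(actual_khz > 0);
	return cycles / actual_khz;
}

/* Period of the hrtimer that bumps the interrupt count, in milliseconds.
 * Assumed: soft lockup threshold (2 * thresh) divided by 5. */
static uint64_t hrtimer_period_ms(unsigned int thresh_s)
{
	return (2ULL * thresh_s * 1000ULL) / 5ULL;
}

/* Nonzero if two NMIs can land inside one hrtimer interval, i.e. the
 * hard lockup check may see an unchanged interrupt count. */
static int spurious_possible(uint64_t base_khz, uint64_t turbo_khz,
			     unsigned int thresh_s, unsigned int multiplier)
{
	uint64_t cycles = multiplier * nmi_sample_cycles(base_khz, thresh_s);

	return nmi_wall_ms(cycles, turbo_khz) < hrtimer_period_ms(thresh_s);
}
```

Under these assumptions, with watchdog_thresh = 10 the hrtimer period is 4000 ms. A hypothetical part with a 1 GHz base clock running at 3 GHz turbo reaches the NMI after about 3333 ms, so spurious_possible(1000000, 3000000, 10, 1) is nonzero. With the 3x multiplier from the patch the same part needs 10000 ms and the check passes, while an 8x turbo ratio would still fail, which is consistent with the "3x as large Turbo range" caveat.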

As this is an Intel problem, we should at least restrict it to 
arch/x86/kernel/apic/hw_nmi.c.  I don't want to penalize other arches yet.
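
One way to read this suggestion is to apply the factor inside the x86 sample-period helper rather than at the generic call site in kernel/watchdog_hld.c. The sketch below is a userspace model of that idea only, not a tested kernel patch. The function name x86_nmi_sample_period, the NMI_TURBO_MARGIN constant, and passing cpu_khz as a parameter are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical safety factor, matching the 3x used in the posted patch. */
#define NMI_TURBO_MARGIN 3ULL

/* Model of an x86-only helper: the margin is applied here, so the generic
 * caller keeps using the returned period unmodified and other
 * architectures are unaffected. cpu_khz is a parameter only so that the
 * sketch is self-contained. */
static uint64_t x86_nmi_sample_period(uint64_t cpu_khz, int watchdog_thresh)
{
	assert(watchdog_thresh > 0);
	return NMI_TURBO_MARGIN * cpu_khz * 1000ULL * (uint64_t)watchdog_thresh;
}
```

With this shape the line in watchdog_nmi_enable() would stay as it was before the patch, and only the x86 helper would change.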

> 
> Signed-off-by: Kan Liang <Kan.liang@...el.com>
> Cc: stable@...r.kernel.org
> Fixes: 58687acba592 ("lockup_detector: Combine nmi_watchdog and
> softlockup detector")
> ---
> 
> The right fix for mainline can be found here.
> perf/x86/intel: enable CPU ref_cycles for GP counter
> perf/x86/intel, watchdog: Switch NMI watchdog to ref cycles on x86
> https://patchwork.kernel.org/patch/9779087/
> https://patchwork.kernel.org/patch/9779089/

Does that mean this fix is restricted to just -stable then?  Otherwise I am
confused why we should take this patch, if you have a better fix above.

Cheers,
Don

> 
>  kernel/watchdog_hld.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c
> index 54a427d1f344..0f7c6e758b82 100644
> --- a/kernel/watchdog_hld.c
> +++ b/kernel/watchdog_hld.c
> @@ -164,7 +164,7 @@ int watchdog_nmi_enable(unsigned int cpu)
>  		firstcpu = 1;
>  
>  	wd_attr = &wd_hw_attr;
> -	wd_attr->sample_period = hw_nmi_get_sample_period(watchdog_thresh);
> +	wd_attr->sample_period = 3 * hw_nmi_get_sample_period(watchdog_thresh);
>  
>  	/* Try to register using hardware perf events */
>  	event = perf_event_create_kernel_counter(wd_attr, cpu, NULL, watchdog_overflow_callback, NULL);
> -- 
> 2.11.0
>