Message-ID: <87zgimfff6.fsf@mpe.ellerman.id.au>
Date:   Thu, 09 Jun 2022 17:45:49 +1000
From:   Michael Ellerman <michaele@....ibm.com>
To:     Nathan Lynch <nathanl@...ux.ibm.com>,
        Laurent Dufour <ldufour@...ux.ibm.com>
Cc:     linux-kernel@...r.kernel.org, npiggin@...il.com, paulus@...ba.org,
        linuxppc-dev@...ts.ozlabs.org, haren@...ux.vnet.ibm.com
Subject: Re: [PATCH 0/2] Disabling NMI watchdog during LPM's memory transfer

Nathan Lynch <nathanl@...ux.ibm.com> writes:
> Laurent Dufour <ldufour@...ux.ibm.com> writes:
...
>
>> There are ongoing investigations to clarify where and how this latency is
>> happening. I'm not excluding any other issue in the Linux kernel, but right
>> now, this looks to be the best option to prevent system crash during
>> LPM.
>
> It will prevent the likely crash mode for enterprise distros with
> default watchdog tunables that our internal test environments happen to
> use. But if someone were to run the same scenario with softlockup_panic
> enabled, or with the RCU stall timeout lower than the watchdog
> threshold, the failure mode would be different.
>
> Basically I'm saying:
> * Some users may actually want the OS to panic when it's in this state,
>   because their applications can't work correctly.
> * But if we're going to inhibit one watchdog, we should inhibit them
>   all.

I'm sympathetic to both of your arguments.

But I think there is a key difference between the NMI watchdog and other
watchdogs, which is that the NMI watchdog will use the unsafe NMI to
interrupt other CPUs, and that can cause the system to crash when other
watchdogs would just print a backtrace.

We had the same problem with the rcu_sched stall detector until we
changed it to use the "safe" NMI, see:
  5cc05910f26e ("powerpc/64s: Wire up arch_trigger_cpumask_backtrace()")


So even if the NMI watchdog is disabled there are still the other
watchdogs enabled, which should print backtraces by default, and if
desired can also be configured to cause a panic.
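
As a rough illustration of that configuration (a sketch only, not part of
the patch series; the values are arbitrary examples and availability of
each knob depends on the kernel config), a sysctl.conf fragment might look
like:

```conf
# Panic instead of just printing a backtrace when the soft lockup
# detector fires (the softlockup_panic case mentioned above).
kernel.softlockup_panic = 1

# Panic when an RCU stall is detected, rather than only warning.
kernel.panic_on_rcu_stall = 1

# Watchdog threshold in seconds (example value).
kernel.watchdog_thresh = 10
```

With both panic knobs left at 0 those detectors only print backtraces.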

Rather than disabling the NMI watchdog, could we increase its timeout
(by how much?) during LPM, so that it is less likely to fire in normal
usage but is still there as a backup if the system is completely
clogged?

cheers
