lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Thu, 8 Feb 2024 08:03:39 -0800
From: Doug Anderson <dianders@...omium.org>
To: Bitao Hu <yaoma@...ux.alibaba.com>
Cc: akpm@...ux-foundation.org, pmladek@...e.com, kernelfans@...il.com, 
	liusong@...ux.alibaba.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCHv6 1/2] watchdog/softlockup: low-overhead detection of interrupt

Hi,

On Thu, Feb 8, 2024 at 4:54 AM Bitao Hu <yaoma@...ux.alibaba.com> wrote:
>
> The following softlockup is caused by interrupt storm, but it cannot be
> identified from the call tree. Because the call tree is just a snapshot
> and doesn't fully capture the behavior of the CPU during the soft lockup.
>   watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>   ...
>   Call trace:
>     __do_softirq+0xa0/0x37c
>     __irq_exit_rcu+0x108/0x140
>     irq_exit+0x14/0x20
>     __handle_domain_irq+0x84/0xe0
>     gic_handle_irq+0x80/0x108
>     el0_irq_naked+0x50/0x58
>
> Therefore,I think it is necessary to report CPU utilization during the
> softlockup_thresh period (report once every sample_period, for a total
> of 5 reportings), like this:
>   watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
>   CPU#28 Utilization every 4s during lockup:
>     #1: 0% system, 0% softirq, 100% hardirq, 0% idle
>     #2: 0% system, 0% softirq, 100% hardirq, 0% idle
>     #3: 0% system, 0% softirq, 100% hardirq, 0% idle
>     #4: 0% system, 0% softirq, 100% hardirq, 0% idle
>     #5: 0% system, 0% softirq, 100% hardirq, 0% idle
>   ...
>
> This would be helpful in determining whether an interrupt storm has
> occurred or in identifying the cause of the softlockup. The criteria for
> determination are as follows:
>   a. If the hardirq utilization is high, then interrupt storm should be
>   considered and the root cause cannot be determined from the call tree.
>   b. If the softirq utilization is high, then we could analyze the call
>   tree but it may cannot reflect the root cause.
>   c. If the system utilization is high, then we could analyze the root
>   cause from the call tree.
>
> The mechanism requires a considerable amount of global storage space
> when configured for the maximum number of CPUs. Therefore, adding a
> SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes"
> if the max number of CPUs is <= 128.
>
> Signed-off-by: Bitao Hu <yaoma@...ux.alibaba.com>
> ---
>  kernel/watchdog.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++
>  lib/Kconfig.debug | 13 +++++++
>  2 files changed, 104 insertions(+)

Thanks, this looks great now!

Reviewed-by: Douglas Anderson <dianders@...omium.org>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ