Message-ID: <CAD=FV=W2g==7vQYP06_WfaVp_sPV16zX7_3V55J5AXCekT8taA@mail.gmail.com>
Date: Thu, 8 Feb 2024 08:03:39 -0800
From: Doug Anderson <dianders@...omium.org>
To: Bitao Hu <yaoma@...ux.alibaba.com>
Cc: akpm@...ux-foundation.org, pmladek@...e.com, kernelfans@...il.com,
liusong@...ux.alibaba.com, linux-kernel@...r.kernel.org
Subject: Re: [PATCHv6 1/2] watchdog/softlockup: low-overhead detection of interrupt
Hi,
On Thu, Feb 8, 2024 at 4:54 AM Bitao Hu <yaoma@...ux.alibaba.com> wrote:
>
> The following softlockup is caused by an interrupt storm, but this cannot
> be identified from the call tree, because the call tree is just a snapshot
> and doesn't fully capture the behavior of the CPU during the soft lockup.
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> ...
> Call trace:
> __do_softirq+0xa0/0x37c
> __irq_exit_rcu+0x108/0x140
> irq_exit+0x14/0x20
> __handle_domain_irq+0x84/0xe0
> gic_handle_irq+0x80/0x108
> el0_irq_naked+0x50/0x58
>
> Therefore, I think it is necessary to report CPU utilization during the
> softlockup_thresh period (report once every sample_period, for a total
> of 5 reportings), like this:
> watchdog: BUG: soft lockup - CPU#28 stuck for 23s! [fio:83921]
> CPU#28 Utilization every 4s during lockup:
> #1: 0% system, 0% softirq, 100% hardirq, 0% idle
> #2: 0% system, 0% softirq, 100% hardirq, 0% idle
> #3: 0% system, 0% softirq, 100% hardirq, 0% idle
> #4: 0% system, 0% softirq, 100% hardirq, 0% idle
> #5: 0% system, 0% softirq, 100% hardirq, 0% idle
> ...
>
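As a rough illustration of how each report line above could be derived, here
is a minimal userspace sketch. It assumes that per-category time deltas (in
nanoseconds) are available for each sample period; the struct, the function
names and the nanosecond units are hypothetical and are not taken from the
patch itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-sample time deltas, in nanoseconds (not from the patch). */
struct util_sample {
	uint64_t system_ns;
	uint64_t softirq_ns;
	uint64_t hardirq_ns;
	uint64_t idle_ns;
};

/* Share of one sample period spent in a category, clamped to 100%. */
static unsigned int util_percent(uint64_t delta_ns, uint64_t period_ns)
{
	uint64_t pct;

	assert(period_ns > 0);
	pct = delta_ns * 100 / period_ns;
	return pct > 100 ? 100 : (unsigned int)pct;
}

/* Format one line in the style of the example report; returns snprintf()'s result. */
static int format_util(char *buf, size_t len, int idx,
		       const struct util_sample *s, uint64_t period_ns)
{
	return snprintf(buf, len,
			"#%d: %u%% system, %u%% softirq, %u%% hardirq, %u%% idle",
			idx,
			util_percent(s->system_ns, period_ns),
			util_percent(s->softirq_ns, period_ns),
			util_percent(s->hardirq_ns, period_ns),
			util_percent(s->idle_ns, period_ns));
}
```

Calling format_util() once per sample with a 4s period (4000000000 ns) and a
sample whose whole period was spent in hardirq gives a line of the same shape
as the "#1:" line in the example report.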
> This would be helpful in determining whether an interrupt storm has
> occurred or in identifying the cause of the softlockup. The criteria for
> determination are as follows:
> a. If the hardirq utilization is high, then an interrupt storm should be
> considered and the root cause cannot be determined from the call tree.
> b. If the softirq utilization is high, then we could analyze the call
> tree but it may not reflect the root cause.
> c. If the system utilization is high, then we could analyze the root
> cause from the call tree.
>
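The three criteria above can be sketched as a small decision function. This is
only an illustration: the commit message does not say what counts as "high",
so the threshold is a hypothetical parameter, and the enum and struct names
are invented for the example.

```c
#include <assert.h>

/* Hypothetical utilization percentages for one sample (0..100 each). */
struct util_pct {
	unsigned int system;
	unsigned int softirq;
	unsigned int hardirq;
	unsigned int idle;
};

enum lockup_hint {
	HINT_IRQ_STORM,		/* a. call tree cannot show the root cause */
	HINT_SOFTIRQ_HEAVY,	/* b. call tree may not reflect the root cause */
	HINT_SYSTEM_HEAVY,	/* c. call tree can be analyzed directly */
	HINT_INCONCLUSIVE,	/* nothing reached the threshold */
};

/*
 * Apply criteria a, b, c in order. "high_pct" is an assumed cut-off for
 * "high" utilization; the original text does not define one.
 */
static enum lockup_hint classify_sample(const struct util_pct *u,
					unsigned int high_pct)
{
	assert(high_pct > 0 && high_pct <= 100);

	if (u->hardirq >= high_pct)
		return HINT_IRQ_STORM;
	if (u->softirq >= high_pct)
		return HINT_SOFTIRQ_HEAVY;
	if (u->system >= high_pct)
		return HINT_SYSTEM_HEAVY;
	return HINT_INCONCLUSIVE;
}
```

With the example report (100% hardirq in every sample) and any threshold up to
100, each sample would be classified as HINT_IRQ_STORM.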
> The mechanism requires a considerable amount of global storage space
> when configured for the maximum number of CPUs. Therefore, add a
> SOFTLOCKUP_DETECTOR_INTR_STORM Kconfig knob that defaults to "yes"
> if the max number of CPUs is <= 128.
>
> Signed-off-by: Bitao Hu <yaoma@...ux.alibaba.com>
> ---
> kernel/watchdog.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++
> lib/Kconfig.debug | 13 +++++++
> 2 files changed, 104 insertions(+)
Thanks, this looks great now!
Reviewed-by: Douglas Anderson <dianders@...omium.org>