[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <7301930a-12f8-46d3-b427-144c40d8edb4@linux.alibaba.com>
Date: Thu, 25 Jan 2024 15:50:51 +0800
From: Bitao Hu <yaoma@...ux.alibaba.com>
To: Doug Anderson <dianders@...omium.org>
Cc: akpm@...ux-foundation.org, pmladek@...e.com, tglx@...utronix.de,
maz@...nel.org, liusong@...ux.alibaba.com, linux-kernel@...r.kernel.org,
yaoma@...ux.alibaba.com
Subject: Re: [PATCH 2/3] watchdog/softlockup: report the most time-consuming
hardirq
On 2024/1/25 08:19, Doug Anderson wrote:
> Hi,
>
> On Tue, Jan 23, 2024 at 4:12 AM Bitao Hu <yaoma@...ux.alibaba.com> wrote:
>>
>> When the watchdog determines that the current soft lockup is due
>> to an interrupt storm based on CPU utilization, reporting the
>> top three most time-consuming hardirq can be quite useful for
>> further troubleshooting.
>> Below is an example of interrupt storm. The call tree does not
>> provide useful information, but we can analyze which interrupt
>> caused the soft lockup by using the time-consuming information
>> of hardirq.
>>
>> [ 67.714044] watchdog: BUG: soft lockup - CPU#9 stuck for 28s! [swapper/9:0]
>> [ 67.714548] CPU#9 Utilization every 4s during lockup:
>> [ 67.714549] #1: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
>> [ 67.714551] #2: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
>> [ 67.714553] #3: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
>> [ 67.714555] #4: 0.0% system, 0.0% softirq, 95.0% hardirq, 0.0% idle
>> [ 67.714556] #5: 0.0% system, 0.0% softirq, 90.0% hardirq, 0.0% idle
>> [ 67.714570] CPU#9 Detect HardIRQ Time exceeds 50% since 45s. Most time-consuming HardIRQs:
>> [ 67.714571] #1: 99.9% irq#7(IPI)
>> [ 67.714575] #2: 0.0% irq#10(arch_timer)
>> [ 67.714578] #3: 0.0% irq#2(IPI)
>> ...
>> [ 67.714654] Call trace:
>> [ 67.714656] __do_softirq+0xa8/0x364
>>
>> Signed-off-by: Bitao Hu <yaoma@...ux.alibaba.com>
>> ---
>> include/linux/irq.h | 9 ++
>> include/linux/irqdesc.h | 2 +
>> kernel/irq/irqdesc.c | 9 +-
>> kernel/watchdog.c | 186 ++++++++++++++++++++++++++++++++++++++++
>> 4 files changed, 205 insertions(+), 1 deletion(-)
>
> Just like on patch #1, I'd overall like to say that I'm very excited
> about this patch and think it'll help track down a lot of issues, so
> please continue moving forward with it!
Sure, I will continue moving forward with it.
>
> That being said, my gut says that this can be a lot simpler / lower
> overhead if we just use the existing IRQ counts that the framework is
> already keeping track of for us. We could look for the IRQs that had
> the most counts on the soft locked CPU over the period of time.
>
> It might not be quite as good/accurate compared to what you have here
> but it seems like it would be "good enough". Normally when there's an
> IRQ storm that's so bad that it's causing a softlockup then each IRQ
> isn't actually that slow but they're going off nonstop. It would catch
> that case perfectly.
>
> I guess maybe (?) the case it wouldn't catch so well would be if you
> had a loop that looked like:
>
> * CPU from "storming" device takes ~10 ms and finishes.
> * After the "storming" device finishes, we somehow manage to service a
> whole pile of normal interrupts from non-storming devices.
> * After the non-storming interrupts finished, then we went back to
> servicing the slow storming interrupt.
I think there are two issues with using IRQ counts in determining
interrupt storm:
1. we might not know the normal frequency for each interrupt, witch
making the IRQ counts inconvenient for comparison.
For instance, in your example, suppose the slow interrupt takes 10ms at
a frequency of 100Hz; the normal interrupt takes 10us at a frequency of
500Hz. If we use IRQ counts, we might incorrectly conclude that it is
the normal interrupt causing softlockup.
2. During an interrupt storm, the IRQ counts recorded by the kernel does
not necessarily increase.
The softlockup caused by an NVMe interrupt storm that I encountered is a
case in point. Around the time of the softlockup, the NVMe interrupt
frequency actually decreased from ~300 kHz to ~17 kHz. However, the
average time taken per NVMe interrupt increased from ~2us to ~60us.
This is because the nvme_irq handler is capable of polling and
processing multiple interrupts reported by the NVMe controller, which
are not captured in the kernel's interrupt statistics. The NVMe driver
handles interrupts frequency is ~660 kHz actually.
Using interrupt time as a criterion of interrupt storm can avoid the
above two points.
>
> I haven't dug into the bowels of the Linux IRQ handling well enough to
> know if that would be possible. However, even if it was possible it
> doesn't feel likely to happen. If those "normal" interrupts aren't
> storming then it seems unlikely we'd see more than one of each of them
> between the "storming" interrupts. Thus it feels like the "storming"
> interrupt would still have the highest (or tied for the highest) count
> on the locked up CPU. Maybe you could print the top 4 or 5 (instead of
> the top 3) and you'd be pretty certain to catch even this case?
>
> In any case, I'll wait before doing a more thorough review for now and
> hope the above sounds right to you.
>
> -Doug
Powered by blists - more mailing lists