[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9dcfe1fb-0361-4d13-a4d5-fbc274f68db7@nokia.com>
Date: Tue, 1 Jul 2025 08:10:26 +0200
From: Wladislav Wiebe <wladislav.wiebe@...ia.com>
To: Thomas Gleixner <tglx@...utronix.de>, anna-maria@...utronix.de,
frederic@...nel.org, mingo@...nel.org
Cc: akpm@...ux-foundation.org, bigeasy@...utronix.de, peterz@...radead.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] irq: add support for warning on long-running IRQ handlers
On 30/06/2025 17:42, Thomas Gleixner wrote:
> On Mon, Jun 30 2025 at 14:46, Wladislav Wiebe wrote:
>
> The subsystem prefix is 'genirq:' See
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject
>
>> Introduce a new option CONFIG_IRQ_LATENCY_WARN that enables warnings when
>> IRQ handlers take an unusually long time to execute.
>>
>> When triggered, the warning includes the CPU, IRQ number, handler address,
>> name, and execution duration, for example:
>>
>> [CPU0] latency on IRQ[787:bad_irq_handler+0x1/0x34 [bad_irq]], took: 5 jiffies (~50 ms)
>>
>> To keep runtime overhead minimal, this implementation uses a jiffies-based
>> timing mechanism. While coarse, it is sufficient to detect problematic IRQs.
> Define sufficient. That really depends on your use case. For a real-time
> system a hard interrupt handler running longer than a few microseconds
> can be problematic.
>
> So instead of adding some single purpose mechanism, can we please add
> something flexible which can be used for a wide range of scenarios.
the initial goal was to cover regular non-RT cores, as on isolated/tickless cores
we should not have device interrupts.
However, I agree we could make this more flexible for a wider range of use cases.
>
>> +#ifdef CONFIG_IRQ_LATENCY_WARN
>> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
>> + unsigned long jiffies_start)
> latency is the wrong term here. This is about the runtime, duration of
> the handler.
>
>> +{
>> + unsigned long delta = jiffies - jiffies_start;
>> +
>> + /*
>> + * Warn about long IRQ handler latency only if jiffies are reliable.
> What means jiffies are reliable?
there shall be a another CPU online with active ticks enabled
which updates the jiffies. Means, on single or tickless cores, this approach will not work.
>
>> + * The reporting condition hits only when there are at least two CPUs
>> + * with active ticks.
>> + * Jiffies updates are stalled on this CPU until MAX_STALLED_JIFFIES
>> + * reaches and a force update happens on another CPU with active ticks.
>> + */
>> + if (unlikely(delta >= (MAX_STALLED_JIFFIES + CONFIG_IRQ_LATENCY_WARN_THRESHOLD))) {
>> + pr_warn_ratelimited("[CPU%d] latency on IRQ[%u:%pS], took: %lu jiffies (~%u ms)\n",
>> + smp_processor_id(), irq, action->handler,
>> + delta, jiffies_to_msecs(delta));
>> + }
>> +}
>> +#else
>> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
>> + unsigned long jiffies_start) { }
>> +#endif
> I'm absolutely not fond of this #ifdeffery and yet more Kconfig
> knobs. Something like this should just work:
>
> DEFINE_STATIC_KEY_FALSE(handler_duration_check);
>
> if (static_branch_unlikely(&handler_duration_check))
> ts_start = local_clock();
> res = action->handler(irq, action->dev_id);
> if (static_branch_unlikely(&handler_duration_check))
> check_handler_duration(ts_start);
>
> Then have a command-line option which allows the user to set a warning
> threshold > 0 in microseconds. When the option is evaluated the
> threshold is stored in a __ro_after_init variable and the static branch
> is enabled.
all right - that makes perfectly sense as well. I will refactor it and provide v2.
Thank you for the comments.
- W.W
Powered by blists - more mailing lists