linux-kernel - Re: [PATCH] irq: add support for warning on long-running IRQ handlers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <9dcfe1fb-0361-4d13-a4d5-fbc274f68db7@nokia.com>
Date: Tue, 1 Jul 2025 08:10:26 +0200
From: Wladislav Wiebe <wladislav.wiebe@...ia.com>
To: Thomas Gleixner <tglx@...utronix.de>, anna-maria@...utronix.de,
 frederic@...nel.org, mingo@...nel.org
Cc: akpm@...ux-foundation.org, bigeasy@...utronix.de, peterz@...radead.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH] irq: add support for warning on long-running IRQ handlers


On 30/06/2025 17:42, Thomas Gleixner wrote:
> On Mon, Jun 30 2025 at 14:46, Wladislav Wiebe wrote:
>
> The subsystem prefix is 'genirq:' See
>
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject
>
>> Introduce a new option CONFIG_IRQ_LATENCY_WARN that enables warnings when
>> IRQ handlers take an unusually long time to execute.
>>
>> When triggered, the warning includes the CPU, IRQ number, handler address,
>> name, and execution duration, for example:
>>
>>   [CPU0] latency on IRQ[787:bad_irq_handler+0x1/0x34 [bad_irq]], took: 5 jiffies (~50 ms)
>>
>> To keep runtime overhead minimal, this implementation uses a jiffies-based
>> timing mechanism. While coarse, it is sufficient to detect problematic IRQs.
> Define sufficient. That really depends on your use case. For a real-time
> system a hard interrupt handler running longer than a few microseconds
> can be problematic.
>
> So instead of adding some single purpose mechanism, can we please add
> something flexible which can be used for a wide range of scenarios.

the initial goal was to cover regular non-RT cores, as on isolated/tickless cores
we should not have device interrupts.
However, I agree we could make this more flexible for a wider range of use cases.

>
>> +#ifdef CONFIG_IRQ_LATENCY_WARN
>> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
>> +                                    unsigned long jiffies_start)
> latency is the wrong term here. This is about the runtime, duration of
> the handler.
>
>> +{
>> +     unsigned long delta = jiffies - jiffies_start;
>> +
>> +     /*
>> +      * Warn about long IRQ handler latency only if jiffies are reliable.
> What means jiffies are reliable?

there shall be a another CPU online with active ticks enabled
which updates the jiffies. Means, on single or tickless cores, this approach will not work.

>
>> +      * The reporting condition hits only when there are at least two CPUs
>> +      * with active ticks.
>> +      * Jiffies updates are stalled on this CPU until MAX_STALLED_JIFFIES
>> +      * reaches and a force update happens on another CPU with active ticks.
>> +      */
>> +     if (unlikely(delta >= (MAX_STALLED_JIFFIES + CONFIG_IRQ_LATENCY_WARN_THRESHOLD))) {
>> +             pr_warn_ratelimited("[CPU%d] latency on IRQ[%u:%pS], took: %lu jiffies (~%u ms)\n",
>> +                                 smp_processor_id(), irq, action->handler,
>> +                                 delta, jiffies_to_msecs(delta));
>> +     }
>> +}
>> +#else
>> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
>> +                                    unsigned long jiffies_start) { }
>> +#endif
> I'm absolutely not fond of this #ifdeffery and yet more Kconfig
> knobs. Something like this should just work:
>
> DEFINE_STATIC_KEY_FALSE(handler_duration_check);
>
>         if (static_branch_unlikely(&handler_duration_check))
>                 ts_start = local_clock();
>         res = action->handler(irq, action->dev_id);
>         if (static_branch_unlikely(&handler_duration_check))
>                 check_handler_duration(ts_start);
>
> Then have a command-line option which allows the user to set a warning
> threshold > 0 in microseconds. When the option is evaluated the
> threshold is stored in a __ro_after_init variable and the static branch
> is enabled.

all right - that makes perfectly sense as well. I will refactor it and provide v2.
Thank you for the comments.

- W.W