linux-kernel - Re: [PATCH] irq: add support for warning on long-running IRQ handlers

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <875xgdnsmr.ffs@tglx>
Date: Mon, 30 Jun 2025 17:42:04 +0200
From: Thomas Gleixner <tglx@...utronix.de>
To: Wladislav Wiebe <wladislav.wiebe@...ia.com>, anna-maria@...utronix.de,
 frederic@...nel.org, mingo@...nel.org
Cc: akpm@...ux-foundation.org, bigeasy@...utronix.de, peterz@...radead.org,
 linux-kernel@...r.kernel.org, wladislav.wiebe@...ia.com
Subject: Re: [PATCH] irq: add support for warning on long-running IRQ handlers

On Mon, Jun 30 2025 at 14:46, Wladislav Wiebe wrote:

The subsystem prefix is 'genirq:' See

https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#patch-subject

> Introduce a new option CONFIG_IRQ_LATENCY_WARN that enables warnings when
> IRQ handlers take an unusually long time to execute.
>
> When triggered, the warning includes the CPU, IRQ number, handler address,
> name, and execution duration, for example:
>
>   [CPU0] latency on IRQ[787:bad_irq_handler+0x1/0x34 [bad_irq]], took: 5 jiffies (~50 ms)
>
> To keep runtime overhead minimal, this implementation uses a jiffies-based
> timing mechanism. While coarse, it is sufficient to detect problematic IRQs.

Define sufficient. That really depends on your use case. For a real-time
system a hard interrupt handler running longer than a few microseconds
can be problematic.

So instead of adding some single purpose mechanism, can we please add
something flexible which can be used for a wide range of scenarios.

> +#ifdef CONFIG_IRQ_LATENCY_WARN
> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
> +				       unsigned long jiffies_start)

latency is the wrong term here. This is about the runtime, duration of
the handler.

> +{
> +	unsigned long delta = jiffies - jiffies_start;
> +
> +	/*
> +	 * Warn about long IRQ handler latency only if jiffies are reliable.

What means jiffies are reliable?

> +	 * The reporting condition hits only when there are at least two CPUs
> +	 * with active ticks.
> +	 * Jiffies updates are stalled on this CPU until MAX_STALLED_JIFFIES
> +	 * reaches and a force update happens on another CPU with active ticks.
> +	 */
> +	if (unlikely(delta >= (MAX_STALLED_JIFFIES + CONFIG_IRQ_LATENCY_WARN_THRESHOLD))) {
> +		pr_warn_ratelimited("[CPU%d] latency on IRQ[%u:%pS], took: %lu jiffies (~%u ms)\n",
> +				    smp_processor_id(), irq, action->handler,
> +				    delta, jiffies_to_msecs(delta));
> +	}
> +}
> +#else
> +static inline void warn_on_irq_latency(struct irqaction *action, unsigned int irq,
> +				       unsigned long jiffies_start) { }
> +#endif

I'm absolutely not fond of this #ifdeffery and yet more Kconfig
knobs. Something like this should just work:

DEFINE_STATIC_KEY_FALSE(handler_duration_check);

        if (static_branch_unlikely(&handler_duration_check))
		ts_start = local_clock();
	res = action->handler(irq, action->dev_id);
        if (static_branch_unlikely(&handler_duration_check))
		check_handler_duration(ts_start);

Then have a command-line option which allows the user to set a warning
threshold > 0 in microseconds. When the option is evaluated the
threshold is stored in a __ro_after_init variable and the static branch
is enabled.

Hmm?

Thanks,

        tglx