Message-ID: <6hb5vyl5xxsxfcwk4v3xpq277wusj5jq4tubdpjocpjc5smj3w@wx574kluhedj>
Date: Tue, 14 Oct 2025 12:50:18 +0200
From: Thierry Reding <thierry.reding@...il.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>, linux-tegra@...r.kernel.org,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org
Subject: Re: IRQ thread timeouts and affinity
On Sat, Oct 11, 2025 at 11:00:11AM +0100, Marc Zyngier wrote:
> On Fri, 10 Oct 2025 16:03:01 +0100,
> Thierry Reding <thierry.reding@...il.com> wrote:
> >
> > On Fri, Oct 10, 2025 at 03:18:13PM +0100, Marc Zyngier wrote:
> > >
> > > CPU hotplug is the main area of concern, and I'm pretty sure it breaks
> > > this distribution mechanism (or the other way around). Another thing
> > > is that if firmware isn't aware that 1:N interrupts can (or should)
> > > wake-up a CPU from sleep, bad things will happen. Given that nobody
> > > uses 1:N, you can bet that any bit of privileged SW (TF-A,
> > > hypervisors) is likely to be buggy (I've already spotted bugs in KVM
> > > around this).
> >
> > Okay, I can find out if CPU hotplug is a common use-case on these
> > devices, or if we can run some tests with that.
>
> It's not so much whether CPU hotplug is of any use to your particular
> box, but whether this has any detrimental impact on *any* machine
> doing CPU hotplug.
>
> To be clear, this stuff doesn't go in if something breaks, no matter
> how small.
Of course. I do want to find a way to move forward with this, so I'm
trying to find ways to check what impact this would have in conjunction
with CPU hotplug.
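Roughly what I had in mind is a loop that repeatedly offlines and onlines
the secondary CPUs while logging /proc/interrupts, so that stalled
counters or lost interrupts show up afterwards. Untested sketch (the CPU
range just matches the 8-CPU board below):

  for i in $(seq 1 100); do
          for cpu in /sys/devices/system/cpu/cpu[1-7]; do
                  echo 0 > $cpu/online
                  sleep 1
                  echo 1 > $cpu/online
                  sleep 1
          done
          cat /proc/interrupts >> /tmp/interrupts.log
  done
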
I've done some minimal testing on a Tegra264 device that has fewer CPUs.
With your patch applied, I see that most interrupts are nicely
distributed across CPUs. I'm going to use the serial interrupt as an
example since it reliably triggers while I'm testing on the system.
Here's an extract after boot:
# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 25:         42         44         41         29         37         36         39         36  GICv3 547 Level    c4e0000.serial
I then took CPU 1 offline:
# echo 0 > /sys/devices/system/cpu/cpu1/online
After that the GIC appears to automatically revert to targeting only the
first CPU, because after a little while:
# cat /proc/interrupts
           CPU0       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
 25:        186         66         52         64         58         67         62  GICv3 547 Level    c4e0000.serial
The interrupt counts for CPUs 2-7 no longer increment after taking CPU 1
offline. Interestingly, bringing CPU 1 back online has no effect, so the
interrupt doesn't go back to 1:N distribution.
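For reference, the way I checked that the counters stay static is to grab
the IRQ 25 row twice with a delay in between and compare:

  # grep ' 25:' /proc/interrupts ; sleep 10 ; grep ' 25:' /proc/interrupts
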
Nothing seemed to break. Obviously this doesn't say anything about
performance yet, but at least things don't crash and burn.
Anything else you think I could test? Do we have a way of restoring
1:N when all CPUs are back online?
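The only userspace knob I could think of is writing the full mask back to
the interrupt once all CPUs are online again, something like:

  # echo 1 > /sys/devices/system/cpu/cpu1/online
  # echo ff > /proc/irq/25/smp_affinity

but I haven't tried that yet, and I'm not sure whether the affinity
setting path would actually re-select 1:N mode rather than just pick a
single target CPU.
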
Thierry