Message-ID: <CADUfDZpUFmBCJPX+u3GYeyFUbQ3RgqevvCpL=ZE48E4_p_BpPA@mail.gmail.com>
Date: Fri, 13 Dec 2024 10:18:30 -0800
From: Caleb Sander <csander@...estorage.com>
To: David Miller <davem@...emloft.net>, Tom Herbert <therbert@...gle.com>, 
	Thomas Gleixner <tglx@...utronix.de>, Eli Cohen <elic@...dia.com>, 
	Ben Hutchings <ben@...adent.org.uk>, Jakub Kicinski <kuba@...nel.org>, 
	Eric Dumazet <edumazet@...gle.com>, Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, 
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: cpu_rmap maps CPUs to wrong interrupts after reprogramming affinities

Hi netdev,
While testing ARFS, we found set_rps_cpu() calling
ndo_rx_flow_steer() with an RX queue that was not affinitized to the
desired CPU. The issue occurred only after modifying interrupt
affinities. It appears to be a bug in cpu_rmap: cpu_rmap_update()
can leave CPUs mapped to interrupts that are no longer the ones most
closely affinitized to them.

Here is the simplest scenario:
1. A network device has 2 IRQs, 1 and 2. Initially only CPU A is
available to process the network device. So both IRQs 1 and 2 are
affinitized to CPU A.
rx_cpu_rmap maps CPU A to IRQ 2 (assuming the affinity of IRQ 2 was
set after IRQ 1's)
2. CPU B becomes available to process the network device. So IRQ 2's
affinity is changed from CPU A to CPU B.
cpu_rmap_update() is called for IRQ 2 with its new affinity (CPU B).
It maps CPU B to IRQ 2. CPU A remains mapped to IRQ 2, though with a
higher distance.
rx_cpu_rmap now maps both CPUs A and B to IRQ 2. Any traffic meant to
be steered to CPU A will instead be processed by IRQ 2 on CPU B, even
though there is still an IRQ (1) affinitized to CPU A. (A small model
reproducing this sequence is included below.)
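
To make the sequence concrete, here is a small standalone userspace
program that models the nearest-IRQ table the way I understand
lib/cpu_rmap.c to maintain it: invalidate the CPUs that pointed at the
updated IRQ, record distance 0 for the CPUs in the new affinity mask,
then let invalidated CPUs copy a neighbour's nearest entry. It ignores
topology (all CPUs are treated as distance-1 neighbours) and is only
meant to reproduce the stale mapping, not to mirror the kernel code
exactly:

/* Simplified userspace model of the cpu_rmap nearest-IRQ table.
 * Not the kernel code itself; it only mimics the invalidate-then-
 * copy-from-neighbours update closely enough to show the stale
 * mapping described above. */
#include <stdio.h>

#define NR_CPUS  2		/* CPU A = 0, CPU B = 1 */
#define DIST_INF 0xffff

static struct { int irq; int dist; } nearest[NR_CPUS];

static void rmap_update(int irq, const int affinity[NR_CPUS])
{
	int cpu, neigh;

	/* 1. Invalidate every CPU whose recorded nearest IRQ is the one
	 *    being updated. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (nearest[cpu].irq == irq)
			nearest[cpu].dist = DIST_INF;

	/* 2. Record distance 0 for the CPUs in the new affinity mask. */
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		if (affinity[cpu]) {
			nearest[cpu].irq = irq;
			nearest[cpu].dist = 0;
		}

	/* 3. Invalidated CPUs copy a neighbour's nearest IRQ at distance 1.
	 *    Only other CPUs' nearest entries are consulted; IRQs still
	 *    affinitized to this CPU are never reconsidered. */
	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (nearest[cpu].dist != DIST_INF)
			continue;
		for (neigh = 0; neigh < NR_CPUS; neigh++)
			if (nearest[neigh].dist <= 1) {
				nearest[cpu].irq = nearest[neigh].irq;
				nearest[cpu].dist = 1;
				break;
			}
	}
}

static void dump(const char *when)
{
	printf("%s: CPU A -> IRQ %d, CPU B -> IRQ %d\n",
	       when, nearest[0].irq + 1, nearest[1].irq + 1);
}

int main(void)
{
	const int cpu_a_only[NR_CPUS] = { 1, 0 };
	const int cpu_b_only[NR_CPUS] = { 0, 1 };
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		nearest[cpu].irq = -1;		/* IRQ n has index n - 1 */
		nearest[cpu].dist = DIST_INF;
	}

	/* Step 1: both IRQs affinitized to CPU A, IRQ 2 set last. */
	rmap_update(0, cpu_a_only);	/* IRQ 1 -> CPU A */
	rmap_update(1, cpu_a_only);	/* IRQ 2 -> CPU A */
	dump("after step 1");		/* CPU A -> IRQ 2 */

	/* Step 2: IRQ 2's affinity moves from CPU A to CPU B. */
	rmap_update(1, cpu_b_only);
	dump("after step 2");		/* both CPUs -> IRQ 2; CPU A is stale */

	return 0;
}

Running it prints "after step 2: CPU A -> IRQ 2, CPU B -> IRQ 2",
i.e. CPU A is left pointing at IRQ 2 even though IRQ 1 is still
affinitized to it.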

If IRQ 1 had been affinitized to CPU A and IRQ 2 to CPU B initially,
the cpu_rmap would have correctly mapped CPU A to IRQ 1 and CPU B to
IRQ 2. So the state of the cpu_rmap depends on the history of the IRQ
affinities, not just the current IRQ affinities.

This behavior was surprising to me, but perhaps it's working as
intended. It seems to be a limitation of struct cpu_rmap: it stores
only the single IRQ with the lowest "distance" for each CPU, even if
other IRQs have an equal or greater distance. When an IRQ's affinity
changes, each CPU whose recorded closest IRQ is that one has its
distance invalidated, but its new closest IRQ is then chosen only by
copying neighboring CPUs' closest IRQs, ignoring other IRQs that may
still be affinitized to that CPU.
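
For reference, the relevant part of struct cpu_rmap (abbreviated from
include/linux/cpu_rmap.h and quoted from memory, so the surrounding
fields may be slightly off) is a single (index, dist) slot per CPU,
with no record of runner-up IRQs to fall back on:

struct cpu_rmap {
	struct kref	refcount;
	u16		size;
	void		**obj;		/* the rmap'd objects (IRQ glue) */
	struct {
		u16	index;		/* index of the nearest object (IRQ) */
		u16	dist;		/* distance to it; 0 = direct affinity */
	}		near[];		/* one entry per possible CPU */
};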

I can see a few possible ways to address this:
- Store the current affinity masks for all the IRQs in struct cpu_rmap
so the next closest IRQ can be computed when a CPU's closest IRQ is
invalidated. This would significantly increase the size of struct
cpu_rmap.
- Store all candidate IRQs and their distances for each CPU in struct
cpu_rmap so the next closest IRQ can be computed when a CPU's closest
IRQ is invalidated. Again, this would significantly increase the size
of struct cpu_rmap.
- Re-fetch the affinity masks of all the IRQs from the irq layer
whenever one IRQ's affinity changes so the next closest IRQ can be
computed for each invalidated CPU. This would avoid using any
additional memory, but would add a lot of calls into the irq layer.
- Work around the cpu_rmap behavior by having userspace always
rewrite all IRQs' affinity masks whenever the affinity of any one of
them changes (a rough sketch follows below). This is probably the
simplest solution, but I worry that other userspace applications
would hit the same unexpected behavior.
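
As an illustration of that last workaround, here is a rough sketch
(the IRQ numbers and cpulists are made up): whenever one IRQ's
placement changes, rewrite every IRQ in the device's set via
/proc/irq/<N>/smp_affinity_list, so that cpu_rmap_update() runs for
all of them against a consistent view of the current affinities:

/* Sketch of the userspace workaround: rewrite *all* of the device's IRQ
 * affinities, not just the one that changed. IRQ numbers and cpulists
 * below are purely illustrative. */
#include <stdio.h>
#include <stdlib.h>

/* Write a cpulist (e.g. "0" or "0-3") to /proc/irq/<irq>/smp_affinity_list. */
static int set_irq_affinity(int irq, const char *cpulist)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity_list", irq);
	f = fopen(path, "w");
	if (!f) {
		perror(path);
		return -1;
	}
	if (fprintf(f, "%s\n", cpulist) < 0) {
		perror(path);
		fclose(f);
		return -1;
	}
	if (fclose(f) == EOF) {
		perror(path);
		return -1;
	}
	return 0;
}

int main(void)
{
	/* Hypothetical device with two IRQs: only IRQ 2's placement actually
	 * changes (CPU A -> CPU B), but both affinities are rewritten so the
	 * cpu_rmap is rebuilt from the full current picture. */
	static const struct { int irq; const char *cpus; } irqs[] = {
		{ 1, "0" },	/* IRQ 1 stays on CPU A (cpu 0) */
		{ 2, "1" },	/* IRQ 2 moves to CPU B (cpu 1) */
	};

	for (size_t i = 0; i < sizeof(irqs) / sizeof(irqs[0]); i++)
		if (set_irq_affinity(irqs[i].irq, irqs[i].cpus))
			return EXIT_FAILURE;
	return EXIT_SUCCESS;
}

This only papers over the behavior, though, and every tool that
manages IRQ affinities would need to do the same.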

Let me know whether you see this behavior as a bug in cpu_rmap or
something that userspace should work around. If you do think it's a
cpu_rmap bug, how would you like to fix it?

Thanks,
Caleb
