Message-ID: <87jyxqiuta.ffs@tglx>
Date: Fri, 09 Jan 2026 18:01:53 +0100
From: Thomas Gleixner <tglx@...nel.org>
To: Luigi Rizzo <lrizzo@...gle.com>
Cc: Marc Zyngier <maz@...nel.org>, bhelgaas@...gle.com,
linux-kernel@...r.kernel.org
Subject: Re: [patch 1/2] irqchip/msi-lib: Honor the
MSI_FLAG_PCI_MSI_MASK_PARENT flag
On Fri, Jan 09 2026 at 14:00, Luigi Rizzo wrote:
> On Fri, Jan 9, 2026 at 1:20 PM Thomas Gleixner <tglx@...nel.org> wrote:
>> That's not the point. The point is that _after_ the mask has reached the
>> device it is guaranteed that no further interrupts are generated until
>> unmask() is invoked. That's what is required e.g. for writing the MSI
>> message to the device.
Actually I have to correct myself. We stopped doing the readback after
the mask write in pci_write_msg_msix() because the MMIO writes are
guaranteed to be ordered, so the device will see them in the correct
order:
PCI_MSIX_ENTRY_VECTOR_CTRL // mask
PCI_MSIX_ENTRY_LOWER_ADDR // MSI message address low
PCI_MSIX_ENTRY_UPPER_ADDR // MSI message address high
PCI_MSIX_ENTRY_DATA // MSI message data
PCI_MSIX_ENTRY_VECTOR_CTRL // unmask
which in turn satisfies the specification requirement that the MSI-X
entry has to be masked in order to change the message:
"If software changes the Address or Data value of an entry while the
entry is unmasked, the result is undefined."
So yes, for the affinity change case the readback is not required and
not done anymore. The readback was there in the initial implementation
and it turned out to be there purely for paranoia's sake. Masking is
still required, but that's achieved through the write ordering alone.
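Condensed, the relevant part of pci_write_msg_msix() looks roughly like
this (simplified sketch; the real function also skips virtual entries
and only toggles the mask when the entry was actually unmasked):

	void __iomem *base = pci_msix_desc_addr(desc);
	u32 ctrl = desc->pci.msix_ctrl;

	/* All posted MMIO writes to the same device: seen strictly in order */
	pci_msix_write_vector_ctrl(desc, ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT);
	writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
	writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
	writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
	pci_msix_write_vector_ctrl(desc, ctrl);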
> I keep hearing about this guarantee, which is surely true, but I argue that
> it is useless both during normal activity (when interrupts may be
> occasionally masked during affinity changes) and at shutdown
> (when there is a lot more additional state cleanup done on
> the device, which ensures that there are no more interrupts,
> save the one(s) happening between the mask() write and readback,
> which we have no way to block anyway (and which may fire on the CPU at
> some unspecified time, even after the readback, because that does not
> control the pipeline from the device to the interrupt controller etc.),
> and which the kernel stack is well equipped to handle and ignore).
There is a difference and it has nothing to do with other device
cleanups at all. That's a pure interrupt management problem:
free_irq()
1) mask()
2) synchronize_irq()
3) invalidate vector
So if #1 does not guarantee that the interrupt is masked, then #2 is not
guaranteed to see a pending interrupt, and #3 can invalidate the vector
before the interrupt becomes pending and is raised on a CPU, which then
results in a spurious interrupt warning or, in the worst case, an IOMMU
exception. While that's "handled" by the kernel somewhat gracefully, it's
not simply ignored because it's invalid state.
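Spelled out as code (illustrative fragment, function name made up for
illustration; the real sequence lives in free_irq() in
kernel/irq/manage.c):

	/* Illustrative only: condensed free_irq() ordering */
	static void teardown_race_sketch(unsigned int irq, struct irq_data *data,
					 struct irq_chip *chip)
	{
		chip->irq_mask(data);	/* #1: posted MMIO write, may still be in flight */
		synchronize_irq(irq);	/* #2: only sees interrupts which already arrived */
		/*
		 * #3: vector invalidated after this point. A message raised
		 * before the mask took effect, but not yet arrived, now
		 * targets a dead vector: spurious interrupt warning or
		 * IOMMU exception.
		 */
	}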
The readback after mask() guarantees that an interrupt which was raised
_before_ the mask write reached the device has arrived at its
destination and is therefore detectable as pending or in progress
_before_ the vector is invalidated. That's simply because the interrupt
message precedes the read completion and the CPU is stalled until the
read comes back.
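That's the guarantee the existing pci_msix_mask() relies on (annotated
here for illustration; the ordering comments are not in the actual code):

	static inline void pci_msix_mask(struct msi_desc *desc)
	{
		desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
		pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
		/*
		 * Non-posted read: the completion cannot overtake an MSI
		 * message which the device sent before observing the mask,
		 * and the CPU stalls until the completion arrives. Any
		 * interrupt raised before the mask is therefore visible as
		 * pending afterwards.
		 */
		readl(desc->pci.mask_base);
	}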
> The only, weak, justification that Gemini gives for masking is that
Groan....
> Again, I have seen no explanation so far on why the mask requires a
> guarantee, and as I have tried to explain, the unavoidable stray
> interrupts generated before the readback are as bad (or actually as
> harmless) as the ones that we avoid with the readback.
Stray interrupts are only harmless when the interrupt is in a functional
state, but _not_ when the required resources are no longer available.
So instead of this unholy per-CPU state and penalizing the VIRT case,
which Marc is concerned about, something like the uncompiled sketch below
should just work. Then you can use these new callbacks for your mitigation
muck and everything else is unaffected.
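E.g. the mitigation side could then do something like this (hypothetical
helper, name made up for illustration, not part of the patch below):

	static void irq_mask_at_device(struct irq_data *data)
	{
		struct irq_chip *chip = irq_data_get_irq_chip(data);

		/* Prefer the cheap device level mask when the chip provides it */
		if (chip->irq_mask_dev)
			chip->irq_mask_dev(data);
		else
			chip->irq_mask(data);
	}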
This mitigation stuff will have serious implications when running in a
VM though, because the MMIO write will be intercepted. But that's a
different story.
What might be worth investigating is whether the irq_mask() callback
for MSI-X can use the lazy mechanism (without readback). But that's an
orthogonal issue and needs some thought and testing.
Thanks,
tglx
---
--- a/drivers/pci/msi/irqdomain.c
+++ b/drivers/pci/msi/irqdomain.c
@@ -162,6 +162,11 @@ static void pci_irq_mask_msix(struct irq
pci_msix_mask(irq_data_get_msi_desc(data));
}

+static void pci_irq_mask_msix_lazy(struct irq_data *data)
+{
+ pci_msix_mask_lazy(irq_data_get_msi_desc(data));
+}
+
static void pci_irq_unmask_msix(struct irq_data *data)
{
pci_msix_unmask(irq_data_get_msi_desc(data));
@@ -182,7 +187,9 @@ static const struct msi_domain_template
.irq_startup = pci_irq_startup_msix,
.irq_shutdown = pci_irq_shutdown_msix,
.irq_mask = pci_irq_mask_msix,
+ .irq_mask_dev = pci_irq_mask_msix_lazy,
.irq_unmask = pci_irq_unmask_msix,
+ .irq_unmask_dev = pci_irq_unmask_msix,
.irq_write_msi_msg = pci_msi_domain_write_msg,
.flags = IRQCHIP_ONESHOT_SAFE,
},
--- a/drivers/pci/msi/msi.h
+++ b/drivers/pci/msi/msi.h
@@ -40,10 +40,15 @@ static inline void pci_msix_write_vector
writel(ctrl, desc_addr + PCI_MSIX_ENTRY_VECTOR_CTRL);
}

-static inline void pci_msix_mask(struct msi_desc *desc)
+static inline void pci_msix_mask_lazy(struct msi_desc *desc)
{
desc->pci.msix_ctrl |= PCI_MSIX_ENTRY_CTRL_MASKBIT;
pci_msix_write_vector_ctrl(desc, desc->pci.msix_ctrl);
+}
+
+static inline void pci_msix_mask(struct msi_desc *desc)
+{
+ pci_msix_mask_lazy(desc);
/* Flush write to device */
readl(desc->pci.mask_base);
}
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -451,7 +451,11 @@ static inline irq_hw_number_t irqd_to_hw
* @irq_ack: start of a new interrupt
* @irq_mask: mask an interrupt source
* @irq_mask_ack: ack and mask an interrupt source
+ * @irq_mask_dev: Optional callback to mask at the device level for
+ * interrupt moderation purposes. Only valid for the outermost
+ * interrupt chip in a hierarchy.
* @irq_unmask: unmask an interrupt source
+ * @irq_unmask_dev: Counterpart to @irq_mask_dev (required when @irq_mask_dev is not NULL)
* @irq_eoi: end of interrupt
* @irq_set_affinity: Set the CPU affinity on SMP machines. If the force
* argument is true, it tells the driver to
@@ -499,7 +503,9 @@ struct irq_chip {
void (*irq_ack)(struct irq_data *data);
void (*irq_mask)(struct irq_data *data);
void (*irq_mask_ack)(struct irq_data *data);
+ void (*irq_mask_dev)(struct irq_data *data);
void (*irq_unmask)(struct irq_data *data);
+ void (*irq_unmask_dev)(struct irq_data *data);
void (*irq_eoi)(struct irq_data *data);
int (*irq_set_affinity)(struct irq_data *data, const struct cpumask *dest, bool force);
int (*irq_retrigger)(struct irq_data *data);