Message-ID: <CAMOZA0K0NiuXCKA9zTwspmFFgbrB_Cq9Q9wa2tFjhq5aYk+S5A@mail.gmail.com>
Date: Sun, 21 Dec 2025 13:41:37 +0100
From: Luigi Rizzo <lrizzo@...gle.com>
To: Marc Zyngier <maz@...nel.org>
Cc: tglx@...utronix.de, bhelgaas@...gle.com, linux-kernel@...r.kernel.org
Subject: Re: [patch 1/2] irqchip/msi-lib: Honor the MSI_FLAG_PCI_MSI_MASK_PARENT flag
On Sun, Dec 21, 2025 at 12:55 PM Marc Zyngier <maz@...nel.org> wrote:
>
> On Sat, 20 Dec 2025 19:31:19 +0000,
> Luigi Rizzo <lrizzo@...gle.com> wrote:
> >
> > There are platforms (including some ARM SoCs) where the MSI-X
> > writes are a performance killer, because they are heavily
> > serialized at the PCIe root port.
> >
> > These platforms are the key motivation for Global Software
> > Interrupt Moderation (GSIM), which relies on actually masking
> > device interrupts so that the MSI-X writes are not generated:
> > https://lore.kernel.org/all/20251217112128.1401896-1-lrizzo@google.com/
> >
> > Overriding mask/unmask with irq_chip_mask_parent() makes software
> > moderation ineffective. GSIM works great on ARM platforms before
> > this patch, but stops working afterwards, e.g. on Linux 6.18.
>
> You do realise that "ARM platforms" means nothing at all, right? What
> you actually mean is "the ARM machines I have access to exhibit some
> platform-specific behaviour that may or may not be a general
> behaviour".
>
> Your particular circumstances are not in any way something you can
> generalise, unless you demonstrate this is caused by an architectural
> requirement rather than an implementation defect.
You are right, I should have been more precise and said "some ARM
machines I have access to". Note though that the problem addressed by
https://lore.kernel.org/all/20251217112128.1401896-1-lrizzo@google.com/
is not limited to one broken snowflake. It affects multiple SoC
families from several vendors (Intel, AMD, ARM), and it is not new at
all: back in 2020 Eric Dumazet and I developed napi_defer_hard_irqs to
address this very problem on a specific (x86) platform. And sure,
there are platforms that tolerate 30M intrs/s without breaking a
sweat.

Anyway, systems are what they are: some have suboptimal
implementations which make certain operations more expensive than
they could be. We can just say "tough luck" and write them off as
broken, or try to mitigate the problem, and I am exploring how to do
the latter without harming the common cases.
> > The round trip through the PCI endpoint for mask_irq(), caused by the
> > readback to make sure the PCI write has been sent, is almost always
> > (or really always) unnecessary. Masking is inherently racy; waiting
> > until the PCIe write has arrived at the device does not guarantee
> > that no interrupt has been generated in the meantime, so there is
> > really no benefit in the readback (which, for instance, can be
> > conditionally removed with code like the one below).
> >
> > I measured the cost of pci_irq_mask_msix() and it goes from 1000-1500ns
> > with the readl(), down to 40-50ns without it.
> >
> > Once we remove the costly readback, is there any remaining reason
> > to override [un]mask_irq() with irq_chip_[un]mask_parent()?
>
> So you are effectively not masking at all and just rely on hope
> instead. I have the utmost confidence in this sort of stuff. Totally.
I don't understand the above comment.
Masking happens as a result of the PCIe write,
which will eventually reach the device. The presence of the
readback does nothing to accelerate the landing of the write.
If the expectation was "after the readl() there are no more
interrupts", that is incorrect: an interrupt may have been generated
before the mask landed, still be in flight in the interrupt
controller, and fire after the readl() completes.
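
For illustration, here is a minimal sketch of the kind of change I
mean. This is NOT the actual drivers/pci/msi code: msix_skip_mask_flush
is an invented knob and msix_mask_vector() is just a placeholder helper
operating directly on an MSI-X table entry.

/*
 * Sketch only, not the actual drivers/pci/msi code: it illustrates
 * making the flushing readback after an MSI-X mask write optional.
 * "msix_skip_mask_flush" is an invented module parameter and the
 * helper operates on a raw pointer into the device's MSI-X table.
 */
#include <linux/io.h>
#include <linux/module.h>
#include <linux/pci_regs.h>
#include <linux/types.h>

static bool msix_skip_mask_flush;
module_param(msix_skip_mask_flush, bool, 0644);

/*
 * @entry:       ioremapped address of one 16-byte MSI-X table entry
 * @cached_ctrl: software-cached copy of the entry's Vector Control word
 */
static void msix_mask_vector(void __iomem *entry, u32 cached_ctrl)
{
	writel(cached_ctrl | PCI_MSIX_ENTRY_CTRL_MASKBIT,
	       entry + PCI_MSIX_ENTRY_VECTOR_CTRL);

	/*
	 * The readl() only flushes the posted write; it does not prevent
	 * an interrupt generated before the mask landed from firing
	 * later, so skip it when the extra latency matters.
	 */
	if (!msix_skip_mask_flush)
		readl(entry + PCI_MSIX_ENTRY_VECTOR_CTRL);
}
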
> What you're missing is that hitting the config space is causing pretty
> high overhead in KVM guests, where the accesses (write and read to the
> MSI masks) are trapped all the way to userspace (and back into VFIO),
> while the masking at the ITS level is much cheaper.
>
> Masking at the ITS level (and only there) also means that the VM can
> be migrated without having to worry about the PBA in each device,
> because the pending state is already part of the VM's memory, nicely
> tucked away in the RD tables.
Good point about the guest. Regardless, without actually masking the
PCIe interrupts, performance on certain platforms really collapses,
so one may have to choose the lesser evil.
My goal is to see if there is a way to optimize where it makes sense:
- first, how often do we actually call mask_irq() outside moderation?
  It seems to happen only when the interrupt migrates to a different
  CPU, and in handle_fasteoi_irq() for IRQS_ONESHOT.
- second, how expensive is it to do the PCI mask?
  You make a good point about the guest, but for the host case I
  still think we are paying the readback cost unnecessarily (see the
  rough timing sketch after this list).
- third, how expensive is it to NOT do the mask?
  For this I pointed to data from large servers of all kinds that
  suffer:
  https://lore.kernel.org/all/20251217112128.1401896-1-lrizzo@google.com/
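
For the second point, something along the lines of the rough sketch
below is enough to see the cost on a given machine. The wrapper name
is invented scaffolding; pci_msi_mask_irq()/pci_msi_unmask_irq() are
the existing helpers.

/*
 * Rough sketch: one way to time a device-level mask/unmask cycle.
 * Call it with the vector's irq_data while serialised against the
 * irq core (e.g. from a context holding the descriptor lock).
 */
#include <linux/irq.h>
#include <linux/ktime.h>
#include <linux/msi.h>
#include <linux/printk.h>
#include <linux/types.h>

static void time_msix_mask_cycle(struct irq_data *d)
{
	u64 t0 = ktime_get_ns();

	pci_msi_mask_irq(d);	/* set the per-vector mask bit (+ readback) */
	pci_msi_unmask_irq(d);	/* clear it again */

	pr_info("MSI-X mask+unmask: %llu ns\n", ktime_get_ns() - t0);
}
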
Given all the above, I think there is a case for making this
optimization (ignore the device mask, rely only on the interrupt
controller) configurable, at least behind a CONFIG option.
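
To make that concrete, the shape I have in mind is roughly the sketch
below. CONFIG_PCI_MSI_MASK_PARENT_ONLY is a made-up Kconfig symbol,
and the helper is only a stand-in for wherever the real code decides
to route mask/unmask to the parent domain, not the actual msi-lib
hunk.

/*
 * Sketch only: gate the "mask at the parent instead of the device"
 * override behind an invented CONFIG_PCI_MSI_MASK_PARENT_ONLY symbol.
 */
#include <linux/irq.h>
#include <linux/kconfig.h>
#include <linux/msi.h>
#include <linux/types.h>

static void maybe_mask_at_parent_only(struct irq_chip *chip,
				      u32 supported_flags)
{
	if (!(supported_flags & MSI_FLAG_PCI_MSI_MASK_PARENT))
		return;

	/*
	 * Only drop the device-level MSI-X mask when the user explicitly
	 * asked for it; otherwise keep masking at the device so that
	 * software interrupt moderation keeps working.
	 */
	if (!IS_ENABLED(CONFIG_PCI_MSI_MASK_PARENT_ONLY))
		return;

	chip->irq_mask = irq_chip_mask_parent;
	chip->irq_unmask = irq_chip_unmask_parent;
}

A boot-time parameter or a per-device opt-in would work just as well;
the point is only that the device-level mask must not silently go away
on platforms whose performance depends on it.
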
>
> Finally, it aligns PCI devices with non-PCI device behaviour,
> something that is highly desirable.
Sure, generally this is a good thing; except that what I was
pointing out is a peculiar PCI flaw on many platforms which,
unfortunate as it may be, requires a specific mitigation.
cheers
luigi
>
> For me, that totally beats your interrupt mitigation thing.
>
> Thanks,
>
> M.
>
> --
> Jazz isn't dead. It just smells funny.