Message-ID: <3f9f2114-531f-4fd9-92a7-24c2a311938e@google.com>
Date: Fri, 4 Jul 2025 23:36:50 +0800
From: Liangyan <liangyan.peng@...edance.com>
To: Thomas Gleixner <tglx@...utronix.de>,
Liangyan <liangyan.peng@...edance.com>
Cc: linux-kernel@...r.kernel.org, Yicong Shen
<shenyicong.1023@...edance.com>, ziqianlu@...edance.com,
songmuchun@...edance.com, yuanzhu@...edance.com
Subject: Re: [External] Re: [RFC] genirq: Fix lockup in handle_edge_irq
On 2025/7/4 22:42, Thomas Gleixner wrote:
> Liangyan!
>
> Please don't top post and trim your reply. See:
>
> https://people.kernel.org/tglx/notes-about-netiquette
>
> for further explanation.

Got it, thanks for the guidance, Thomas!
>
> On Thu, Jul 03 2025 at 23:31, Liangyan wrote:
>> We hit this softlockup issue in a guest VM. The affected IRQ is the
>> virtio-net TX queue interrupt, the interrupt controller is the
>> virtual PCI MSI-X controller, and the related components are
>> pci_msi_controller, virtio_pci, virtio_net and QEMU.
>
> That's a random list of pieces, which are not necessarily related to the
> interrupt control flow. You have to look at the actual interrupt domain
> hierarchy of the interrupt in question. /sys/kernel/debug/irq/irqs/$N.
>
>> And according to the QEMU msix.c source code, when the irq is
>> unmasked, it fires a new interrupt if the MSI-X pending bit is set.
>> So it seems that the MSI-X controller does not lose interrupts
>> across the mask/unmask window.
>
> That's correct and behaving according to specification. Though
> unfortunately not all PCI-MSI-X implementations are specification
> compliant, so we can't do that unconditionally. There is also no way to
> detect whether there is a sane implementation in the hardware
> [emulation] or not.
>
> So playing games with the unmask is not really feasible. But let's take
> a step back and look at the actual problem.
>
> It only happens when the interrupt affinity is moved or the interrupt
> has multiple target CPUs enabled in the effective affinity mask. x86 and
> arm64 enforce the effective affinity to be a single CPU, so on those
> architectures the problem only arises when the interrupt affinity
> changes.
>
> Now we can use that fact and check whether the CPU, which observes
> INPROGRESS, is the target CPU in the effective affinity mask. If so,
> then the obvious cure is to busy poll the INPROGRESS flag instead of
> doing the mask()/PENDING/unmask() dance.
>
> Something like the uncompiled and therefore untested patch below should
> do the trick. If you find bugs in it, you can keep and fix them :)
Great, thanks for the patch. I will test it and report back.
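
For anyone else following the thread, my reading of the suggested
approach, as a rough uncompiled sketch (not the actual patch; the
helper name and its placement in kernel/irq/chip.c are my own
assumptions), is something like:

/*
 * Illustrative sketch only. Idea: if the CPU that observes
 * IRQD_IRQ_INPROGRESS is the target CPU in the effective affinity
 * mask, busy poll until the old target CPU has finished handling the
 * interrupt, instead of doing the mask()/PENDING/unmask() dance,
 * which can livelock when the device re-fires immediately on unmask.
 */
static bool irq_wait_for_inprogress(struct irq_desc *desc)
{
	/* Only the target CPU of the effective affinity may poll. */
	if (!cpumask_test_cpu(smp_processor_id(),
			      irq_data_get_effective_affinity_mask(&desc->irq_data)))
		return false;

	/* Drop desc->lock while polling so the handling CPU can make progress. */
	while (irqd_irq_inprogress(&desc->irq_data)) {
		raw_spin_unlock(&desc->lock);
		cpu_relax();
		raw_spin_lock(&desc->lock);
	}
	return true;
}

handle_edge_irq() would then fall back to the mask_ack_irq() and
IRQS_PENDING handling only when the helper returns false, i.e. on a
CPU which is not the interrupt's target.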
Regards,
Liangyan