linux-kernel - Re: [External] Re: [RFC] genirq: Fix lockup in handle_edge

Message-ID: <f3608ef2-1d9f-406c-92f3-fa69486e1644@google.com>
Date: Thu, 3 Jul 2025 23:31:23 +0800
From: Liangyan <liangyan.peng@...edance.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: linux-kernel@...r.kernel.org, Yicong Shen
 <shenyicong.1023@...edance.com>, ziqianlu@...edance.com,
 songmuchun@...edance.com, yuanzhu@...edance.com
Subject: Re: [External] Re: [RFC] genirq: Fix lockup in handle_edge_irq

Hello Thomas,

We hit this soft lockup in a guest VM. The IRQ involved belongs to a
virtio-net TX queue, and the interrupt controller is the virtual PCI
MSI-X controller. The components in the delivery chain are
pci_msi_controller, virtio_pci, virtio_net and QEMU.

According to QEMU's msix.c, when a vector is unmasked, QEMU fires a new
interrupt if the MSI-X pending bit for that vector is set.
So it seems that this MSI-X controller does not lose an interrupt that
arrives while the vector is masked.
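
A minimal C sketch of that pending-bit behaviour, as I read it. This is
a simplified model and not QEMU's actual code; the struct and all
function names are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one MSI-X vector; not QEMU's data structure. */
struct msix_vector_model {
	bool masked;	/* per-vector mask bit */
	bool pending;	/* pending bit, latched while the vector is masked */
	int delivered;	/* interrupts actually delivered to the guest */
};

/* Device signals the vector: latch if masked, deliver otherwise. */
void model_notify(struct msix_vector_model *v)
{
	if (v->masked)
		v->pending = true;
	else
		v->delivered++;
}

void model_mask(struct msix_vector_model *v)
{
	v->masked = true;
}

/* Unmask: a latched pending bit is turned into one delivery. */
void model_unmask(struct msix_vector_model *v)
{
	v->masked = false;
	if (v->pending) {
		v->pending = false;
		v->delivered++;
	}
}

/* Two notifications while masked collapse into one delivery on unmask. */
int model_demo(void)
{
	struct msix_vector_model v = { 0 };

	model_notify(&v);	/* unmasked: delivered immediately */
	model_mask(&v);
	model_notify(&v);	/* masked: latched in the pending bit */
	model_notify(&v);	/* masked: coalesced with the previous one */
	model_unmask(&v);	/* pending bit set: fires once */

	assert(!v.pending);
	return v.delivered;
}
```

In this model, notifications that arrive while the vector is masked are
coalesced but at least one interrupt is delivered after unmask, which is
the property that matters for the discussion below.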

Do you have any suggestion for this virtual MSI-X controller? Thanks.

Regards,
Liangyan


On 2025/7/2 21:17, Thomas Gleixner wrote:
> On Wed, Jul 02 2025 at 00:35, Liangyan wrote:
>>   void handle_edge_irq(struct irq_desc *desc)
>>   {
>> +	bool need_unmask = false;
>> +
>>   	guard(raw_spinlock)(&desc->lock);
>>   
>>   	if (!irq_can_handle(desc)) {
>> @@ -791,12 +793,16 @@ void handle_edge_irq(struct irq_desc *desc)
>>   		if (unlikely(desc->istate & IRQS_PENDING)) {
>>   			if (!irqd_irq_disabled(&desc->irq_data) &&
>>   			    irqd_irq_masked(&desc->irq_data))
>> -				unmask_irq(desc);
>> +				need_unmask = true;
>>   		}
>>   
>>   		handle_irq_event(desc);
>>   
>>   	} while ((desc->istate & IRQS_PENDING) && !irqd_irq_disabled(&desc->irq_data));
>> +
>> +	if (need_unmask && !irqd_irq_disabled(&desc->irq_data) &&
>> +	    irqd_irq_masked(&desc->irq_data))
>> +		unmask_irq(desc);
> 
> This might work in your setup by some definition of "works", but it
> breaks the semantics of this handler because of this:
> 
> device interrupt        CPU0                            CPU1
>                          handle_edge_irq()
>                          set(INPROGRESS);
> 
>                          do {
>                                 handle_event();
> 
> device interrupt
>                                                          handle_edge_irq()
>                                                             if (INPROGRESS) {
>                                                               set(PENDING);
>                                                               mask();
>                                                               return;
>                                                             }
> 
>                                 ...
>                                 if (PENDING) {
>                                    need_unmask = true;
>                                 }
>                                 handle_event();
> 
> device interrupt   << possible FAIL
> 
> because there are enough edge type interrupt controllers out there which
> lose an edge when the line is masked at the interrupt controller
> level. As edge type interrupts are fire and forget from the device
> perspective, the interrupt is not retriggered when unmasking later.
> 
> That's the reason why this handler is written the way it is and this
> cannot be changed for obvious reasons.
> 
> So no, this is not going to happen.
> 
> The only possible solution for this is to analyze all interrupt
> controllers, which are involved in the delivery chain, and establish
> whether they are affected by the above problem. If not, then that
> particular delivery chain combination of interrupt controllers can be
> changed to use a different flow handler along with a profound
> explanation why this is correct under all circumstances.
> 
> As you failed to provide any information about the involved controllers,
> I cannot even give any hint about a possible solution.
> 
> Thanks,
> 
>          tglx
> 
>
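
The race in the diagram above can be sketched as a small sequential
model. It is only an illustration, assuming a controller that silently
drops edges while the line is masked; it is not kernel code, and the
names `sim`, `edge_while_in_progress` and `simulate` are hypothetical.

```c
#include <stdbool.h>

enum unmask_policy {
	UNMASK_IN_LOOP,		/* current handler: unmask before handle_event() */
	UNMASK_AFTER_LOOP,	/* proposed change: defer unmask until the loop ends */
};

struct sim {
	bool masked;		/* line masked at the interrupt controller */
	bool irqs_pending;	/* stands in for IRQS_PENDING */
	int edges_seen;		/* edges that reached the flow handler */
};

/*
 * An edge arrives while CPU0 is still inside the handler loop.
 * Assumption: the modeled controller loses the edge if the line is masked.
 */
void edge_while_in_progress(struct sim *s)
{
	if (s->masked)
		return;		/* edge lost; the device does not retrigger */

	/* CPU1 path: set(PENDING); mask(); return; */
	s->edges_seen++;
	s->irqs_pending = true;
	s->masked = true;
}

/* Replays the three device interrupts from the diagram. */
int simulate(enum unmask_policy policy)
{
	struct sim s = { .edges_seen = 1 };	/* first edge entered the handler */
	bool need_unmask = false;

	/* First handle_event(): the second edge arrives on CPU1. */
	edge_while_in_progress(&s);

	/* Second loop iteration: PENDING is set. */
	if (s.irqs_pending) {
		if (policy == UNMASK_IN_LOOP)
			s.masked = false;
		else
			need_unmask = true;
	}
	s.irqs_pending = false;		/* consumed by handle_event() */
	edge_while_in_progress(&s);	/* third edge during handle_event() */

	/* A further iteration runs only if the third edge was noticed. */
	if (s.irqs_pending) {
		s.masked = false;
		s.irqs_pending = false;
	}
	if (need_unmask)
		s.masked = false;

	return s.edges_seen;
}
```

Under these assumptions `simulate(UNMASK_IN_LOOP)` sees all three edges,
while `simulate(UNMASK_AFTER_LOOP)` sees only two, because the third
edge arrives while the line is still masked.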