Message-ID: <3a045365-3cd0-408e-a366-4c81a1b60cbb@huawei.com>
Date: Thu, 15 Jan 2026 10:50:29 +0800
From: duziming <duziming2@...wei.com>
To: Bjorn Helgaas <helgaas@...nel.org>
CC: <bhelgaas@...gle.com>, <okaya@...nel.org>, <keith.busch@...el.com>,
<linux-pci@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<liuyongqiang13@...wei.com>
Subject: Re: [PATCH] PCI: Fix AB-BA deadlock between aer_isr() and
device_shutdown()
On 2026/1/14 2:51, Bjorn Helgaas wrote:
> On Fri, Jan 09, 2026 at 05:56:03PM +0800, Ziming Du wrote:
>> During system shutdown, a deadlock may occur between the AER recovery
>> process and device shutdown, as follows:
>>
>> The device_shutdown path holds the device_lock throughout the entire
>> process and waits for the irq handlers to complete in release_nodes():
>>
>> device_shutdown
>>   device_lock                            # A: hold device_lock
>>   pci_device_shutdown
>>     pcie_port_device_remove
>>       remove_iter
>>         device_unregister
>>           device_del
>>             bus_remove_device
>>               device_release_driver
>>                 devres_release_all
>>                   release_nodes          # B: wait for irq handlers
> Can you add the wait location to these example? release_nodes()
> doesn't wait itself, so I guess it must be in a dr->node.release()
> function?
>
> And I guess it must be related to something in the IRQ path that is
> held while aer_isr() runs?
When the interrupt resources are released, release_nodes() eventually calls
free_irq() through the devres release callback, and free_irq() then calls
__synchronize_irq() to wait until all irq handlers have finished.
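
For reference, the wait is inside free_irq(): the devres release action
registered by devm_request_threaded_irq() looks roughly like this
(paraphrased from kernel/irq/devres.c, so please read it as a sketch rather
than an exact quote):

    /* devres release action for a managed IRQ */
    static void devm_irq_release(struct device *dev, void *res)
    {
            struct irq_devres *this = res;

            /*
             * free_irq() ends up in __synchronize_irq(), which blocks
             * until all in-flight handlers, including the threaded AER
             * handler running aer_isr(), have completed.
             */
            free_irq(this->irq, this->dev_id);
    }

So release_nodes() blocks in this release action while aer_isr() is still
running and waiting for device_lock.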
>> The aer_isr path will acquire device_lock in pci_bus_reset():
>>
>> aer_isr                                  # B: irq handler running
>>   aer_isr_one_error
>>     aer_process_err_devices
>>       handle_error_source
>>         pcie_do_recovery
>>           aer_root_reset
>>             pci_bus_error_reset
>>               pci_bus_reset              # A: acquire device_lock
>>
>> The circular dependency causes a system hang. Fix it by using
>> pci_bus_trylock() instead of pci_bus_lock() in pci_bus_reset(). When the
>> lock is unavailable, return -EAGAIN, as in similar cases.
> pci_bus_error_reset() may use either pci_slot_reset() or
> pci_bus_reset(), and this patch addresses only pci_bus_reset(). Is
> the same deadlock possible in the pci_slot_reset() path?
Looking at the code flow, I agree that the same deadlock is likely possible
in the pci_slot_reset() path.
Unfortunately, my current test environment does not support slot reset, so
I haven't been able to reproduce that scenario locally. It would be very
helpful if someone with a compatible setup could verify or reproduce it.
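
If it does turn out to reproduce, I'd expect the fix to mirror this patch.
A rough, untested sketch of what that might look like, assuming
pci_slot_reset() keeps its current structure in drivers/pci/pci.c:

    static int pci_slot_reset(struct pci_slot *slot, bool probe)
    {
            int rc;

            if (!slot || !pci_slot_resettable(slot))
                    return -ENOTTY;

            /*
             * Untested sketch: take the slot lock with trylock and
             * return -EAGAIN instead of blocking, mirroring the
             * pci_bus_reset() change below.
             */
            if (!probe && !pci_slot_trylock(slot))
                    return -EAGAIN;

            might_sleep();

            rc = pci_reset_hotplug_slot(slot->hotplug, probe);

            if (!probe)
                    pci_slot_unlock(slot);

            return rc;
    }

Something along those lines could go into a v2 if the approach makes sense.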
>> Fixes: c4eed62a2143 ("PCI/ERR: Use slot reset if available")
>> Signed-off-by: Ziming Du <duziming2@...wei.com>
>> ---
>> drivers/pci/pci.c | 17 ++++++++++++-----
>> 1 file changed, 12 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 13dbb405dc31..7471bfa6f32e 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5515,15 +5515,22 @@ static int pci_bus_reset(struct pci_bus *bus, bool probe)
>> if (probe)
>> return 0;
>>
>> - pci_bus_lock(bus);
>> + /*
>> + * Replace blocking lock with trylock to prevent deadlock during bus reset.
>> + * Same as above except return -EAGAIN if the bus cannot be locked.
> Wrap this to fit in 80 columns like the rest of the file.
>
>> + */
>> + if (pci_bus_trylock(bus)) {
>>
>> - might_sleep();
>> + might_sleep();
>>
>> - ret = pci_bridge_secondary_bus_reset(bus->self);
>> + ret = pci_bridge_secondary_bus_reset(bus->self);
>>
>> - pci_bus_unlock(bus);
>> + pci_bus_unlock(bus);
>>
>> - return ret;
>> + return ret;
>> + }
>> +
>> + return -EAGAIN;
>> }
>>
>> /**
>> --
>> 2.43.0
>>