Message-ID: <874izo3x60.ffs@tglx>
Date: Thu, 20 Mar 2025 09:48:23 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Wen Xiong <wenxiong@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, gjoyce@...ux.ibm.com,
linux-pci@...r.kernel.org, Bjorn Helgaas <helgaas@...nel.org>,
linux-scsi@...r.kernel.org
Subject: Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits
EEH
On Thu, Mar 20 2025 at 09:23, Thomas Gleixner wrote:
> On Wed, Mar 19 2025 at 21:58, Wen Xiong wrote:
>> We don't see the issue without the dynamic remove/add operation.
>> There is a small window in which the irqbalance daemon kicks in during
>> device reset. So it took over 6 hours to recreate the issue when running
>> the remove/add operation in a loop.
>
> Sure. You need a loop to hit the window. And it does not matter whether
> it's the probe or the remove which triggers it. Fact is that the reset
> wipes out the config space, which means that any read from the config
> space between reset and restore will return garbage. That problem is not
> restricted to the interrupt code. It's a general problem.
After looking at the code again, it's a problem in the remove()
function:
  __ipr_remove()
     ipr_initiate_ioa_bringdown()
       // resets device
       restore_config_space()
     ....
     ipr_free_all_resources()
       free_irqs()
So yes, it's not probe(). But the question is pretty much the same.
Why is a reset issued while the driver is fully operational and
resources are still in use?
Don't even think about telling me that this is a problem of the MSI
interrupt rework. It is not. It's been broken forever.
You _cannot_ pull the rug out from under a fully operational driver and
expect that all involved parts will just magically handle this gracefully.
What about tearing down resources first and then issuing the reset?
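To illustrate the ordering question, here is a minimal userspace sketch, not
the ipr code. Every name in it (struct mock_dev, mock_set_affinity(),
remove_reset_first(), remove_teardown_first()) is hypothetical, and returning
all ones for a read from wiped config space is only an assumed stand-in for
"garbage". It models an affinity change (for example from irqbalance) arriving
between reset and restore, and counts how often it reads wiped config space
under each ordering.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device model; not the ipr driver's data structures. */
struct mock_dev {
	bool config_valid;	/* false between reset and config restore */
	bool irqs_allocated;	/* true while interrupts are still set up */
	uint32_t msi_config;	/* stand-in for MSI state in config space */
	int garbage_reads;	/* reads that hit wiped config space */
};

/* Config read: assumed to return all ones while config space is wiped. */
static uint32_t mock_config_read(struct mock_dev *d)
{
	if (!d->config_valid) {
		d->garbage_reads++;
		return 0xffffffffu;
	}
	return d->msi_config;
}

/* Models an affinity change; it touches the device only while the
 * interrupt is still allocated. */
static void mock_set_affinity(struct mock_dev *d)
{
	if (d->irqs_allocated)
		(void)mock_config_read(d);
}

static void mock_reset(struct mock_dev *d)          { d->config_valid = false; }
static void mock_restore_config(struct mock_dev *d) { d->config_valid = true; }
static void mock_free_irqs(struct mock_dev *d)      { d->irqs_allocated = false; }

/* Reset first, free interrupts later: the affinity change can land in
 * the window where config space is wiped. */
static int remove_reset_first(struct mock_dev *d)
{
	mock_reset(d);
	mock_set_affinity(d);	/* arrives between reset and restore */
	mock_restore_config(d);
	mock_free_irqs(d);
	return d->garbage_reads;
}

/* Tear down first, then reset: no interrupt is left to be touched
 * while config space is wiped. */
static int remove_teardown_first(struct mock_dev *d)
{
	mock_free_irqs(d);
	mock_reset(d);
	mock_set_affinity(d);	/* same timing, nothing left to touch */
	mock_restore_config(d);
	return d->garbage_reads;
}
```

With both devices starting out fully operational, the reset-first variant
records one read from wiped config space and the teardown-first variant
records none. The sketch only covers the interrupt path; as said above, the
problem is not restricted to the interrupt code.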
Thanks,
tglx