Message-ID: <874izo3x60.ffs@tglx>
Date: Thu, 20 Mar 2025 09:48:23 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Wen Xiong <wenxiong@...ux.ibm.com>
Cc: linux-kernel@...r.kernel.org, gjoyce@...ux.ibm.com,
 linux-pci@...r.kernel.org, Bjorn Helgaas <helgaas@...nel.org>,
 linux-scsi@...r.kernel.org
Subject: Re: [PATCH 1/1] genirq/msi: Dynamic remove/add stroage adapter hits
 EEH

On Thu, Mar 20 2025 at 09:23, Thomas Gleixner wrote:
> On Wed, Mar 19 2025 at 21:58, Wen Xiong wrote:
>> We don't see the issue without the dynamic remove/add operation.
>> There is a small window in which the irqbalance daemon kicks in during
>> device reset, so it took more than 6 hours to recreate the issue when
>> running the remove/add loop.
>
> Sure. You need a loop to hit the window. And it does not matter whether
> it's the probe or the remove which triggers it. Fact is that the reset
> wipes out the config space, which means that any read from the config
> space between reset and restore will return garbage. That problem is not
> restricted to the interrupt code. It's a general problem.

After looking at the code again, it's a problem in the remove()
function:

__ipr_remove()
  ipr_initiate_ioa_bringdown() 
    // resets device
    restore_config_space()
  ....
  ipr_free_all_resources()
    free_irqs()

So yes, it's not probe(). But the question is pretty much the same.

Why is a reset issued while the driver is fully operational and
resources are still in use?

Don't even think about telling me that this is a problem of the MSI
interrupt rework. It is not. It's been broken forever.

You _cannot_ pull the rug out from under a fully operational driver and
expect that all involved parts will just magically handle this gracefully.

What about tearing down resources first and then issuing the reset?
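
To make the ordering question concrete, here is a minimal user-space
sketch. It is a toy model, not ipr driver or kernel code: struct toy_dev,
every toy_* function and both constants are hypothetical names invented
for illustration. It assumes that a reset leaves config space unreadable
until it is restored, and that something like an irqbalance affinity
change touches config space only while interrupts are still allocated.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; not the real ipr driver state. */
struct toy_dev {
	uint32_t msi_cfg;         /* pretend MSI word in config space */
	bool     irqs_allocated;  /* driver still owns its interrupts */
	bool     in_reset_window; /* config space wiped, not yet restored */
	unsigned garbage_reads;   /* config reads that hit the window */
};

#define TOY_MSI_CFG 0x00010005u /* arbitrary illustrative value */
#define TOY_GARBAGE 0xffffffffu /* what a read returns in the window */

static void toy_init(struct toy_dev *d)
{
	d->msi_cfg = TOY_MSI_CFG;
	d->irqs_allocated = true;
	d->in_reset_window = false;
	d->garbage_reads = 0;
}

/* A config space read; counts reads that land between reset and restore. */
static uint32_t toy_cfg_read(struct toy_dev *d)
{
	if (d->in_reset_window) {
		d->garbage_reads++;
		return TOY_GARBAGE;
	}
	return d->msi_cfg;
}

/* Models an affinity change: only possible while interrupts exist. */
static void toy_irq_activity(struct toy_dev *d)
{
	if (d->irqs_allocated)
		(void)toy_cfg_read(d);
}

/* Reset, with the worst case forced: activity lands inside the window. */
static void toy_reset(struct toy_dev *d)
{
	d->in_reset_window = true;  /* reset wipes config space */
	toy_irq_activity(d);        /* concurrent access, if any is possible */
	d->in_reset_window = false; /* config space restored */
}

static void toy_free_irqs(struct toy_dev *d)
{
	d->irqs_allocated = false;
}

/* Order from the call chain above: reset first, free interrupts later. */
static unsigned toy_remove_reset_first(struct toy_dev *d)
{
	toy_reset(d);
	toy_free_irqs(d);
	return d->garbage_reads;
}

/* Suggested order: tear down resources first, then issue the reset. */
static unsigned toy_remove_teardown_first(struct toy_dev *d)
{
	toy_free_irqs(d);
	toy_reset(d);
	return d->garbage_reads;
}
```

Running both remove variants on freshly initialized devices, the
reset-first order records a read inside the window, while the
teardown-first order records none, because nothing is left that could
touch config space during the reset. The model forces the race on every
reset; on real hardware the window is small, which is why a loop is
needed to hit it.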

Thanks,

        tglx
