Message-ID: <CAFv23Q=5KdqDHYxf9PVO=kq=VqP0LwRaHQ-KnY2taDEkZ9Fueg@mail.gmail.com>
Date: Fri, 27 Sep 2024 15:33:50 +0800
From: AceLan Kao <acelan.kao@...onical.com>
To: Lukas Wunner <lukas@...ner.de>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>, 
	linux-pci@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] PCI: pciehp: Fix system hang on resume after hot-unplug
 during suspend

On Thu, Sep 26, 2024 at 9:23 PM Lukas Wunner <lukas@...ner.de> wrote:
>
> On Thu, Sep 26, 2024 at 08:59:09PM +0800, Chia-Lin Kao (AceLan) wrote:
> > Remove unnecessary pci_walk_bus() call in pciehp_resume_noirq(). This
> > fixes a system hang that occurs when resuming after a Thunderbolt dock
> > with attached thunderbolt storage is unplugged during system suspend.
> >
> > The PCI core already handles setting the disconnected state for devices
> > under a port during suspend/resume.
>
> Please explain in the commit message where the PCI core does that.
Hi Lukas,

I've found that my patch doesn't work.
Let me explain my new findings below.

>
> > The redundant bus walk was
> > interfering with proper hardware state detection during resume, causing
> > a system hang when hot-unplugging daisy-chained Thunderbolt devices.
>
> Please explain what "proper hardware state detection" means.
>
> Did you get a hung task stacktrace?  If so, please include it in the
> commit message.
>
> If you're getting a system hang, it means there's a deadlock
> involving pci_bus_sem.  I don't quite see how that could happen,
> so a more verbose explanation would be appreciated.
I have no good answer for you right now.
After enabling some debug options and lock debugging options, I
still didn't get any messages.
It just hangs while resuming, and nothing gets logged.

Here is my setup:
ubuntu@...alhost:~$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation Device 6400
           +-02.0  Intel Corporation Lunar Lake [Intel Graphics]
           +-04.0  Intel Corporation Device 641d
           +-05.0  Intel Corporation Device 645d
           +-07.0-[01-38]--
           +-07.2-[39-70]----00.0-[3a-70]--+-00.0-[3b]--
           |                               +-01.0-[3c-4d]--
           |                               +-02.0-[4e-5f]----00.0-[4f-50]----01.0-[50]----00.0  Phison Electronics Corporation E12 NVMe Controller
           |                               +-03.0-[60-6f]--
           |                               \-04.0-[70]--

This is the Dell WD22TB dock:
39:00.0 PCI bridge [0604]: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0b26] (rev 03)
        Subsystem: Intel Corporation Thunderbolt 4 Bridge [Goshen Ridge 2020] [8086:0000]
        Kernel driver in use: pcieport

This is the TBT storage connected to the dock:
50:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E12 NVMe Controller [1987:5012] (rev 01)
        Subsystem: Phison Electronics Corporation E12 NVMe Controller [1987:5012]
        Kernel driver in use: nvme
        Kernel modules: nvme

While resuming, the dock (8086:0b26) resumes first, and I found that
the problem appears when the dock device runs through the two calls
below in pciehp_resume_noirq()

    if (pciehp_device_replaced(ctrl)) {
        pci_walk_bus(ctrl->pcie->port->subordinate,
                     pci_dev_set_disconnected, NULL);
        pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
    }

and the storage (1987:5012) then also runs through them.
Only one of the two devices may enter the if statement, the dock or
the storage; otherwise the system hangs.
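
For context, pci_dev_set_disconnected() itself only updates software
state. Paraphrased from a recent tree (drivers/pci/pci.c), not
verbatim, it boils down to something like:

    /* Sketch of pci_dev_set_disconnected(): mark the device as gone
     * in software only; no hardware access is involved. */
    int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
    {
            device_lock(&dev->dev);
            dev->error_state = pci_channel_io_perm_failure;
            device_unlock(&dev->dev);

            return 0;
    }

so the bus walk itself should only contend on pci_bus_sem and the
per-device locks.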

To test it further, I found that masking out pciehp_request() eases the issue.
It may be because it calls irq_wake_thread() and gets stuck somewhere.
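
For reference, pciehp_request() is tiny. In my tree
(drivers/pci/hotplug/pciehp_hpc.c) it is essentially the following
(quoting from memory, so it may differ slightly):

    /* pciehp_request(): record the event, then kick the IRQ thread. */
    void pciehp_request(struct controller *ctrl, int action)
    {
            atomic_or(action, &ctrl->pending_events);
            if (!pciehp_poll_mode)
                    irq_wake_thread(ctrl->pcie->irq, ctrl);
    }

So masking it out drops both the event recording and the
irq_wake_thread() call; the change below keeps the former and drops
only the latter.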

So I tried to get rid of the irq_wake_thread() call by replacing
   pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
with
   atomic_or(PCI_EXP_SLTSTA_PDC, &ctrl->pending_events);
With this change, only the dock (8086:0b26) takes this path in
pciehp_resume_noirq(), and the TBT storage is released after
pci_power_up() fails.
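
The recorded PDC event should still be handled later, because
pciehp_ist() drains pending_events, roughly like this (again from
memory, may differ slightly in your tree):

    /* In pciehp_ist(): consume all events recorded via pending_events. */
    events = atomic_xchg(&ctrl->pending_events, 0);

so the event isn't lost; it just isn't processed from the noirq phase.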

Do you think this is a feasible solution?

diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
index ff458e692fed..56bf23d55c41 100644
--- a/drivers/pci/hotplug/pciehp_core.c
+++ b/drivers/pci/hotplug/pciehp_core.c
@@ -332,7 +332,7 @@ static int pciehp_resume_noirq(struct pcie_device *dev)
                        ctrl_dbg(ctrl, "device replaced during system sleep\n");
                        pci_walk_bus(ctrl->pcie->port->subordinate,
                                     pci_dev_set_disconnected, NULL);
-                       pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
+                       atomic_or(PCI_EXP_SLTSTA_PDC, &ctrl->pending_events);
                }
        }

>
> Thanks,
>
> Lukas
>
>
> > Fixes: 9d573d19547b ("PCI: pciehp: Detect device replacement during system sleep")
> > Signed-off-by: Chia-Lin Kao (AceLan) <acelan.kao@...onical.com>
> > ---
> >  drivers/pci/hotplug/pciehp_core.c | 2 --
> >  1 file changed, 2 deletions(-)
> >
> > diff --git a/drivers/pci/hotplug/pciehp_core.c b/drivers/pci/hotplug/pciehp_core.c
> > index ff458e692fed..c1c3f7e2bc43 100644
> > --- a/drivers/pci/hotplug/pciehp_core.c
> > +++ b/drivers/pci/hotplug/pciehp_core.c
> > @@ -330,8 +330,6 @@ static int pciehp_resume_noirq(struct pcie_device *dev)
> >                */
> >               if (pciehp_device_replaced(ctrl)) {
> >                       ctrl_dbg(ctrl, "device replaced during system sleep\n");
> > -                     pci_walk_bus(ctrl->pcie->port->subordinate,
> > -                                  pci_dev_set_disconnected, NULL);
> >                       pciehp_request(ctrl, PCI_EXP_SLTSTA_PDC);
> >               }
> >       }
> > --
> > 2.43.0
