Message-ID: <PSXP216MB04384F89D9D9DDA6999347CF805B0@PSXP216MB0438.KORP216.PROD.OUTLOOK.COM>
Date:   Tue, 10 Dec 2019 12:00:23 +0000
From:   Nicholas Johnson <nicholas.johnson-opensource@...look.com.au>
To:     "mika.westerberg@...ux.intel.com" <mika.westerberg@...ux.intel.com>
CC:     Bjorn Helgaas <helgaas@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
Subject: Re: Linux v5.5 serious PCI bug

On Tue, Dec 10, 2019 at 09:28:00AM +0200, mika.westerberg@...ux.intel.com wrote:
> On Mon, Dec 09, 2019 at 01:33:49PM +0000, Nicholas Johnson wrote:
> > On Mon, Dec 09, 2019 at 03:12:39PM +0200, mika.westerberg@...ux.intel.com wrote:
> > > On Mon, Dec 09, 2019 at 12:34:04PM +0000, Nicholas Johnson wrote:
> > > > Hi,
> > > > 
> > > > I have compiled Linux v5.5-rc1 and thought all was good until I 
> > > > hot-removed a Gigabyte Aorus eGPU from Thunderbolt. The driver for the 
> > > > GPU was not loaded (blacklisted) so the crash is nothing to do with the 
> > > > GPU driver.
> > > > 
> > > > We had:
> > > > - kernel NULL pointer dereference
> > > > - refcount_t: underflow; use-after-free.
> > > > 
> > > > Attaching dmesg for now; will bisect and come back with results.
> > > 
> > > Looks like something related to iommu. Does it work if you disable it?
> > > (intel_iommu=off in the command line).
> > I thought it could be that, too.
> > 
> > The attachment "dmesg-4" from the original email is with iommu parameters.
> > The attachment "dmesg-5" from the original email is with no iommu parameters.
> > Attaching here "dmesg-6" with the iommu explicitly set off like you said.
> > 
> > No difference, still broken. Although, with iommu off, there are less stack traces.
> > 
> > Could it be sysfs-related?
> 
> Bisect would probably be the best option to find the culprit commit.
> There are couple of commits done for pciehp so reverting them one by one
> may help as well:
> 
>   87d0f2a5536f PCI: pciehp: Prevent deadlock on disconnect
>   75fcc0ce72e5 PCI: pciehp: Do not disable interrupt twice on suspend
>   b94ec12dfaee PCI: pciehp: Refactor infinite loop in pcie_poll_cmd()
>   157c1062fcd8 PCI: pciehp: Avoid returning prematurely from sysfs requests

You are not going to believe this. The offending commit is in the SOUND
subsystem. I thought I had messed up the bisect when only sound commits
were showing up near the end.

And yes, I double checked.

Reverted, compiled, tested that it started working.
Reapplied, compiled, tested that it stopped working.
Twice.
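
For anyone following along, the bisect itself is just a binary search over
the commit range. Below is a minimal Python sketch of that idea, not the
actual procedure I ran. The commit names are placeholders, and is_bad() is a
hypothetical stand-in for "build, boot, hot-remove the eGPU, check dmesg".
It assumes that every commit after the first bad one is also bad, which is
the same assumption git bisect makes.

```python
from typing import Callable, List, Sequence


def first_bad_commit(commits: Sequence[str],
                     is_bad: Callable[[str], bool]) -> str:
    """Return the first commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, that the newest one is
    bad, and that every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid          # first bad commit is at mid or earlier
        else:
            lo = mid + 1      # first bad commit is after mid
    return commits[lo]


# Hypothetical history: placeholder names, not real commit ids.
history: List[str] = [f"commit-{i}" for i in range(16)]
culprit = "commit-11"
tested: List[str] = []


def is_bad(commit: str) -> bool:
    # Stand-in for a real build-and-test cycle; records what was tested.
    tested.append(commit)
    return history.index(commit) >= history.index(culprit)


found = first_bad_commit(history, is_bad)
```

The revert/reapply check above is the extra step that a search like this
cannot provide on its own: it confirms the commit it lands on really is the
cause, rather than the result of a mis-marked step.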

The following is the culprit responsible for the issues:

commit 586bc4aab878efcf672536f0cdec3d04b6990c94
Author: Alex Deucher <alexander.deucher@....com>
Date:   Fri Nov 22 16:43:50 2019 -0500

    ALSA: hda/hdmi - fix vgaswitcheroo detection for AMD

The commit touches PCI devices, and hot-removal was apparently not
considered. My guess is that it is seeing the audio PCI function of the
AMD card in that Thunderbolt eGPU enclosure.

I will collect information, file a bugzilla report and contact the AMD
team. If anybody wants to be cc'd, let me know. Sorry for bothering you
and Bjorn with something which turns out to have nothing directly to do
with the PCI subsystem or Thunderbolt.

I strongly hope that the upcoming Intel Xe GPU driver allows for 
surprise-removal in Linux without any crashing of kernel or userspace. 
The amdgpu and nouveau drivers do not take to surprise removal kindly, 
even without the above sound bug applying to AMD.

Kind regards,
Nicholas
