lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 15 Feb 2023 09:39:13 -0600
From:   Bjorn Helgaas <helgaas@...nel.org>
To:     Baolu Lu <baolu.lu@...ux.intel.com>
Cc:     Felix Kuehling <felix.kuehling@....com>,
        "Deucher, Alexander" <Alexander.Deucher@....com>,
        "Hegde, Vasant" <Vasant.Hegde@....com>,
        Matt Fagnani <matt.fagnani@...l.net>,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        Joerg Roedel <jroedel@...e.de>,
        Jason Gunthorpe <jgg@...dia.com>,
        "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        LKML <linux-kernel@...r.kernel.org>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
        Linux PCI <linux-pci@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Christian König <christian.koenig@....com>,
        "Pan, Xinhui" <Xinhui.Pan@....com>, amd-gfx@...ts.freedesktop.org
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

[+cc Christian, Xinhui, amd-gfx]

On Fri, Jan 06, 2023 at 01:48:11PM +0800, Baolu Lu wrote:
> On 1/5/23 11:27 PM, Felix Kuehling wrote:
> > Am 2023-01-05 um 09:46 schrieb Deucher, Alexander:
> > > > -----Original Message-----
> > > > From: Hegde, Vasant <Vasant.Hegde@....com>
> > > > On 1/5/2023 4:07 PM, Baolu Lu wrote:
> > > > > On 2023/1/5 18:27, Vasant Hegde wrote:
> > > > > > On 1/5/2023 6:39 AM, Matt Fagnani wrote:
> > > > > > > I built 6.2-rc2 with the patch applied. The same black
> > > > > > > screen problem happened with 6.2-rc2 with the patch. I
> > > > > > > tried to use early kdump with 6.2-rc2 with the patch
> > > > > > > twice by panicking the kernel with sysrq+alt+c after the
> > > > > > > black screen happened. The system rebooted after about
> > > > > > > 10-20 seconds both times, but no kdump and dmesg files
> > > > > > > were saved in /var/crash. I'm attaching the lspci -vvv
> > > > > > > output as requested. ...

> > > > > > Looking into lspci output, it doesn't list ACS feature
> > > > > > for Graphics card. So with your fix it didn't enable PASID
> > > > > > and hence it failed to boot. ...

> > > > > So do you mind telling why does the PASID need to be enabled
> > > > > for the graphic device? Or in another word, what does the
> > > > > graphic driver use the PASID for? ...

> > > The GPU driver uses the pasid for shared virtual memory between
> > > the CPU and GPU.  I.e., so that the user apps can use the same
> > > virtual address space on the GPU and the CPU.  It also uses
> > > pasid to take advantage of recoverable device page faults using
> > > PRS. ...

> > Agreed. This applies to GPU computing on some older AMD APUs that
> > take advantage of memory coherence and IOMMUv2 address translation
> > to create a shared virtual address space between the CPU and GPU.
> > In this case it seems to be a Carrizo APU. It is also true for
> > Raven APUs. ...

> Thanks for the explanation.
> 
> This is actually the problem that commit 201007ef707a was trying to
> fix.  The PCIe fabric routes Memory Requests based on the TLP
> address, ignoring any PASID (PCIe r6.0, sec 2.2.10.4), so a TLP with
> PASID that should go upstream to the IOMMU may instead be routed as
> a P2P Request if its address falls in a bridge window.
> 
> In SVA case, the IOMMU shares the address space of a user
> application.  The user application side has no knowledge about the
> PCI bridge window.  It is entirely possible that the device is
> programed with a P2P address and results in a disaster.

Is this stalled?  We explored the idea of changing the PCI core so
that for devices that use ATS/PRI, we could enable PASID without
checking for ACS [1], but IIUC we ultimately concluded that it was
based on a misunderstanding of how ATS Translation Requests are routed
and that an AMD driver change would be required [2].

So it seems like we still have this regression, and we're running out
of time before v6.2.

[1] https://lore.kernel.org/all/20230114073420.759989-1-baolu.lu@linux.intel.com/
[2] https://lore.kernel.org/all/Y91X9MeCOsa67CC6@nvidia.com/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ