Message-ID: <525730fd-5982-fea7-b6d5-2da69f225f04@amd.com>
Date:   Tue, 10 Jan 2023 21:38:13 +0530
From:   Vasant Hegde <vasant.hegde@....com>
To:     Matt Fagnani <matt.fagnani@...l.net>,
        Baolu Lu <baolu.lu@...ux.intel.com>,
        Thorsten Leemhuis <regressions@...mhuis.info>
Cc:     Joerg Roedel <jroedel@...e.de>,
        "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        LKML <linux-kernel@...r.kernel.org>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
        Linux PCI <linux-pci@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

Matt,


On 1/6/2023 12:58 PM, Matt Fagnani wrote:
> I booted 6.2-rc2 + patch with rd.driver.blacklist=amdgpu on the kernel command
> line to prevent amdgpu from being started while the initramfs was in use. The
> black screen problem happened later in the boot. I pressed sysrq+alt+s,u,b to do
> an emergency sync, remount read-only, and reboot. The journal for that boot was
> shown on the next boot. The two warnings which I previously reported weren't
> shown in the journal, but the same null pointer dereference which made amdgpu
> crash happened. I'm attaching the kernel log from the journal of that boot.
> 

Thanks for your effort in getting the boot log. This is helpful.

Looking into the code further, iommu_detach_group() didn't attach the devices
back to default_domain. So from the IOMMU point of view, the device group was
left in an inconsistent state. This resulted in the IOMMU throwing page fault
errors, and the AMD IOMMU event handler code always assumes that the domain is
set up properly. That resulted in the NULL pointer dereference below.

  Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000058
  Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
  Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
  Jan 06 02:07:53 kernel: PGD 0 P4D 0
  Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
  Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted 6.2.0-rc2+ #89
  Jan 06 02:07:53 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
  Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90
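
To illustrate the failure mode, here is a minimal userspace sketch of a fault
reporter that does not assume a domain is attached. It is a toy model, not the
kernel code: toy_fault_domain, toy_report_fault() and the return values are
hypothetical names made up for this example, and whether the real handler
should guard this way is an assumption, not something decided here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a domain with an optional fault callback. */
struct toy_fault_domain {
	int (*handler)(unsigned long iova);
};

/*
 * Report a fault for the given domain. Unlike code that assumes the
 * domain is always set up, this checks the pointer before reading
 * any field through it.
 */
static int toy_report_fault(struct toy_fault_domain *domain,
			    unsigned long iova)
{
	if (!domain)		/* group left without a domain */
		return -1;
	if (!domain->handler)	/* domain present, no callback set */
		return -2;
	return domain->handler(iova);
}

/* Example callback: claims every fault as handled. */
static int toy_handler(unsigned long iova)
{
	(void)iova;
	return 0;
}

/* Small usage example covering the three cases. */
static void toy_fault_demo(void)
{
	struct toy_fault_domain no_cb = { NULL };
	struct toy_fault_domain with_cb = { toy_handler };

	assert(toy_report_fault(NULL, 0x1000) == -1);
	assert(toy_report_fault(&no_cb, 0x1000) == -2);
	assert(toy_report_fault(&with_cb, 0x1000) == 0);
}
```

A guard like this only avoids the crash. It does not fix the inconsistent
group state that causes the faults in the first place.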

Ideally, if domain attach fails (in this case because the PASID capability
check returned an error), we should put the devices back into the original
domain so that they can continue without PASID capability.
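
The rollback idea can be sketched in a small userspace model. This is not the
patch mentioned below and not the kernel API: toy_domain, toy_group,
toy_attach() and toy_switch_domain() are hypothetical names, and the PASID
check is reduced to a single flag for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical domain: may require a PASID-capable device. */
struct toy_domain {
	int needs_pasid;
};

/* Hypothetical device group with a fallback default domain. */
struct toy_group {
	struct toy_domain *default_domain;
	struct toy_domain *domain;	/* currently attached, or NULL */
	int has_pasid;			/* device capability flag */
};

/* Attach fails when the domain needs PASID and the device lacks it. */
static int toy_attach(struct toy_group *g, struct toy_domain *d)
{
	if (d->needs_pasid && !g->has_pasid)
		return -1;
	g->domain = d;
	return 0;
}

/*
 * Switch the group to new_domain. If the attach fails, reattach the
 * previous domain (or the default one) so the group is not left
 * without a domain.
 */
static int toy_switch_domain(struct toy_group *g,
			     struct toy_domain *new_domain)
{
	struct toy_domain *old = g->domain ? g->domain : g->default_domain;
	int ret;

	g->domain = NULL;		/* detach */
	ret = toy_attach(g, new_domain);
	if (ret)
		g->domain = old;	/* roll back to the known-good domain */
	return ret;
}

/* Usage example: failed switch keeps the default domain attached. */
static void toy_switch_demo(void)
{
	struct toy_domain dflt = { 0 };
	struct toy_domain pasid_dom = { 1 };
	struct toy_group g = { &dflt, &dflt, 0 };

	assert(toy_switch_domain(&g, &pasid_dom) != 0);
	assert(g.domain == &dflt);
}
```

In this model the caller still sees the error, so it can decide to carry on
without PASID, but the group always ends up attached to some valid domain.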

I have a patch to handle these error conditions (it is not the fix for the
original issue). I will try to post it soon.

-Vasant
