Message-ID: <525730fd-5982-fea7-b6d5-2da69f225f04@amd.com>
Date: Tue, 10 Jan 2023 21:38:13 +0530
From: Vasant Hegde <vasant.hegde@....com>
To: Matt Fagnani <matt.fagnani@...l.net>,
Baolu Lu <baolu.lu@...ux.intel.com>,
Thorsten Leemhuis <regressions@...mhuis.info>
Cc: Joerg Roedel <jroedel@...e.de>,
"iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
LKML <linux-kernel@...r.kernel.org>,
"regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
Linux PCI <linux-pci@...r.kernel.org>,
Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled
Matt,
On 1/6/2023 12:58 PM, Matt Fagnani wrote:
> I booted 6.2-rc2 + patch with rd.driver.blacklist=amdgpu on the kernel command
> line to prevent amdgpu from being started while the initramfs was in use. The
> black screen problem happened later in the boot. I pressed sysrq+alt+s,u,b to do
> an emergency sync, remount read-only, and reboot. The journal for that boot was
> shown on the next boot. The two warnings which I previously reported weren't
> shown in the journal, but the same null pointer dereference which made amdgpu
> crash happened. I'm attaching the kernel log from the journal of that boot.
>
Thanks for your effort to get the boot log. This is helpful.
Looking into the code further,
iommu_detach_group() didn't attach the devices back to the default_domain. So
from the IOMMU point of view, the device group was left in an inconsistent
state. This resulted in the IOMMU throwing page fault errors, and the AMD IOMMU
event handler code always assumes that the domain is set up properly. That
resulted in the NULL pointer dereference below.
Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000058
Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
Jan 06 02:07:53 kernel: PGD 0 P4D 0
Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted 6.2.0-rc2+ #89
Jan 06 02:07:53 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90
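To illustrate the failure mode, here is a rough userspace model of a fault
path that checks for a missing domain instead of assuming one is attached. It
is only a sketch: struct model_domain, struct model_dev and fault_domain_id()
are hypothetical names made up for this example. They are not kernel APIs, and
this is not the patch mentioned in this mail.

```c
#include <stddef.h>

/* Hypothetical userspace model; these are not kernel structures. */
struct model_domain {
    int id;
};

struct model_dev {
    struct model_domain *domain; /* NULL while the device has no domain */
};

/*
 * Return the id of the domain a fault should be reported against,
 * or -1 if the device is not attached to any domain.
 * Without the NULL check, dev->domain->id would dereference NULL.
 */
static int fault_domain_id(const struct model_dev *dev)
{
    if (dev == NULL || dev->domain == NULL)
        return -1;
    return dev->domain->id;
}
```

In this model a device left without a domain gets -1 back rather than
triggering a NULL dereference in the handler.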
Ideally, if domain attach fails (in this case because the PASID capability
check returned an error), we should put the devices back into their original
domain so that they can continue without the PASID capability.
I have a patch to handle these error conditions (it is not the fix for the
original issue). I will try to post it soon.
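The rollback idea can be sketched in a small userspace model like the one
below. Everything in it is an assumption for illustration: struct
sketch_domain, struct sketch_group, try_attach() and switch_domain() are
hypothetical names, not the kernel's IOMMU API, and the capability check is
reduced to a single flag.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct sketch_domain {
    const char *name;
    bool needs_pasid;              /* attach requires PASID support */
};

struct sketch_group {
    struct sketch_domain *domain;         /* currently attached domain */
    struct sketch_domain *default_domain; /* fallback domain */
    bool pasid_capable;
};

/* Return 0 on success, -1 if the (modelled) capability check fails. */
static int try_attach(struct sketch_group *g, struct sketch_domain *d)
{
    if (d->needs_pasid && !g->pasid_capable)
        return -1;
    g->domain = d;
    return 0;
}

/*
 * Detach from the current domain and try to attach to new_domain.
 * On failure, restore the previous domain (or the default one) so the
 * group is never left without a domain.
 */
static int switch_domain(struct sketch_group *g, struct sketch_domain *new_domain)
{
    struct sketch_domain *old = g->domain;
    int ret;

    g->domain = NULL;              /* detached: the inconsistent window */
    ret = try_attach(g, new_domain);
    if (ret)
        g->domain = old ? old : g->default_domain; /* roll back */
    return ret;
}
```

With this shape, a failed attach still reports the error to the caller, but
the group ends up on its previous domain and can keep working without PASID.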
-Vasant