linux-kernel - Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled


Message-ID: <d6342073-132a-4bdd-e1cb-b14f972b61c8@amd.com>
Date:   Tue, 10 Jan 2023 21:42:26 +0530
From:   Vasant Hegde <vasant.hegde@....com>
To:     Matt Fagnani <matt.fagnani@...l.net>,
        Baolu Lu <baolu.lu@...ux.intel.com>,
        Thorsten Leemhuis <regressions@...mhuis.info>
Cc:     Joerg Roedel <jroedel@...e.de>,
        "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        LKML <linux-kernel@...r.kernel.org>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
        Linux PCI <linux-pci@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled



On 1/10/2023 9:38 PM, Vasant Hegde wrote:
> Matt,
> 
> 
> On 1/6/2023 12:58 PM, Matt Fagnani wrote:
>> I booted 6.2-rc2 + patch with rd.driver.blacklist=amdgpu on the kernel command
>> line to prevent amdgpu from being started while the initramfs was in use. The
>> black screen problem happened later in the boot. I pressed sysrq+alt+s,u,b to do
>> an emergency sync, remount read-only, and reboot. The journal for that boot was
>> shown on the next boot. The two warnings which I previously reported weren't
>> shown in the journal, but the same null pointer dereference which made amdgpu
>> crash happened. I'm attaching the kernel log from the journal of that boot.
>>
> 
> Thanks for your effort to get the boot log. This is helpful.
> 
> Looking into the code further, iommu_detach_group() didn't attach devices
> back to default_domain.

... because iommu_detach_group() expects the new domain to be different from
group->domain.

-Vasant


> So from the IOMMU's point of view the device group was left in an
> inconsistent state. This resulted in the IOMMU throwing page fault errors,
> and the AMD IOMMU event handler code always assumes that the domain is set
> up properly. That resulted in the NULL pointer dereference below.
> 
>   Jan 06 02:07:52 kernel: BUG: kernel NULL pointer dereference, address: 0000000000000058
>   Jan 06 02:07:52 kernel: #PF: supervisor read access in kernel mode
>   Jan 06 02:07:53 kernel: #PF: error_code(0x0000) - not-present page
>   Jan 06 02:07:53 kernel: PGD 0 P4D 0
>   Jan 06 02:07:53 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
>   Jan 06 02:07:53 kernel: CPU: 2 PID: 56 Comm: irq/24-AMD-Vi Not tainted 6.2.0-rc2+ #89
>   Jan 06 02:07:53 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS F.52 12/03/2019
>   Jan 06 02:07:53 kernel: RIP: 0010:report_iommu_fault+0x11/0x90
> 
> Ideally, if domain attach fails (in this case because the PASID capability
> check returned an error), we should put the devices back into the original
> domain, so that they can continue without PASID capability.
> 
> I have a patch to handle these error conditions (not the fix for original
> issue). I will try to post it soon.
> 
> -Vasant
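
The error handling described in the quoted text (put the devices back into
the original domain when attaching a new one fails) can be illustrated with a
small user-space model. This is only a sketch under stated assumptions: the
toy_domain and toy_group types and the toy_attach() and
toy_attach_with_rollback() functions are hypothetical names made up for this
example. They are not the kernel's IOMMU API and not the patch mentioned
above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy types; they only model the state discussed in the thread. */
struct toy_domain {
	int requires_pasid;		/* attaching needs PASID capability */
};

struct toy_group {
	int pasid_capable;		/* result of the device capability check */
	struct toy_domain *domain;	/* currently attached domain, or NULL */
};

/*
 * Models the failing path: the old domain is dropped first, so a failed
 * capability check leaves the group with no domain at all.
 */
int toy_attach(struct toy_group *grp, struct toy_domain *dom)
{
	grp->domain = NULL;
	if (dom->requires_pasid && !grp->pasid_capable)
		return -1;		/* group is now in an inconsistent state */
	grp->domain = dom;
	return 0;
}

/*
 * Models the suggested behaviour: remember the original domain and restore
 * it if the attach fails, so the group keeps working without PASID.
 */
int toy_attach_with_rollback(struct toy_group *grp, struct toy_domain *dom)
{
	struct toy_domain *old;
	int ret;

	assert(grp != NULL && dom != NULL);
	old = grp->domain;

	ret = toy_attach(grp, dom);
	if (ret && old)
		grp->domain = old;	/* roll back to the original domain */
	return ret;
}
```

In this model a plain toy_attach() of a PASID-requiring domain to a group
without PASID capability leaves grp->domain NULL, which corresponds to the
state the fault handler later trips over. The rollback variant returns the
same error but leaves the group attached to its original domain.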