Message-ID: <091379d2-c7b1-9eb1-f0de-c59ddaad7b22@bell.net>
Date:   Tue, 10 Jan 2023 17:12:25 -0500
From:   Matt Fagnani <matt.fagnani@...l.net>
To:     Baolu Lu <baolu.lu@...ux.intel.com>
Cc:     Jason Gunthorpe <jgg@...dia.com>,
        Vasant Hegde <vasant.hegde@....com>,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        Joerg Roedel <jroedel@...e.de>,
        "iommu@...ts.linux.dev" <iommu@...ts.linux.dev>,
        LKML <linux-kernel@...r.kernel.org>,
        "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>,
        Linux PCI <linux-pci@...r.kernel.org>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Christian König <christian.koenig@....com>,
        Alex Deucher <alexander.deucher@....com>,
        "Pan, Xinhui" <Xinhui.Pan@....com>,
        Felix Kuehling <felix.kuehling@....com>,
        amd-gfx@...ts.freedesktop.org
Subject: Re: [regression, bisected, pci/iommu] Bug 216865 - Black screen when amdgpu started during 6.2-rc1 boot with AMD IOMMU enabled

Baolu,

I ran git stash and git checkout v6.2-rc3 to reset to a fresh 6.2-rc3. I 
checked that the previous change had been removed by looking at 
drivers/pci/ats.c and gitk. I ran git revert 201007ef707a with v6.2-rc3 
and built that. 6.2-rc3 with 201007ef707a reverted booted normally 
without the problem.

I reset to 6.2-rc3 and checked the change was removed as before. I 
applied your second patch with git apply 
0001-for-debug-purpose-only.patch and built that. 6.2-rc3 with 
0001-for-debug-purpose-only.patch had the black screen problem. I booted 
it a second time with rd.driver.blacklist=amdgpu on the kernel command 
line so amdgpu wouldn't be started while the initramfs was in use and 
the journal would be saved. The black screen happened later in the boot 
as before. I pressed Alt+SysRq+s, u, b to sync, remount read-only and 
reboot. The journal of that boot didn't have the two warnings I reported 
before. Instead, a different NULL pointer dereference happened in 
pci_dev_specific_acs_enabled, with pci_acs_enabled at the top of the call 
trace, which made amdgpu crash as follows.

Jan 10 16:32:31 kernel: [drm] amdgpu kernel modesetting enabled.
Jan 10 16:32:31 kernel: amdgpu: Topology: Add APU node [0x0:0x0]
Jan 10 16:32:31 kernel: Console: switching to colour dummy device 80x25
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: vgaarb: deactivate vga console
Jan 10 16:32:31 kernel: [drm] initializing kernel modesetting (CARRIZO 
0x1002:0x9874 0x103C:0x8332 0xCA).
Jan 10 16:32:31 kernel: [drm] register mmio base: 0xF0400000
Jan 10 16:32:31 kernel: [drm] register mmio size: 262144
Jan 10 16:32:31 kernel: [drm] add ip block number 0 <vi_common>
Jan 10 16:32:31 kernel: [drm] add ip block number 1 <gmc_v8_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 2 <cz_ih>
Jan 10 16:32:31 kernel: [drm] add ip block number 3 <gfx_v8_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 4 <sdma_v3_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 5 <powerplay>
Jan 10 16:32:31 kernel: [drm] add ip block number 6 <dm>
Jan 10 16:32:31 kernel: [drm] add ip block number 7 <uvd_v6_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 8 <vce_v3_0>
Jan 10 16:32:31 kernel: [drm] add ip block number 9 <acp_ip>
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: Fetched VBIOS from VFCT
Jan 10 16:32:31 kernel: amdgpu: ATOM BIOS: 113-C75100-031
Jan 10 16:32:31 kernel: [drm] UVD is enabled in physical mode
Jan 10 16:32:31 kernel: [drm] VCE enabled in physical mode
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: Trusted Memory Zone 
(TMZ) feature not supported
Jan 10 16:32:31 kernel: [drm] vm size is 64 GB, 2 levels, block size is 
10-bit, fragment size is 9-bit
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: VRAM: 512M 
0x000000F400000000 - 0x000000F41FFFFFFF (512M used)
Jan 10 16:32:31 kernel: amdgpu 0000:00:01.0: amdgpu: GART: 1024M 
0x000000FF00000000 - 0x000000FF3FFFFFFF
Jan 10 16:32:31 kernel: [drm] Detected VRAM RAM=512M, BAR=512M
Jan 10 16:32:31 kernel: [drm] RAM width 64bits UNKNOWN
Jan 10 16:32:31 kernel: [drm] amdgpu: 512M of VRAM memory ready
Jan 10 16:32:31 kernel: [drm] amdgpu: 3704M of GTT memory ready.
Jan 10 16:32:31 kernel: [drm] GART: num cpu pages 262144, num gpu pages 
262144
Jan 10 16:32:31 kernel: [drm] PCIE GART of 1024M enabled (table at 
0x000000F400600000).
Jan 10 16:32:31 kernel: RPC: Registered named UNIX socket transport module.
Jan 10 16:32:31 kernel: RPC: Registered udp transport module.
Jan 10 16:32:31 kernel: RPC: Registered tcp transport module.
Jan 10 16:32:31 kernel: RPC: Registered tcp NFSv4.1 backchannel 
transport module.
Jan 10 16:32:31 kernel: amdgpu: hwmgr_sw_init smu backed is smu8_smu
Jan 10 16:32:31 kernel: [drm] Found UVD firmware Version: 1.91 Family ID: 11
Jan 10 16:32:31 kernel: [drm] UVD ENC is disabled
Jan 10 16:32:31 kernel: [drm] Found VCE firmware Version: 52.4 Binary ID: 3
Jan 10 16:32:31 kernel: amdgpu: smu version 27.18.00
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Engine clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         300000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         480000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         533340
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         576000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         626090
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         685720
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         720000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         757900
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Display clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         300000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         400000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         496560
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         626090
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         685720
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         757900
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         800000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         847060
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: values for Memory clock
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         667000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:         933000
Jan 10 16:32:31 kernel: [drm] DM_PPLIB: Validation clocks:
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    engine_max_clock: 75790
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    memory_max_clock: 93300
Jan 10 16:32:31 kernel: [drm] DM_PPLIB:    level           : 8
Jan 10 16:32:31 kernel: [drm] Display Core initialized with v3.2.215!
Jan 10 16:32:31 kernel: snd_hda_intel 0000:00:01.1: bound 0000:00:01.0 
(ops amdgpu_dm_audio_component_bind_ops [amdgpu])
Jan 10 16:32:31 kernel: [drm] UVD initialized successfully.
Jan 10 16:32:31 kernel: [drm] VCE initialized successfully.
Jan 10 16:32:31 kernel: kfd kfd: amdgpu: Allocated 3969056 bytes on gart
Jan 10 16:32:31 kernel: amdgpu: sdma_bitmap: f
Jan 10 16:32:31 kernel: BUG: kernel NULL pointer dereference, address: 
000000000000003c
Jan 10 16:32:31 kernel: #PF: supervisor read access in kernel mode
Jan 10 16:32:31 kernel: #PF: error_code(0x0000) - not-present page
Jan 10 16:32:31 kernel: PGD 0 P4D 0
Jan 10 16:32:31 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 10 16:32:31 kernel: CPU: 0 PID: 645 Comm: systemd-udevd Not tainted 
6.2.0-rc3+ #92
Jan 10 16:32:31 kernel: Hardware name: HP HP Laptop 15-bw0xx/8332, BIOS 
F.52 12/03/2019
Jan 10 16:32:31 kernel: RIP: 0010:pci_dev_specific_acs_enabled+0x36/0x80
Jan 10 16:32:31 kernel: Code: 6d a9 44 0f b7 e6 55 48 89 fd 53 48 c7 c3 
a0 0a 0d aa eb 13 66 83 f8 ff 74 16 48 8b 53 18 48 83 c3 10 48 85 d2 74 
31 0f b7 03 <66> 39 45 3c 75 e4 0f b7 43 02 66 39 45 3e 74 06 66 83 f8 
ff 75 da
Jan 10 16:32:31 kernel: RSP: 0018:ffffa8e9806ef938 EFLAGS: 00010046
Jan 10 16:32:31 kernel: RAX: 0000000000001002 RBX: ffffffffaa0d0aa0 RCX: 
0000000000000000
Jan 10 16:32:31 kernel: RDX: ffffffffa96d1590 RSI: 0000000000000014 RDI: 
0000000000000000
Jan 10 16:32:31 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: 
0000000000000000
Jan 10 16:32:31 kernel: R10: 0000000000000000 R11: ffffffffa9bf4220 R12: 
0000000000000014
Jan 10 16:32:31 kernel: R13: ffff938f90643800 R14: ffff938f41366100 R15: 
ffff938f90643960
Jan 10 16:32:31 kernel: FS:  00007feff3f6cb40(0000) 
GS:ffff939037400000(0000) knlGS:0000000000000000
Jan 10 16:32:31 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 16:32:31 kernel: CR2: 000000000000003c CR3: 000000010b8a8000 CR4: 
00000000001506f0
Jan 10 16:32:31 kernel: Call Trace:
Jan 10 16:32:31 kernel:  <TASK>
Jan 10 16:32:31 kernel:  pci_acs_enabled+0x14/0x80
Jan 10 16:32:31 kernel:  pci_acs_path_enabled+0x35/0x60
Jan 10 16:32:31 kernel:  pci_enable_pasid+0x5d/0xe0
Jan 10 16:32:31 kernel:  amd_iommu_attach_device+0x26a/0x300
Jan 10 16:32:31 kernel:  __iommu_attach_device+0x1b/0x90
Jan 10 16:32:31 kernel:  iommu_attach_group+0x65/0xa0
Jan 10 16:32:31 kernel:  amd_iommu_init_device+0x16b/0x250 [iommu_v2]
Jan 10 16:32:31 kernel:  kfd_iommu_resume+0x4c/0x1a0 [amdgpu]
Jan 10 16:32:31 kernel:  kgd2kfd_resume_iommu+0x12/0x30 [amdgpu]
Jan 10 16:32:31 kernel:  kgd2kfd_device_init.cold+0x346/0x49a [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_amdkfd_device_init+0x142/0x1d0 [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_device_init.cold+0x19f5/0x1e21 [amdgpu]
Jan 10 16:32:31 kernel:  ? _raw_spin_lock_irqsave+0x23/0x50
Jan 10 16:32:31 kernel:  amdgpu_driver_load_kms+0x15/0x110 [amdgpu]
Jan 10 16:32:31 kernel:  amdgpu_pci_probe+0x161/0x370 [amdgpu]
Jan 10 16:32:31 kernel:  local_pci_probe+0x41/0x80
Jan 10 16:32:31 kernel:  pci_device_probe+0xb3/0x220
Jan 10 16:32:31 kernel:  really_probe+0xde/0x380
Jan 10 16:32:31 kernel:  ? pm_runtime_barrier+0x50/0x90
Jan 10 16:32:31 kernel:  __driver_probe_device+0x78/0x170
Jan 10 16:32:31 kernel:  driver_probe_device+0x1f/0x90
Jan 10 16:32:31 kernel:  __driver_attach+0xce/0x1c0
Jan 10 16:32:31 kernel:  ? __pfx___driver_attach+0x10/0x10
Jan 10 16:32:31 kernel:  bus_for_each_dev+0x73/0xa0
Jan 10 16:32:31 kernel:  bus_add_driver+0x1ae/0x200
Jan 10 16:32:31 kernel:  driver_register+0x89/0xe0
Jan 10 16:32:31 kernel:  ? __pfx_init_module+0x10/0x10 [amdgpu]
Jan 10 16:32:31 kernel:  do_one_initcall+0x59/0x230
Jan 10 16:32:31 kernel:  do_init_module+0x4a/0x200
Jan 10 16:32:31 kernel:  __do_sys_init_module+0x157/0x180
Jan 10 16:32:31 kernel:  do_syscall_64+0x3a/0x90
Jan 10 16:32:31 kernel:  entry_SYSCALL_64_after_hwframe+0x72/0xdc
Jan 10 16:32:31 kernel: RIP: 0033:0x7feff3aede4e
Jan 10 16:32:31 kernel: Code: 48 8b 0d e5 5f 0c 00 f7 d8 64 89 01 48 83 
c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 
00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b2 5f 0c 00 f7 d8 64 
89 01 48
Jan 10 16:32:31 kernel: RSP: 002b:00007ffcfa200958 EFLAGS: 00000246 
ORIG_RAX: 00000000000000af
Jan 10 16:32:31 kernel: RAX: ffffffffffffffda RBX: 0000556204a64420 RCX: 
00007feff3aede4e
Jan 10 16:32:31 kernel: RDX: 00007feff3fa7453 RSI: 0000000016ba2751 RDI: 
00007fefc4192010
Jan 10 16:32:31 kernel: RBP: 00007feff3fa7453 R08: 27d4eb2f165667c5 R09: 
85ebca77c2b2ae63
Jan 10 16:32:31 kernel: R10: 0000000000070121 R11: 0000000000000246 R12: 
0000000000020000
Jan 10 16:32:31 kernel: R13: 0000556204960ef0 R14: 0000000000000000 R15: 
0000556204a52ef0
Jan 10 16:32:31 kernel:  </TASK>
Jan 10 16:32:31 kernel: Modules linked in: ip_set nf_tables nfnetlink 
sunrpc amdgpu(+) iwlmvm mac80211 nls_ascii vfat fat libarc4 uvcvideo 
iwlwifi videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videodev btusb 
btrtl snd_ctl_led snd_hda_codec_realtek btbcm snd_hda_codec_generic 
btintel i2c_algo_bit snd_hda_codec_hdmi ledtrig_audio videobuf2_common 
drm_ttm_helper bluetooth ttm snd_hda_intel mc snd_intel_dspcfg cfg80211 
snd_hda_codec edac_mce_amd iommu_v2 snd_hwdep mfd_core snd_hda_core 
drm_buddy gpu_sched wmi_bmof snd_seq pcspkr fam15h_power k10temp rfkill 
drm_display_helper snd_seq_device snd_pcm cec snd_timer drm_kms_helper 
i2c_scmi snd soundcore acpi_cpufreq drm zram hid_logitech_hidpp 
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sd_mod 
r8169 t10_pi sha512_ssse3 crc64_rocksoft_generic wdat_wdt crc64_rocksoft 
hid_logitech_dj crc64 sp5100_tco video wmi fuse dm_multipath
Jan 10 16:32:31 kernel: CR2: 000000000000003c
Jan 10 16:32:31 kernel: ---[ end trace 0000000000000000 ]---
Jan 10 16:32:31 kernel: RIP: 0010:pci_dev_specific_acs_enabled+0x36/0x80
Jan 10 16:32:31 kernel: Code: 6d a9 44 0f b7 e6 55 48 89 fd 53 48 c7 c3 
a0 0a 0d aa eb 13 66 83 f8 ff 74 16 48 8b 53 18 48 83 c3 10 48 85 d2 74 
31 0f b7 03 <66> 39 45 3c 75 e4 0f b7 43 02 66 39 45 3e 74 06 66 83 f8 
ff 75 da
Jan 10 16:32:31 kernel: RSP: 0018:ffffa8e9806ef938 EFLAGS: 00010046
Jan 10 16:32:31 kernel: RAX: 0000000000001002 RBX: ffffffffaa0d0aa0 RCX: 
0000000000000000
Jan 10 16:32:31 kernel: RDX: ffffffffa96d1590 RSI: 0000000000000014 RDI: 
0000000000000000
Jan 10 16:32:31 kernel: RBP: 0000000000000000 R08: 0000000000000002 R09: 
0000000000000000
Jan 10 16:32:31 kernel: R10: 0000000000000000 R11: ffffffffa9bf4220 R12: 
0000000000000014
Jan 10 16:32:31 kernel: R13: ffff938f90643800 R14: ffff938f41366100 R15: 
ffff938f90643960
Jan 10 16:32:31 kernel: FS:  00007feff3f6cb40(0000) 
GS:ffff939037400000(0000) knlGS:0000000000000000
Jan 10 16:32:31 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 10 16:32:31 kernel: CR2: 000000000000003c CR3: 000000010b8a8000 CR4: 
00000000001506f0

This trace looked similar to those of the previous warnings from 
amd_iommu_attach_device downwards. I'm attaching the full kernel log 
from that boot with 6.2-rc3 with 0001-for-debug-purpose-only.patch. I'm 
ccing the others involved in case this might be relevant to them.
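
For readers decoding the oops above, here is a minimal sketch that cross-checks the register dump against the fault address. It is an illustration, not part of the attached log or of any patch. The register values are copied from the trace. The displacement 0x3c comes from the faulting bytes "66 39 45 3c" in the Code: line, a 16-bit compare of %ax against 0x3c(%rbp). The helper names (parse_registers, decode_acs) are made up for this example. That RBP holds the struct pci_dev pointer, and that offset 0x3c is its vendor field in this build, are assumptions inferred from the disassembly rather than something the log states.

```python
import re

# Register lines copied verbatim from the oops above (mail wrapping undone).
OOPS = """\
BUG: kernel NULL pointer dereference, address: 000000000000003c
RAX: 0000000000001002 RBX: ffffffffaa0d0aa0 RCX: 0000000000000000
RDX: ffffffffa96d1590 RSI: 0000000000000014 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000000
CR2: 000000000000003c CR3: 000000010b8a8000 CR4: 00000000001506f0
"""

# Displacement of the faulting instruction "<66> 39 45 3c",
# i.e. cmp %ax,0x3c(%rbp).
FAULT_DISP = 0x3C

# ACS capability bits as defined in include/uapi/linux/pci_regs.h.
PCI_ACS_FLAGS = {
    0x01: "SV", 0x02: "TB", 0x04: "RR", 0x08: "CR",
    0x10: "UF", 0x20: "EC", 0x40: "DT",
}


def parse_registers(text):
    """Return {name: value} for every 'NAME: <16 hex digits>' pair."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def decode_acs(flags):
    """Return the names of the ACS bits set in flags, lowest bit first."""
    return [name for bit, name in sorted(PCI_ACS_FLAGS.items()) if flags & bit]


regs = parse_registers(OOPS)

# Base register plus displacement should reproduce the faulting address (CR2).
fault_addr = regs["RBP"] + FAULT_DISP
print(hex(fault_addr), fault_addr == regs["CR2"])

# RSI is the second integer argument in the x86-64 calling convention;
# assumed here to be the ACS flags passed down from pci_enable_pasid.
print(decode_acs(regs["RSI"]))

# RAX is the 16-bit value being compared; 0x1002 is the AMD/ATI PCI vendor ID.
print(hex(regs["RAX"]))
```

Under those assumptions the numbers are consistent with a NULL struct pci_dev pointer reaching pci_dev_specific_acs_enabled: RBP and RDI are both zero, the access at 0 + 0x3c matches CR2, and RSI decodes to the RR and UF ACS bits.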

Thanks,

Matt

On 1/10/23 03:41, Baolu Lu wrote:
> [offlist]
>
> Can you please try below tests?
>
> 1. with a fresh v6.2-rc3, git revert 201007ef707a
>
> 2. With a fresh v6.2-rc3, apply attached patch.
>
> -- 
> Best regards,
> baolu
>
> On 2023/1/10 16:06, Matt Fagnani wrote:
>> Baolu,
>>
>> I tried to apply your patch after checking out 6.2-rc3 and 
>> origin/master but there were the following errors.
>>
>> git apply amd-iommu-amdgpu-boot-crash-2.patch
>> error: patch failed: drivers/pci/ats.c:382
>> error: drivers/pci/ats.c: patch does not apply
>>
>> I manually changed drivers/pci/ats.c as shown in the patch. I built 
>> 6.2-rc3 + the patch. 6.2-rc3 with the patch had the same black screen 
>> problem when booting. I added rd.driver.blacklist=amdgpu on the 
>> kernel command line to prevent amdgpu from being started while the 
>> initramfs was in use, and the black screen happened later in the boot 
>> as I described in my previous email. The journal showed the same two 
>> warnings and null pointer dereference which made amdgpu crash as I 
>> reported.
>>
>> Thanks,
>>
>> Matt
>>
>>
>>
View attachment "6.2-rc3-0001-for-debug-purpose-only.patch-journalctl-b-1-k.txt" of type "text/plain" (101847 bytes)
