Message-ID: <bae920c0-a0ff-4d85-a37a-6b8518c0ac41@amd.com>
Date: Tue, 22 Apr 2025 12:04:53 +0530
From: Bharata B Rao <bharata@....com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, luto@...nel.org,
peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
x86@...nel.org, hpa@...or.com, nikunj@....com,
Balbir Singh <balbirs@...dia.com>, kees@...nel.org, alexander.deucher@....com
Subject: AMD GPU driver load hitting BUG_ON in sync_global_pgds_l5()
Hi,
Nikunj and I have been debugging an issue during AMD GPU driver
loading where we hit the following failure:
-----------------------------------------
kernel BUG at arch/x86/mm/init_64.c:173!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 1222 Comm: modprobe Tainted: G E 6.8.12+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.16.3-0-ga6ed6b70-prebuilt.qemu.org 04/01/2014
RIP: 0010:sync_global_pgds+0x343/0x560
Code: fb 66 9e 01 49 89 c0 48 89 f8 0f 1f 00 48 23 05 4b 92 9f 01 48 25
00 f0 ff ff 48 03 05 de 66 9e 01 4c 39 c0 0f 84 c8 fd ff ff <0f> 0b 49
8b 75 00 4c 89 ff e8 af 62 ff ff 90 e9 d3 fd ff ff 48 8b
RSP: 0018:ff52bf8d40a7f4e8 EFLAGS: 00010206
RAX: ff29cef78ad1a000 RBX: fffff1458477e080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000010ad1a067
RBP: ff52bf8d40a7f530 R08: ff29cef78a0d0000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff29cef79bd8322c
R13: ffffffffafc3c000 R14: 0000314480400000 R15: ff29cef79df82000
FS: 00007e1c04bf8000(0000) GS:ff29cfe72ea00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007e7e161f2a50 CR3: 0000000112c9a004 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
<TASK>
? show_regs+0x72/0x90
? die+0x38/0xb0
? do_trap+0xe3/0x100
? do_error_trap+0x75/0xb0
? sync_global_pgds+0x343/0x560
? exc_invalid_op+0x53/0x80
? sync_global_pgds+0x343/0x560
? asm_exc_invalid_op+0x1b/0x20
? sync_global_pgds+0x343/0x560
? sync_global_pgds+0x2d4/0x560
vmemmap_populate+0x73/0xd0
__populate_section_memmap+0x1fc/0x440
sparse_add_section+0x155/0x390
__add_pages+0xd1/0x190
add_pages+0x17/0x70
memremap_pages+0x471/0x6d0
devm_memremap_pages+0x23/0x70
kgd2kfd_init_zone_device+0x14a/0x270 [amdgpu]
amdgpu_device_init+0x3042/0x3150 [amdgpu]
? do_pci_enable_device+0xcc/0x110
amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
amdgpu_pci_probe+0x1ba/0x610 [amdgpu]
? _raw_spin_unlock_irqrestore+0x11/0x60
local_pci_probe+0x4b/0xb0
pci_device_probe+0xc8/0x290
really_probe+0x1d5/0x440
__driver_probe_device+0x8a/0x190
driver_probe_device+0x23/0xd0
__driver_attach+0x10f/0x220
? __pfx___driver_attach+0x10/0x10
bus_for_each_dev+0x7d/0xe0
driver_attach+0x1e/0x30
bus_add_driver+0x14e/0x290
driver_register+0x64/0x140
? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
__pci_register_driver+0x61/0x70
amdgpu_init+0x69/0xff0 [amdgpu]
do_one_initcall+0x49/0x330
? kmalloc_trace+0x136/0x380
do_init_module+0x99/0x2b0
load_module+0x241e/0x24e0
init_module_from_file+0x9a/0x100
? init_module_from_file+0x9a/0x100
idempotent_init_module+0x184/0x240
__x64_sys_finit_module+0x64/0xd0
x64_sys_call+0x1c4c/0x2660
do_syscall_64+0x80/0x170
? ksys_mmap_pgoff+0x123/0x270
? do_syscall_64+0x8c/0x170
? syscall_exit_to_user_mode+0x83/0x260
? do_syscall_64+0x8c/0x170
? do_syscall_64+0x8c/0x170
? exc_page_fault+0x95/0x1b0
entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x7e1c0431e88d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007fffa97770b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 00006198887830f0 RCX: 00007e1c0431e88d
RDX: 0000000000000000 RSI: 0000619887b43cd2 RDI: 000000000000000e
RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000002
R10: 000000000000000e R11: 0000000000000246 R12: 0000619887b43cd2
R13: 0000619888783220 R14: 0000619888782600 R15: 000061988878d190
</TASK>
-----------------------------------------
A KVM guest (with 5-level page tables enabled) is started with 8 GPUs
(the AMD GPU driver gets loaded) and the CoralGemm workload (matrix
multiplication stress) is run inside the guest. The guest is turned off
after the workload run completes.
This test (start guest, run workload, turn off guest) is repeated
hundreds of times, and approximately once in 500 such runs the AMD GPU
driver fails to load because it hits the above-mentioned problem.
As part of GPU driver load, the GPU memory gets hotplugged. When struct
page mappings are being created in vmemmap for the newly added pages,
the newly created PGD entry is synced into the per-process page
tables. However, the kernel finds that a different mapping for that PGD
entry already exists in one of the processes and hence hits the above BUG_ON.
The debug print from __add_pages() shows the pfn that is getting added
and the number of pages like this:
__add_pages pfn fffc010000 nr_pages 67043328 nid 0
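For reference, that pfn/nr_pages pair decodes as below (a quick sketch,
assuming 4 KiB base pages and sizeof(struct page) == 0x40):

```python
pfn = 0xfffc010000        # starting pfn from the debug print
nr_pages = 67043328       # == 0x3ff0000

# Physical range being hotplugged, assuming 4 KiB pages: ~256 GiB
phys_bytes = nr_pages * 0x1000

# vmemmap footprint needed for the struct pages, assuming
# sizeof(struct page) == 0x40: ~4 GiB of vmemmap space
vmemmap_bytes = nr_pages * 0x40

print(hex(phys_bytes), hex(vmemmap_bytes))
```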
Later in sync_global_pgds_l5(), the start and end addresses come out
as:
start = 0x314480400000 end = 0x3144805fffff
These are essentially addresses of struct page entries, and such
values are unexpected at this point. The start address was obtained
from pfn_to_page(), which for the sparsemem-vmemmap case is defined like this:
#define __pfn_to_page(pfn) (vmemmap + (pfn))
When the problem is hit, vmemmap was found to have the value
0xfffff14580000000. For the pfn value of 0xfffc010000,
start = 0xfffff14580000000 (vmemmap) + 0xfffc010000 (pfn) * 0x40 (sizeof
struct page) overflows (wraps around 64 bits) and results in the start
address of 0x314480400000.
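The wraparound can be reproduced with plain 64-bit arithmetic (a
sketch using the values observed above):

```python
MASK64 = (1 << 64) - 1            # model unsigned 64-bit pointer arithmetic

vmemmap = 0xfffff14580000000      # randomized vmemmap base when the bug hit
pfn     = 0xfffc010000            # first hotplugged pfn
sz      = 0x40                    # sizeof(struct page)

start = (vmemmap + pfn * sz) & MASK64
print(hex(start))                 # 0x314480400000, the bogus start address
```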
This points to a problem with vmemmap_base selection by KASLR in
kernel_randomize_memory(). Once in a while, due to randomization,
vmemmap_base gets such a high value that, when accommodating the
hot-plugged pages, the address calculation overflows, resulting in an
invalid address that causes problems later when the PGDs are synced.
With KASLR disabled, the test ran for 1000 iterations without hitting
the issue.
At the outset, it appears that the selection of vmemmap_base doesn't
consider whether there will be enough room to accommodate future
hot-plugged pages.
Also, as per x86_64/mm.rst, for the 5-level page table case the range
for vmemmap is ffd4000000000000 - ffd5ffffffffffff. Is it correct for
vmemmap_base to start at a value outside the prescribed range, as seen
in this case?
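For what it's worth, the observed base is indeed outside that
documented window, which is trivial to check:

```python
# Documented 5-level vmemmap window per Documentation/x86/x86_64/mm.rst
VMEMMAP_LO = 0xffd4000000000000
VMEMMAP_HI = 0xffd5ffffffffffff

observed_base = 0xfffff14580000000   # vmemmap value seen when the bug hits

print(VMEMMAP_LO <= observed_base <= VMEMMAP_HI)   # False
```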
Any pointers on how to correctly address this issue?
Regards,
Bharata.