Message-ID: <bae920c0-a0ff-4d85-a37a-6b8518c0ac41@amd.com>
Date: Tue, 22 Apr 2025 12:04:53 +0530
From: Bharata B Rao <bharata@....com>
To: linux-kernel@...r.kernel.org, linux-mm@...ck.org
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, luto@...nel.org,
 peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 x86@...nel.org, hpa@...or.com, nikunj@....com,
 Balbir Singh <balbirs@...dia.com>, kees@...nel.org, alexander.deucher@....com
Subject: AMD GPU driver load hitting BUG_ON in sync_global_pgds_l5()

Hi,

Nikunj and I have been debugging an issue seen during AMD GPU driver 
loading where we see the below failure:

-----------------------------------------
kernel BUG at arch/x86/mm/init_64.c:173!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 1222 Comm: modprobe Tainted: G            E      6.8.12+ #3
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b70-prebuilt.qemu.org 04/01/2014
RIP: 0010:sync_global_pgds+0x343/0x560
Code: fb 66 9e 01 49 89 c0 48 89 f8 0f 1f 00 48 23 05 4b 92 9f 01 48 25 00 f0 ff ff 48 03 05 de 66 9e 01 4c 39 c0 0f 84 c8 fd ff ff <0f> 0b 49 8b 75 00 4c 89 ff e8 af 62 ff ff 90 e9 d3 fd ff ff 48 8b
RSP: 0018:ff52bf8d40a7f4e8 EFLAGS: 00010206
RAX: ff29cef78ad1a000 RBX: fffff1458477e080 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000010ad1a067
RBP: ff52bf8d40a7f530 R08: ff29cef78a0d0000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ff29cef79bd8322c
R13: ffffffffafc3c000 R14: 0000314480400000 R15: ff29cef79df82000
FS:  00007e1c04bf8000(0000) GS:ff29cfe72ea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007e7e161f2a50 CR3: 0000000112c9a004 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
  <TASK>
  ? show_regs+0x72/0x90
  ? die+0x38/0xb0
  ? do_trap+0xe3/0x100
  ? do_error_trap+0x75/0xb0
  ? sync_global_pgds+0x343/0x560
  ? exc_invalid_op+0x53/0x80
  ? sync_global_pgds+0x343/0x560
  ? asm_exc_invalid_op+0x1b/0x20
  ? sync_global_pgds+0x343/0x560
  ? sync_global_pgds+0x2d4/0x560
  vmemmap_populate+0x73/0xd0
  __populate_section_memmap+0x1fc/0x440
  sparse_add_section+0x155/0x390
  __add_pages+0xd1/0x190
  add_pages+0x17/0x70
  memremap_pages+0x471/0x6d0
  devm_memremap_pages+0x23/0x70
  kgd2kfd_init_zone_device+0x14a/0x270 [amdgpu]
  amdgpu_device_init+0x3042/0x3150 [amdgpu]
  ? do_pci_enable_device+0xcc/0x110
  amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
  amdgpu_pci_probe+0x1ba/0x610 [amdgpu]
  ? _raw_spin_unlock_irqrestore+0x11/0x60
  local_pci_probe+0x4b/0xb0
  pci_device_probe+0xc8/0x290
  really_probe+0x1d5/0x440
  __driver_probe_device+0x8a/0x190
  driver_probe_device+0x23/0xd0
  __driver_attach+0x10f/0x220
  ? __pfx___driver_attach+0x10/0x10
  bus_for_each_dev+0x7d/0xe0
  driver_attach+0x1e/0x30
  bus_add_driver+0x14e/0x290
  driver_register+0x64/0x140
  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
  __pci_register_driver+0x61/0x70
  amdgpu_init+0x69/0xff0 [amdgpu]
  do_one_initcall+0x49/0x330
  ? kmalloc_trace+0x136/0x380
  do_init_module+0x99/0x2b0
  load_module+0x241e/0x24e0
  init_module_from_file+0x9a/0x100
  ? init_module_from_file+0x9a/0x100
  idempotent_init_module+0x184/0x240
  __x64_sys_finit_module+0x64/0xd0
  x64_sys_call+0x1c4c/0x2660
  do_syscall_64+0x80/0x170
  ? ksys_mmap_pgoff+0x123/0x270
  ? do_syscall_64+0x8c/0x170
  ? syscall_exit_to_user_mode+0x83/0x260
  ? do_syscall_64+0x8c/0x170
  ? do_syscall_64+0x8c/0x170
  ? exc_page_fault+0x95/0x1b0
  entry_SYSCALL_64_after_hwframe+0x78/0x80
RIP: 0033:0x7e1c0431e88d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007fffa97770b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 00006198887830f0 RCX: 00007e1c0431e88d
RDX: 0000000000000000 RSI: 0000619887b43cd2 RDI: 000000000000000e
RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000002
R10: 000000000000000e R11: 0000000000000246 R12: 0000619887b43cd2
R13: 0000619888783220 R14: 0000619888782600 R15: 000061988878d190
  </TASK>
-----------------------------------------

A KVM guest (with 5-level page tables enabled) is started with 8 GPUs, 
so the AMD GPU driver gets loaded, and the CoralGemm workload (a matrix 
multiplication stress test) is run inside the guest. The guest is turned 
off after the workload run completes.

This test (start guest, run workload, turn off guest) is repeated 
hundreds of times, and approximately once in 500 runs the AMD GPU 
driver fails to load because it hits the problem shown above.

As part of GPU driver load, the GPU memory gets hotplugged. When struct 
page mappings are created in vmemmap for the newly added pages, the 
newly created PGD entry is synced with the per-process page tables. 
However, the kernel finds that a different mapping for that PGD entry 
already exists in one of the processes and hence hits the above BUG_ON.

The debug print from __add_pages() shows the pfn that is getting added 
and the number of pages like this:
__add_pages pfn fffc010000 nr_pages 67043328 nid 0
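
To put the debug print in perspective, here is a minimal Python sketch
that converts the pfn and page count into a physical address range and
the size of the struct page array needed to describe it. It assumes
4 KiB base pages (PAGE_SHIFT = 12, not stated in this report) and uses
the 0x40-byte struct page size quoted later in this message.

```python
PAGE_SHIFT = 12          # assumption: 4 KiB base pages
STRUCT_PAGE_SIZE = 0x40  # size of struct page as quoted in this report

pfn = 0xfffc010000       # from the __add_pages() debug print
nr_pages = 67043328      # from the __add_pages() debug print

# Physical address range covered by the hot-plugged pages.
phys_start = pfn << PAGE_SHIFT
phys_end = (pfn + nr_pages) << PAGE_SHIFT

# Size of the region and of the struct page array that describes it.
region_gib = (nr_pages << PAGE_SHIFT) / 2**30
memmap_mib = (nr_pages * STRUCT_PAGE_SIZE) / 2**20

print(f"phys range : {phys_start:#x} - {phys_end:#x}")
print(f"region size: {region_gib} GiB")
print(f"memmap size: {memmap_mib} MiB")
```

Under these assumptions the region is 255.75 GiB, ends exactly at 2^52,
and needs 4092 MiB of struct page space in vmemmap.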

Later in sync_global_pgds_l5(), the start and end addresses are coming 
out like this:
start = 0x314480400000 end = 0x3144805fffff

These are supposed to be the addresses of struct page entries, and such 
values are unexpected for page pointers. The start address was obtained 
from pfn_to_page(), which for the sparsemem case is defined like this:

#define __pfn_to_page(pfn)      (vmemmap + (pfn))

When the problem is hit, vmemmap was found to have a value of 
0xfffff14580000000. For the pfn value of 0xfffc010000,

start = 0xfffff14580000000 (vmemmap) + 0xfffc010000 (pfn) * 0x40 (size of 
struct page)

overflows 64 bits (wraps around) and results in the start address 
of 0x314480400000.
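
The wraparound can be reproduced with a few lines of Python. This is
only a sketch of the arithmetic: the helper name pfn_to_page_addr() is
made up for illustration, and the 64-bit mask stands in for C pointer
arithmetic on a 64-bit kernel.

```python
MASK64 = (1 << 64) - 1   # model 64-bit pointer wraparound
STRUCT_PAGE_SIZE = 0x40  # size of struct page as quoted in this report


def pfn_to_page_addr(vmemmap_base, pfn):
    """Hypothetical model of `vmemmap + (pfn)` as a 64-bit address."""
    return (vmemmap_base + pfn * STRUCT_PAGE_SIZE) & MASK64


vmemmap = 0xfffff14580000000  # value observed when the problem is hit
pfn = 0xfffc010000            # pfn being hot-added

offset = pfn * STRUCT_PAGE_SIZE       # byte offset into vmemmap
wrapped = vmemmap + offset > MASK64   # True if the sum exceeds 64 bits
start = pfn_to_page_addr(vmemmap, pfn)

print(f"offset = {offset:#x}")
print(f"start  = {start:#x} (wrapped: {wrapped})")
```

The offset comes out as 0x3fff00400000, and adding it to vmemmap wraps
to 0x314480400000, which matches the start address seen in
sync_global_pgds_l5() and the value in R14 in the register dump.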

This points to a problem with the vmemmap_base selection done by KASLR 
in kernel_randomize_memory(). Once in a while, randomization places 
vmemmap_base so high that the struct page address computed for the 
hot-plugged pages overflows, producing an invalid address that later 
triggers the failure when the PGDs are synced.

With KASLR disabled, the test ran for 1000 iterations without hitting 
the issue.

At first glance, the selection of vmemmap_base does not appear to 
consider whether there will be enough room to accommodate future 
hot-plugged pages.

Also, as per x86_64/mm.rst, in the 5-level page table case the range 
for vmemmap is ffd4000000000000 - ffd5ffffffffffff. Is it correct for 
vmemmap_base to start at a value outside the documented range, 
as seen in this case?
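
Both questions above can be checked numerically. The following Python
sketch is illustrative only: the helper names are made up, and it
simply computes how many struct page slots fit between a given vmemmap
base and the top of the 64-bit address space, and whether the base lies
inside the range quoted from x86_64/mm.rst.

```python
STRUCT_PAGE_SIZE = 0x40            # size of struct page as quoted in this report
DOC_START = 0xffd4000000000000     # documented 5-level vmemmap range start
DOC_END = 0xffd5ffffffffffff       # documented 5-level vmemmap range end


def first_wrapping_pfn(vmemmap_base):
    """Hypothetical helper: lowest pfn whose struct page address wraps."""
    return ((1 << 64) - vmemmap_base) // STRUCT_PAGE_SIZE


def in_documented_range(addr):
    """Hypothetical helper: is addr inside the documented vmemmap range?"""
    return DOC_START <= addr <= DOC_END


vmemmap = 0xfffff14580000000  # value observed when the problem is hit
pfn = 0xfffc010000            # pfn being hot-added

limit = first_wrapping_pfn(vmemmap)
print(f"first wrapping pfn  : {limit:#x}")
print(f"hot-added pfn wraps : {pfn >= limit}")
print(f"base in doc range   : {in_documented_range(vmemmap)}")
```

For the observed base, any pfn at or above 0x3aea000000 wraps, so the
hot-added pfn 0xfffc010000 is well beyond the available headroom, and
the base itself lies above the documented range.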

Any pointers on how to correctly address this issue?

Regards,
Bharata.

