Message-ID: <ad725f4b-6e45-42e4-ba6d-919534bc99a4@nvidia.com>
Date: Tue, 22 Apr 2025 17:14:11 +1000
From: Balbir Singh <balbirs@...dia.com>
To: Bharata B Rao <bharata@....com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, luto@...nel.org,
 peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 x86@...nel.org, hpa@...or.com, nikunj@....com, kees@...nel.org,
 alexander.deucher@....com
Subject: Re: AMD GPU driver load hitting BUG_ON in sync_global_pgds_l5()

On 4/22/25 16:34, Bharata B Rao wrote:
> Hi,
> 
> Nikunj and I have been debugging an issue seen during AMD GPU driver loading where we see the below failure:
> 
> -----------------------------------------
> kernel BUG at arch/x86/mm/init_64.c:173!
> invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> CPU: 4 PID: 1222 Comm: modprobe Tainted: G            E      6.8.12+ #3
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b70-prebuilt.qemu.org 04/01/2014
> RIP: 0010:sync_global_pgds+0x343/0x560
> Code: fb 66 9e 01 49 89 c0 48 89 f8 0f 1f 00 48 23 05 4b 92 9f 01 48 25 00 f0 ff ff 48 03 05 de 66 9e 01 4c 39 c0 0f 84 c8 fd ff ff <0f> 0b 49 8b 75 00 4c 89 ff e8 af 62 ff ff 90 e9 d3 fd ff ff 48 8b
> RSP: 0018:ff52bf8d40a7f4e8 EFLAGS: 00010206
> RAX: ff29cef78ad1a000 RBX: fffff1458477e080 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000010ad1a067
> RBP: ff52bf8d40a7f530 R08: ff29cef78a0d0000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000000 R12: ff29cef79bd8322c
> R13: ffffffffafc3c000 R14: 0000314480400000 R15: ff29cef79df82000
> FS:  00007e1c04bf8000(0000) GS:ff29cfe72ea00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007e7e161f2a50 CR3: 0000000112c9a004 CR4: 0000000000771ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  ? show_regs+0x72/0x90
>  ? die+0x38/0xb0
>  ? do_trap+0xe3/0x100
>  ? do_error_trap+0x75/0xb0
>  ? sync_global_pgds+0x343/0x560
>  ? exc_invalid_op+0x53/0x80
>  ? sync_global_pgds+0x343/0x560
>  ? asm_exc_invalid_op+0x1b/0x20
>  ? sync_global_pgds+0x343/0x560
>  ? sync_global_pgds+0x2d4/0x560
>  vmemmap_populate+0x73/0xd0
>  __populate_section_memmap+0x1fc/0x440
>  sparse_add_section+0x155/0x390
>  __add_pages+0xd1/0x190
>  add_pages+0x17/0x70
>  memremap_pages+0x471/0x6d0
>  devm_memremap_pages+0x23/0x70
>  kgd2kfd_init_zone_device+0x14a/0x270 [amdgpu]
>  amdgpu_device_init+0x3042/0x3150 [amdgpu]
>  ? do_pci_enable_device+0xcc/0x110
>  amdgpu_driver_load_kms+0x1a/0x1c0 [amdgpu]
>  amdgpu_pci_probe+0x1ba/0x610 [amdgpu]
>  ? _raw_spin_unlock_irqrestore+0x11/0x60
>  local_pci_probe+0x4b/0xb0
>  pci_device_probe+0xc8/0x290
>  really_probe+0x1d5/0x440
>  __driver_probe_device+0x8a/0x190
>  driver_probe_device+0x23/0xd0
>  __driver_attach+0x10f/0x220
>  ? __pfx___driver_attach+0x10/0x10
>  bus_for_each_dev+0x7d/0xe0
>  driver_attach+0x1e/0x30
>  bus_add_driver+0x14e/0x290
>  driver_register+0x64/0x140
>  ? __pfx_amdgpu_init+0x10/0x10 [amdgpu]
>  __pci_register_driver+0x61/0x70
>  amdgpu_init+0x69/0xff0 [amdgpu]
>  do_one_initcall+0x49/0x330
>  ? kmalloc_trace+0x136/0x380
>  do_init_module+0x99/0x2b0
>  load_module+0x241e/0x24e0
>  init_module_from_file+0x9a/0x100
>  ? init_module_from_file+0x9a/0x100
>  idempotent_init_module+0x184/0x240
>  __x64_sys_finit_module+0x64/0xd0
>  x64_sys_call+0x1c4c/0x2660
>  do_syscall_64+0x80/0x170
>  ? ksys_mmap_pgoff+0x123/0x270
>  ? do_syscall_64+0x8c/0x170
>  ? syscall_exit_to_user_mode+0x83/0x260
>  ? do_syscall_64+0x8c/0x170
>  ? do_syscall_64+0x8c/0x170
>  ? exc_page_fault+0x95/0x1b0
>  entry_SYSCALL_64_after_hwframe+0x78/0x80
> RIP: 0033:0x7e1c0431e88d
> Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 73 b5 0f 00 f7 d8 64 89 01 48
> RSP: 002b:00007fffa97770b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
> RAX: ffffffffffffffda RBX: 00006198887830f0 RCX: 00007e1c0431e88d
> RDX: 0000000000000000 RSI: 0000619887b43cd2 RDI: 000000000000000e
> RBP: 0000000000040000 R08: 0000000000000000 R09: 0000000000000002
> R10: 000000000000000e R11: 0000000000000246 R12: 0000619887b43cd2
> R13: 0000619888783220 R14: 0000619888782600 R15: 000061988878d190
>  </TASK>
> -----------------------------------------
> 
> A KVM guest (with 5 level page table enabled) is started with 8 GPUs (AMD GPU driver gets loaded) and CoralGemm workload (matrix multiplication stress) is run inside the guest. The guest is turned off after the workload run completes.
> 
> This test (start guest, run workload, turn off guest) is repeated hundreds of times, and approximately once in every 500 such runs the AMD GPU driver fails to load because it hits the above problem.
> 
> As part of GPU driver load, the GPU memory gets hotplugged. When struct page mappings are created in vmemmap for the newly added pages, the newly created PGD entry is synced into the per-process page tables. However, the kernel finds that a different mapping for that PGD entry already exists in one of the processes and hence hits the above BUG_ON.
> 
> The debug print from __add_pages() shows the pfn that is getting added and the number of pages like this:
> __add_pages pfn fffc010000 nr_pages 67043328 nid 0
> 
> Later in sync_global_pgds_l5(), the start and end addresses are coming out like this:
> start = 0x314480400000 end = 0x3144805fffff
> 
> These are essentially struct page addresses, and such addresses are unexpected here. The start address was obtained from pfn_to_page(), which for the sparsemem-vmemmap case is defined like this:
> 
> #define __pfn_to_page(pfn)      (vmemmap + (pfn))
> 
> When the problem is hit, vmemmap was found to have a value of 0xfffff14580000000. For the pfn value of 0xfffc010000,
> 
> start = 0xfffff14580000000(vmemmap) + 0xfffc010000(pfn) * 0x40(size of struct page) overflows (wraps around) and results in the start address of 0x314480400000.
> 
> This points to a problem with the vmemmap_base selection by KASLR in kernel_randomize_memory(). Once in a while, due to randomization, vmemmap_base gets such a high value that the struct page address for the hot-plugged pages overflows, resulting in an invalid address that causes problems later when the PGDs are synced.
> 
> The test ran for 1000 iterations when KASLR was disabled without hitting the issue.
> 
> At the outset, it appears that the selection of vmemmap_base doesn't consider whether there will be enough room to accommodate future hot-plugged pages.
> 
> Also, as per x86_64/mm.rst, for the 5-level page table case the range for vmemmap is ffd4000000000000 - ffd5ffffffffffff. Is it correct for vmemmap_base to start from a value outside the prescribed range, as seen in this case?
> 
> Any pointers on how to correctly address this issue?
> 
> 

Could you please confirm whether this is a new issue? It sounds like you're hitting it on 6.8.12+?
I've never tested this on a system with 5 levels of page tables, but with 5 levels you get
57 bits of VA, and you'll need to look at the KASLR logic (max_pfn + padding) to see where
your ranges are getting assigned.

I'd start by dumping the kaslr_regions array.

Balbir Singh

