Message-ID: <83cf7fc7-23e0-46f5-916b-5341a0ab9599@amd.com>
Date: Wed, 23 Apr 2025 15:00:17 +0530
From: Bharata B Rao <bharata@....com>
To: Dave Hansen <dave.hansen@...el.com>, linux-kernel@...r.kernel.org,
 linux-mm@...ck.org
Cc: Dave Hansen <dave.hansen@...ux.intel.com>, luto@...nel.org,
 peterz@...radead.org, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
 x86@...nel.org, hpa@...or.com, nikunj@....com,
 Balbir Singh <balbirs@...dia.com>, kees@...nel.org, alexander.deucher@....com
Subject: Re: AMD GPU driver load hitting BUG_ON in sync_global_pgds_l5()

On 22-Apr-25 8:43 PM, Dave Hansen wrote:
> On 4/21/25 23:34, Bharata B Rao wrote:
>> At the outset, it appears that the selection of vmemmap_base doesn't
>> take into account whether there will be enough room to accommodate
>> future hot-plugged pages.
> 
> Is this future hotplug area in the memory map at boot?

The KVM guest isn't using any -m maxmem option, if that's what you are 
hinting at.

BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
BIOS-e820: [mem 0x0000000000100000-0x000000007ffdafff] usable
BIOS-e820: [mem 0x000000007ffdb000-0x000000007fffffff] reserved
BIOS-e820: [mem 0x00000000b0000000-0x00000000bfffffff] reserved
BIOS-e820: [mem 0x00000000fed1c000-0x00000000fed1ffff] reserved
BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
BIOS-e820: [mem 0x0000000100000000-0x000000f4a3ffffff] usable
BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
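
FWIW, the hotplugged GPU ranges shown further below start at 
0xffe0080000000, which is well above anything in this e820 map, so as 
far as I can tell the hotplug area is not in the memory map at boot.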

kaslr_region: base[0] ff4552df80000000 size_tb 1000
kaslr_region: end[0] fffffffffffff
kaslr_region: base[1] ff69c69640000000 size_tb 3200
kaslr_region: base[2] ffd3140680000000 size_tb 40

So vmemmap_base is 0xffd3140680000000 (base[2] above).

Also, last_pfn and max_arch_pfn are reported like this:
last_pfn = 0x7ffdb max_arch_pfn = 0x10000000000
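
In case it helps to see where those numbers come from, here is my 
reconstruction of the sizing arithmetic as a small userspace program 
(assuming 52-bit MAX_PHYSMEM_BITS for 5-level paging and a 64-byte 
struct page; this is not the actual arch/x86/mm/kaslr.c code):

#include <stdio.h>

#define TB_SHIFT		40
#define MAX_PHYSMEM_BITS	52	/* 5-level paging */
#define VMALLOC_SIZE_TB		12800UL	/* VMALLOC_SIZE_TB_L5 */
#define STRUCT_PAGE_SIZE	64UL	/* typical sizeof(struct page) */

int main(void)
{
	/* Direct map: covers every possible physical address */
	unsigned long direct_tb = 1UL << (MAX_PHYSMEM_BITS - TB_SHIFT);

	/* vmemmap: one struct page per possible 4K pfn */
	unsigned long max_arch_pfn = 1UL << (MAX_PHYSMEM_BITS - 12);
	unsigned long vmemmap_tb =
		(max_arch_pfn * STRUCT_PAGE_SIZE) >> TB_SHIFT;

	printf("direct map: %#lx TB\n", direct_tb);	  /* 0x1000 */
	printf("vmalloc:    %#lx TB\n", VMALLOC_SIZE_TB); /* 0x3200 */
	printf("vmemmap:    %#lx TB\n", vmemmap_tb);	  /* 0x40 */
	return 0;
}

That matches the size_tb values (in hex) in the log above, and 
max_arch_pfn = 0x10000000000 is exactly 1 << (52 - 12).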

Here is some data for the hotplug that happens for the 8 GPUs.

The driver passes the following values for pgmap->range.start, 
pgmap->range.end and pgmap->type to devm_memremap_pages():

amdgpu: kgd2kfd_init_zone_device: start fffc010000000 end fffffffffffff type 1
amdgpu: kgd2kfd_init_zone_device: start fff8020000000 end fffc00fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start fff4030000000 end fff801fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start fff0040000000 end fff402fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start ffec050000000 end fff003fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start ffe8060000000 end ffec04fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start ffe4070000000 end ffe805fffffff type 1
amdgpu: kgd2kfd_init_zone_device: start ffe0080000000 end ffe406fffffff type 1
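
For reference, type 1 is MEMORY_DEVICE_PRIVATE. The registration 
pattern is roughly the following (a condensed sketch, not the exact 
kfd code; the device and ops names are placeholders):

	pgmap->type = MEMORY_DEVICE_PRIVATE;	/* "type 1" above */
	pgmap->range.start = res->start;	/* e.g. 0xffe0080000000 */
	pgmap->range.end = res->end;		/* e.g. 0xffe406fffffff */
	pgmap->nr_range = 1;
	pgmap->ops = &svm_migrate_pgmap_ops;

	r = devm_memremap_pages(adev->dev, pgmap); /* ends up in __add_pages() */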

The pfn and the number of pages added in response to each of the above:
__add_pages pfn fffc010000 nr_pages 67043328 nid 0
__add_pages pfn fff8020000 nr_pages 67043328 nid 0
__add_pages pfn fff4030000 nr_pages 67043328 nid 0
__add_pages pfn fff0040000 nr_pages 67043328 nid 0
__add_pages pfn ffec050000 nr_pages 67043328 nid 0
__add_pages pfn ffe8060000 nr_pages 67043328 nid 0
__add_pages pfn ffe4070000 nr_pages 67043328 nid 0
__add_pages pfn ffe0080000 nr_pages 67043328 nid 0
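
So each call adds a 0x3ff0000000-byte range: 67043328 pages * 4 KiB is 
about 255.75 GiB per GPU, or roughly 2 TiB across the 8 GPUs. With a 
64-byte struct page, that also means about 4 GiB of vmemmap per GPU 
(67043328 * 64 bytes).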


For the above vmemmap_base, the (first) start/end addresses seen in
sync_global_pgds_l5() for the 8 hotplug cases are as follows:
start ffd3540580400000, end = ffd35405805fffff
start ffd3540480800000, end = ffd35404809fffff
start ffd3540380c00000, end = ffd3540380dfffff
start ffd3540281000000, end = ffd35402811fffff
start ffd3540181400000, end = ffd35401815fffff
start ffd3540081800000, end = ffd35400819fffff
start ffd353ff81c00000, end = ffd353ff81dfffff
start ffd353fe82000000, end = ffd353fe821fffff
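
As a sanity check, these line up with vmemmap_base + pfn * 
sizeof(struct page), assuming the usual 64-byte struct page: for the 
first range, 0xffd3140680000000 + 0xfffc010000 * 0x40 = 
0xffd3540580400000, which is exactly the first start address above.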

This is for the case that succeeds; I have shown the same data for the 
failing case in the first mail of this thread.

When randomization results in a bad vmemmap_base address, the hotplug 
of the first page for the first GPU hits the BUG_ON.

Regards,
Bharata.


