Message-ID: <d338d7a4-ca69-400b-86b5-35e46f6da2df@kernel.org>
Date: Fri, 19 Dec 2025 09:10:49 +0100
From: "David Hildenbrand (Red Hat)" <david@...nel.org>
To: Sourabh Jain <sourabhjain@...ux.ibm.com>,
lkml <linux-kernel@...r.kernel.org>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Borislav Petkov
<bp@...en8.de>, Heiko Carstens <hca@...ux.ibm.com>,
Madhavan Srinivasan <maddy@...ux.ibm.com>,
Michael Ellerman <mpe@...erman.id.au>, Muchun Song <muchun.song@...ux.dev>,
Oscar Salvador <osalvador@...e.de>,
"Ritesh Harjani (IBM)" <ritesh.list@...il.com>,
Vasily Gorbik <gor@...ux.ibm.com>
Subject: Re: mm/hugetlb: kernel fail to boot if total hugepages size is almost
equal to system RAM
On 12/18/25 17:19, Sourabh Jain wrote:
> Hello All,
>
> I observed a kernel boot failure when the total hugepages size is almost
> equal to the system RAM.
>
> For example, a Power system with 255 GB RAM failed to boot with the
> following kernel command-line arguments:
>
> default_hugepagesz=2M hugepagesz=2M hugepages=128512
>
> The failure occurred with the following logs:
>
> Booting a command list
>
> OF stdout device is: /vdevice/vty@...00000
> Preparing to boot Linux version 6.19.0-rc1+ (root@...t) (gcc (GCC), GNU
> ld version 2.35.2-63.el9) #4 SMP Thu Dec 18 09:02:16 CST 2025
> Detected machine type: 0000000000000101
> command line:
> BOOT_IMAGE=(ieee1275//vdevice/v-scsi@...00065/disk@...0000000000000,msdos2)/vmlinuz-6.19.0-rc1+
> root=/dev/mapper/r-root ro rd.lvm.lv=root/root rd.lvm.lv=root/swap
> biosdevname=0 loglevel=7 ignore_loglevel debug console=hvc0
> earlycon=hvc0 earlyprintk crashkernel=4G default_hugepagesz=2M
> hugepagesz=2M hugepages=128512
> Max number of cores passed to firmware: 256 (NR_CPUS = 2048)
> Calling ibm,client-architecture-support... done
> memory layout at init:
> memory_limit : 0000000000000000 (16 MB aligned)
> alloc_bottom : 0000000016050000
> alloc_top : 0000000030000000
> alloc_top_hi : 0000000030000000
> rmo_top : 0000000030000000
> ram_top : 0000000030000000
> instantiating rtas at 0x000000002ec50000... done
> prom_hold_cpus: skipped
> copying OF device tree...
> Building dt strings...
> Building dt structure...
> Device tree strings 0x0000000016060000 -> 0x0000000016061844
> Device tree struct 0x0000000016070000 -> 0x0000000016080000
> Quiescing Open Firmware ...
> Booting Linux via __start() @ 0x000000000a700000 ...
> [ 0.000000] printk: debug: ignoring loglevel setting.
> [ 0.000000] crashkernel reserved: 0x0000000018000000 -
> 0x0000000118000000 (4096 MB)
> [ 0.000000] radix-mmu: Page sizes from device-tree:
> [ 0.000000] radix-mmu: Page size shift = 12 AP=0x0
> [ 0.000000] radix-mmu: Page size shift = 16 AP=0x5
> [ 0.000000] radix-mmu: Page size shift = 21 AP=0x1
> [ 0.000000] radix-mmu: Page size shift = 30 AP=0x2
> [ 0.000000] Activating Kernel Userspace Access Prevention
> [ 0.000000] Activating Kernel Userspace Execution Prevention
> [ 0.000000] radix-mmu: Mapped 0x0000000000000000-0x0000000002800000
> with 2.00 MiB pages (exec)
> [ 0.000000] radix-mmu: Mapped 0x0000000002800000-0x0000003ffde00000
> with 2.00 MiB pages
> [ 0.000000] radix-mmu: Mapped 0x0000003ffde00000-0x0000003ffdff0000
> with 64.0 KiB pages
> [ 0.000000] radix-mmu: Mapped 0x0000003fffff0000-0x0000004000000000
> with 64.0 KiB pages
> [ 0.000000] radix-mmu: Mapped 0x0000003ffdff0000-0x0000003fffff0000
> with 64.0 KiB pages
> [ 0.000000] lpar: Using radix MMU under hypervisor
> [ 0.000000] Linux version 6.19.0-rc1+ (root) (gcc (GCC) GNU ld
> version 2.35.2-63.el9) #4 SMP Thu Dec 18 09:02:16 CST 2025
> [ 0.000000] OF: reserved mem: Reserved memory: No reserved-memory
> node in the DT
> [ 0.000000] Found initrd at 0xc00000000f800000:0xc000000016046afe
> [ 0.000000] Hardware name: hv:phyp pSeries
> [ 0.000000] printk: legacy bootconsole [udbg0] enabled
> [ 0.000000] Partition configured for 72 cpus.
> [ 0.000000] CPU maps initialized for 8 threads per core
> [ 0.000000] (thread shift is 3)
>
> <snip>
>
> [ 0.000000] Initmem setup node 28 as memoryless
> [ 0.000000] Initmem setup node 29 as memoryless
> [ 0.000000] Initmem setup node 30 as memoryless
> [ 0.000000] Initmem setup node 31 as memoryless
> [ 0.000000] percpu: Embedded 3 pages/cpu s126488 r0 d70120 u196608
> [ 0.000000] pcpu-alloc: s126488 r0 d70120 u196608 alloc=3*65536
> [ 0.000000] pcpu-alloc: [0] 00 [0] 01 [0] 02 [0] 03 [0] 04 [0] 05 [0]
> 06 [0] 07
> [ 0.000000] pcpu-alloc: [0] 08 [0] 09 [0] 10 [0] 11 [0] 12 [0] 13 [0]
> 14 [0] 15
> [ 0.000000] pcpu-alloc: [0] 16 [0] 17 [0] 18 [0] 19 [0] 20 [0] 21 [0]
> 22 [0] 23
> [ 0.000000] pcpu-alloc: [0] 24 [0] 25 [0] 26 [0] 27 [0] 28 [0] 29 [0]
> 30 [0] 31
> [ 0.000000] pcpu-alloc: [1] 32 [1] 33 [1] 34 [1] 35 [1] 36 [1] 37 [1]
> 38 [1] 39
> [ 0.000000] pcpu-alloc: [1] 40 [1] 41 [1] 42 [1] 43 [1] 44 [1] 45 [1]
> 46 [1] 47
> [ 0.000000] pcpu-alloc: [1] 48 [1] 49 [1] 50 [1] 51 [1] 52 [1] 53 [1]
> 54 [1] 55
> [ 0.000000] pcpu-alloc: [1] 56 [1] 57 [1] 58 [1] 59 [1] 60 [1] 61 [1]
> 62 [1] 63
> [ 0.000000] pcpu-alloc: [2] 64 [2] 65 [2] 66 [2] 67 [2] 68 [2] 69 [2]
> 70 [2] 71
> [ 0.000000] Kernel command line:
> BOOT_IMAGE=(ieee1275//vdevice/v-scsi@...00065/disk@...0000000000000,msdos2)/vmlinuz-6.19.0-rc1+
> root=/dev/mapper/root ro rd.lvm.lv=root/root rd.lvm.lv=root/swap
> biosdevname=0 loglevel=7 ignore_loglevel debug console=hvc0
> earlycon=hvc0 earlyprintk crashkernel=4G default_hugepagesz=2M
> hugepagesz=2M hugepages=128512
> [ 0.000000] Unknown kernel command line parameters "earlyprintk
> biosdevname=0", will be passed to user space.
> [ 0.000000] random: crng init done
> [ 0.000000] printk: log buffer data + meta data: 1048576 + 3670016 =
> 4718592 bytes
>
> <snip>
>
> [ 0.070655] thermal_sys: Registered thermal governor 'step_wise'
> [ 0.070709] cpuidle: using governor menu
> [ 0.070781] RTAS daemon started
> [ 0.070984] pstore: Using crash dump compression: deflate
> [ 0.070988] pstore: Registered nvram as persistent store backend
> [ 0.071386] EEH: pSeries platform initialized
> [ 0.071459] plpks: POWER LPAR Platform KeyStore is not supported or
> enabled
> [ 0.081865] kprobes: kprobe jump-optimization is enabled. All kprobes
> are optimized if possible.
> [ 2.828787] HugeTLB: allocation took 2740ms with
> hugepage_allocation_threads=18
> [ 2.828821] HugeTLB: allocating 128512 of page size 2.00 MiB failed.
> Only allocated 128429 hugepages.
> [ 2.828852] HugeTLB: registered 2.00 MiB page size, pre-allocated
> 128429 pages
> [ 2.828855] HugeTLB: 0 KiB vmemmap can be freed for a 2.00 MiB page
> [ 2.828858] HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> [ 2.828862] HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> [ 2.831713] swapper/0: page allocation failure: order:5,
> mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=1-3
> [ 2.831732] CPU: 51 UID: 0 PID: 1 Comm: swapper/0 Not tainted
> 6.19.0-rc1+ #4 VOLUNTARY
> [ 2.831736] Hardware name: hv:phyp pSeries
> [ 2.831738] Call Trace:
> [ 2.831738] [c000001c801b77c0] [c00000000111ae6c]
> dump_stack_lvl+0x8c/0xf0 (unreliable)
> [ 2.831747] [c000001c801b77f0] [c00000000059a024] warn_alloc+0x12c/0x1d8
> [ 2.831752] [c000001c801b7890] [c00000000059a918]
> __alloc_pages_slowpath.constprop.0+0x848/0xa98
> [ 2.831755] [c000001c801b79d0] [c00000000059ae3c]
> __alloc_frozen_pages_noprof+0x2d4/0x3a8
> [ 2.831758] [c000001c801b7a50] [c0000000005eac64]
> alloc_pages_mpol+0x10c/0x1f4
> [ 2.831761] [c000001c801b7ab0] [c0000000005eadac]
> alloc_pages_noprof+0x60/0xe8
> [ 2.831763] [c000001c801b7ad0] [c0000000004d9978]
> mempool_alloc_pages+0x24/0x38
> [ 2.831767] [c000001c801b7af0] [c0000000004da4a0]
> mempool_init_node+0x138/0x1fc
> [ 2.831769] [c000001c801b7b40] [c00000000208844c]
> bio_integrity_initfn+0x40/0x70
> [ 2.831773] [c000001c801b7ba0] [c000000000010c44]
> do_one_initcall+0x60/0x36c
> [ 2.831776] [c000001c801b7c80] [c000000002006b2c]
> do_initcalls+0x12c/0x22c
> [ 2.831779] [c000001c801b7d30] [c000000002006f1c]
> kernel_init_freeable+0x23c/0x390
> [ 2.831781] [c000001c801b7de0] [c000000000011078] kernel_init+0x34/0x26c
> [ 2.831783] [c000001c801b7e50] [c00000000000dd3c]
> ret_from_kernel_user_thread+0x14/0x1c
> [ 2.831786] ---- interrupt: 0 at 0x0
> [ 2.831790] Mem-Info:
> [ 2.831871] active_anon:0 inactive_anon:0 isolated_anon:0
> [ 2.831871] active_file:0 inactive_file:0 isolated_file:0
> [ 2.831871] unevictable:0 dirty:0 writeback:0
> [ 2.831871] slab_reclaimable:82 slab_unreclaimable:2106
> [ 2.831871] mapped:0 shmem:0 pagetables:146
> [ 2.831871] sec_pagetables:0 bounce:0
> [ 2.831871] kernel_misc_reclaimable:0
> [ 2.831871] free:944 free_pcp:3099 free_cma:0
> [ 2.831903] Node 1 active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> mapped:0kB dirty:0kB writeback:0kB shmem:0kB
> shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB kernel_stack:8000kB
> pagetables:4224kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB
> [ 2.831925] Node 2 active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> mapped:0kB dirty:0kB writeback:0kB shmem:0kB
> shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB kernel_stack:7968kB
> pagetables:4096kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB
> [ 2.831937] Node 3 active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
> mapped:0kB dirty:0kB writeback:0kB shmem:0kB
> shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:0kB kernel_stack:2272kB
> pagetables:1024kB sec_pagetables:0kB all_unreclaimable? no Balloon:0kB
> [ 2.831962] Node 1 Normal free:19520kB boost:0kB min:29440kB
> low:144448kB high:259456kB reserved_highatomic:0KB free_highatomic:0KB
> active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB
> present:119537664kB managed:115056960kB mlocked:0kB bounce:0kB
> free_pcp:84992kB local_pcp:2048kB free_cma:0kB
> [ 2.831991] lowmem_reserve[]: 0 0 0
> [ 2.831997] Node 2 Normal free:39424kB boost:2048kB min:32512kB
> low:151360kB high:270208kB reserved_highatomic:0KB free_highatomic:0KB
> active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB
> present:119013376kB managed:118885632kB mlocked:0kB bounce:0kB
> free_pcp:95552kB local_pcp:2816kB free_cma:0kB
> [ 2.832008] lowmem_reserve[]: 0 0 0
> [ 2.832011] Node 3 Normal free:1472kB boost:0kB min:7616kB
> low:37376kB high:67136kB reserved_highatomic:0KB free_highatomic:0KB
> active_anon:0kB inactive_anon:0kB active_file:0kB
> inactive_file:0kB unevictable:0kB writepending:0kB zspages:0kB present:29884416kB
> managed:29784448kB mlocked:0kB bounce:0kB free_pcp:17792kB local_pcp:0kB
> free_cma:0kB
> [ 2.832021] lowmem_reserve[]: 0 0 0
> [ 2.832025] Node 1 Normal: 3*64kB (UME) 3*128kB (ME) 4*256kB (UME)
> 3*512kB (UME) 4*1024kB (ME) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 7232kB
> [ 2.832037] Node 2 Normal: 1*64kB (U) 0*128kB 1*256kB (M) 0*512kB
> 2*1024kB (UM) 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 2368kB
> [ 2.832052] Node 3 Normal: 1*64kB (E) 1*128kB (M) 3*256kB (UME)
> 1*512kB (U) 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB = 1472kB
> [ 2.832068] Node 1 hugepages_total=56043 hugepages_free=56043
> hugepages_surp=0 hugepages_size=2048kB
> [ 2.832078] Node 1 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> [ 2.832086] Node 2 hugepages_total=57915 hugepages_free=57915
> hugepages_surp=0 hugepages_size=2048kB
> [ 2.832093] Node 2 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> [ 2.832102] Node 3 hugepages_total=14471 hugepages_free=14471
> hugepages_surp=0 hugepages_size=2048kB
> [ 2.832111] Node 3 hugepages_total=0 hugepages_free=0
> hugepages_surp=0 hugepages_size=1048576kB
> [ 2.832119] 0 total pagecache pages
> [ 2.832122] 0 pages in swap cache
> [ 2.832127] Free swap = 0kB
> [ 2.832130] Total swap = 0kB
> [ 2.832133] 4194304 pages RAM
> [ 2.832138] 0 pages HighMem/MovableOnly
> [ 2.832141] 73569 pages reserved
> [ 2.832143] 0 pages cma reserved
> [ 2.832146] 0 pages hwpoisoned
> [ 2.832153] Memory cgroup min protection 0kB -- low protection 0kB
> [ 2.832154] Kernel panic - not syncing: bio: can't create integrity
> buf pool
> [ 2.832160] CPU: 51 UID: 0 PID: 1 Comm: swapper/0 Not tainted
> 6.19.0-rc1+ #4 VOLUNTARY
> [ 2.832164] Hardware name: hv:phyp pSeries
> [ 2.832167] Call Trace:
> [ 2.832169] [c000001c801b7a50] [c00000000111aeb8]
> dump_stack_lvl+0xd8/0xf0 (unreliable)
> [ 2.832180] [c000001c801b7a80] [c00000000015d79c] vpanic+0x2c8/0x4b4
> [ 2.832189] [c000001c801b7b20] [c00000000015d9c8] nmi_panic+0x0/0xa0
> [ 2.832197] [c000001c801b7b40] [c000000002088478]
> bio_integrity_initfn+0x6c/0x70
> [ 2.832205] [c000001c801b7ba0] [c000000000010c44]
> do_one_initcall+0x60/0x36c
> [ 2.832213] [c000001c801b7c80] [c000000002006b2c]
> do_initcalls+0x12c/0x22c
> [ 2.832221] [c000001c801b7d30] [c000000002006f1c]
> kernel_init_freeable+0x23c/0x390
> [ 2.832229] [c000001c801b7de0] [c000000000011078] kernel_init+0x34/0x26c
> [ 2.832237] [c000001c801b7e50] [c00000000000dd3c]
> ret_from_kernel_user_thread+0x14/0x1c
> [ 2.832247] ---- interrupt: 0 at 0x0
> [ 2.834181] pstore: backend (nvram) writing error (-1)
> [ 2.835809] Rebooting in 10 seconds..
>
> I agree that reserving hugepages equal to the system RAM is not very
> practical. However, would it be a good idea to make the hugepage
> memory allocator aware of the total system memory and leave some
> memory for the kernel to boot?
IMHO it's the same as with any other system misconfiguration where you
end up with too little usable RAM, like setting mem=, cma=, or
crashkernel= incorrectly.
How are we supposed to know how much memory the kernel and user space will
actually require without easily running into OOM?
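To put rough numbers on how little is left in the report above, here is a
minimal sketch of the arithmetic using only figures from the quoted log
("4194304 pages RAM", "73569 pages reserved", 128429 pre-allocated 2 MiB
hugepages). The 64 KiB base page size is an assumption inferred from the
"alloc=3*65536" percpu line, and the helper name hugepage_budget is
hypothetical. It is not a proposal for a kernel-side heuristic.

```python
def hugepage_budget(ram_pages, reserved_pages, base_page_kib,
                    hugepages, hugepage_kib):
    """Return managed, HugeTLB, and remaining memory in KiB.

    'Remaining' is simply managed memory minus the HugeTLB pool; it does
    not account for what the kernel has already consumed by that point.
    """
    managed_kib = (ram_pages - reserved_pages) * base_page_kib
    hugetlb_kib = hugepages * hugepage_kib
    return {
        "managed_kib": managed_kib,
        "hugetlb_kib": hugetlb_kib,
        "remaining_kib": managed_kib - hugetlb_kib,
    }


# Figures taken from the quoted boot log; base page size assumed to be 64 KiB.
budget = hugepage_budget(
    ram_pages=4194304,
    reserved_pages=73569,
    base_page_kib=64,
    hugepages=128429,
    hugepage_kib=2048,
)

if __name__ == "__main__":
    for name, kib in budget.items():
        print(f"{name}: {kib} KiB ({kib / 1024:.1f} MiB)")
```

Under these assumptions, well under 1 GiB of managed memory sits outside
the HugeTLB pool for the kernel and all of user space, and the order-5
allocation in bio_integrity_initfn() is the first one to fail.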
--
Cheers
David