linux-kernel - Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M

Message-ID: <aSaazgjKX8PfFDXf@kernel.org>
Date: Wed, 26 Nov 2025 08:14:38 +0200
From: Mike Rapoport <rppt@...nel.org>
To: Usama Arif <usamaarif642@...il.com>
Cc: Pratyush Yadav <pratyush@...nel.org>,
	Changyuan Lyu <changyuanl@...gle.com>, akpm@...ux-foundation.org,
	linux-kernel@...r.kernel.org, anthony.yznaga@...cle.com,
	arnd@...db.de, ashish.kalra@....com, benh@...nel.crashing.org,
	bp@...en8.de, catalin.marinas@....com, corbet@....net,
	dave.hansen@...ux.intel.com, devicetree@...r.kernel.org,
	dwmw2@...radead.org, ebiederm@...ssion.com, graf@...zon.com,
	hpa@...or.com, jgowans@...zon.com, kexec@...ts.infradead.org,
	krzk@...nel.org, linux-arm-kernel@...ts.infradead.org,
	linux-doc@...r.kernel.org, linux-mm@...ck.org, luto@...nel.org,
	mark.rutland@....com, mingo@...hat.com, pasha.tatashin@...een.com,
	pbonzini@...hat.com, peterz@...radead.org, robh@...nel.org,
	rostedt@...dmis.org, saravanak@...gle.com,
	skinsburskii@...ux.microsoft.com, tglx@...utronix.de,
	thomas.lendacky@....com, will@...nel.org, x86@...nel.org,
	Breno Leitao <leitao@...ian.org>, thevlad@...a.com
Subject: Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for
 memory below 1M

On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote:
> 
> 
> On 25/11/2025 13:50, Mike Rapoport wrote:
> > Hi,
> > 
> > On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote:
> >> On Mon, Nov 24 2025, Usama Arif wrote:
> > 
> >>>> --- a/arch/x86/realmode/init.c
> >>>> +++ b/arch/x86/realmode/init.c
> >>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
> >>>>  	 * setup_arch().
> >>>>  	 */
> >>>>  	memblock_reserve(0, SZ_1M);
> >>>> +
> >>>> +	memblock_clear_kho_scratch(0, SZ_1M);
> >>>>  }
> >>>>  
> >>>>  static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
> >>>
> >>> Hello!
> >>>
> >>> I am working with Breno who reported that we are seeing the below warning at boot
> >>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host
> >>> manually but we are seeing this several times a day inside the fleet.
> >>>
> >>>  20:16:33  ------------[ cut here ]------------
> >>>  20:16:33  WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330
> >>>  20:16:33  Modules linked in:
> >>>  20:16:33  CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S                  6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE 
> >>>  20:16:33  Tainted: [S]=CPU_OUT_OF_SPEC
> >>>  20:16:33  RIP: 0010:memblock_add_range+0x316/0x330
> >>>  20:16:33  Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc
> >>>  20:16:33  RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000
> >>>  20:16:33  RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002
> >>>  20:16:33  RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8
> >>>  20:16:33  RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101
> >>>  20:16:33  R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00
> >>>  20:16:33  R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000
> >>>  20:16:33  FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> >>>  20:16:33  CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0
> >>>  20:16:33  Call Trace:
> >>>  20:16:33   <TASK>
> >>>  20:16:33   ? __memblock_reserve+0x75/0x80
> > 
> > Do you have faddr2line for this?
> > >>>  20:16:33   ? setup_arch+0x30f/0xb10
> > 
> > And this?
> > 
> 
> 
> Thanks for this! I think it helped narrow down the problem.
> 
> The stack is:
> 
> 20:16:33 ? __memblock_reserve (mm/memblock.c:936) 
> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) 
> 20:16:33 ? start_kernel (init/main.c:922) 
> 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57) 
> 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231) 
> 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419) 
> 
> This is 6.16 kernel.
> 
> 20:16:33 ? __memblock_reserve (mm/memblock.c:936) 
> That's the memblock_add_range call in memblock_reserve
> 
> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) 
> That is parse_setup_data -> add_early_ima_buffer -> memblock_reserve_kern
> 
> 
> I put a simple print like below:
> 
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 680d1b6dfea41..cc97ffc0083c7 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
>         }
>  
>         if (data->size) {
> +               pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size);
>                 memblock_reserve_kern(data->addr, data->size);
>                 ima_kexec_buffer_phys = data->addr;
>                 ima_kexec_buffer_size = data->size;
> 
> 
> and I see (without replicating the warning):
> 
> [    0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000                                                                                          
> ....

So it looks like in cases when the warning reproduces there's something
that reserves memory overlapping with IMA buffer before
add_early_ima_buffer().

> 
> [    0.000348] MEMBLOCK configuration:
> [    0.000348]  memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b
> [    0.000350]  memory.cnt  = 0x5
> [    0.000351]  memory[0x0]     [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40
> [    0.000353]  memory[0x1]     [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0
> [    0.000355]  memory[0x2]     [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0
> [    0.000356]  memory[0x3]     [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0
> [    0.000358]  memory[0x4]     [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0
> [    0.000359]  reserved.cnt  = 0x7
> 
> 
> So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap..

It does not matter, they are set on different arrays. RSRV_KERN is set on
regions in memblock.reserved and KHO_SCRATCH is set on regions in
memblock.memory.

So dumping memblock.memory is completely irrelevant, you need to check
memblock.reserved for potential conflicts.
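
To illustrate the point, here is a minimal userspace sketch (not kernel
code) of the check that fires in memblock_add_range(). The toy_* names are
hypothetical, the flag values are assumed to match enum memblock_flags in
6.16, and the overlapping early reservation at 0x9d000 is invented for the
example. Only the array that is passed in gets consulted:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's flag values; treat them as illustrative. */
enum toy_flags {
	TOY_NONE        = 0x0,
	TOY_RSRV_KERN   = 0x20,	/* set on regions in the "reserved" array */
	TOY_KHO_SCRATCH = 0x40,	/* set on regions in the "memory" array */
};

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/* Hypothetical stand-in for one memblock array (memory or reserved). */
struct toy_type {
	struct toy_region regions[8];
	int cnt;
};

/*
 * Returns true if adding [base, base + size) with @flags to @type would
 * trip the same condition as the warning: an overlapping region exists
 * in *this* array and its flags differ from the new, non-NONE flags.
 */
static bool toy_flag_mismatch(const struct toy_type *type, uint64_t base,
			      uint64_t size, unsigned int flags)
{
	uint64_t end = base + size;

	for (int i = 0; i < type->cnt; i++) {
		const struct toy_region *rgn = &type->regions[i];
		uint64_t rend = rgn->base + rgn->size;

		if (rgn->base >= end || rend <= base)
			continue;	/* no overlap with this region */
		if (flags != TOY_NONE && flags != rgn->flags)
			return true;
	}
	return false;
}

/* memory[0] as in the dump above: [0x1000-0x9ffff], flags 0x40. */
static const struct toy_type toy_memory = {
	.regions = { { 0x1000, 0x9f000, TOY_KHO_SCRATCH } },
	.cnt = 1,
};

/* Hypothetical earlier reservation that overlaps the IMA buffer. */
static const struct toy_type toy_reserved_overlap = {
	.regions = { { 0x9d000, 0x2000, TOY_NONE } },
	.cnt = 1,
};

/* An empty reserved array: nothing was reserved before the IMA buffer. */
static const struct toy_type toy_reserved_empty = { .cnt = 0 };
```

With the IMA buffer from the printout (0x9e000, size 0x1000) and
TOY_RSRV_KERN, toy_flag_mismatch() is false against toy_reserved_empty
even though toy_memory covers the same range with TOY_KHO_SCRATCH, and
true against toy_reserved_overlap.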
 
> >>>  20:16:33   ? start_kernel+0x58/0x960
> >>>  20:16:33   ? x86_64_start_reservations+0x20/0x20
> >>>  20:16:33   ? x86_64_start_kernel+0x13d/0x140
> >>>  20:16:33   ? common_startup_64+0x13e/0x140
> >>>  20:16:33   </TASK>
> >>>  20:16:33  ---[ end trace 0000000000000000 ]--- 
> >>>
> >>>
> >>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the
> >>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see:
> > 
> > Is it a problem to roll out a kernel that has additional debug printouts as
> > Breno suggested earlier? I.e.
> > 
> > 	if (flags != MEMBLOCK_NONE && flags != rgn->flags) {
> > 		pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n",
> > 			&rgn->base, &rend);
> > 		pr_warn("  Existing region flags: %#x\n", rgn->flags);
> > 		pr_warn("  New range flags: %#x\n", flags);
> > 		pr_warn("  New range: [%pa-%pa]\n", &base, &end);
> > 		WARN_ON_ONCE(1);
> > 	}
> > 
> 
> I can add this, but the only thing is that it might be several weeks between me putting this in the
> kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding
> with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to
> fix that case and deploy it to see if the warnings still show up?
> I can add these prints as well in case it doesn't fix the problem.
 
I really don't think that effectively disabling memblock_mark_kho_scratch()
when KHO is disabled will solve the problem because, as I said, the flags it
sets are on a different structure than the flags set by
memblock_reserve_kern().

> > If you have the logs from failing boots up to the point where SLUB reports
> > about its initialization, e.g.
> > 
> > [    0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
> > 
> > something there may hint about what's the issue. 
> 
> So the boot doesn't fail, it's just giving warnings in the fleet.
> I have added the dmesg to the end of the mail.
 
Thanks, unfortunately nothing jumped out at me there.

> Does something like this look good? I can try deploying this (although it will take some time to find out).
> We can get it upstream as well as that makes backports easier.
> 
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f2..257c6f0eee03d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
>   */
>  __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>  {
> -       return memblock_setclr_flag(&memblock.memory, base, size, 1,
> -                                   MEMBLOCK_KHO_SCRATCH);
> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
> +       if (is_kho_boot())

Please use

	if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH))

instead of ifdef.

If you send a formal patch with it, I'll take it.
I'd suggest still deploying additional debug printouts internally.

> +               return memblock_setclr_flag(&memblock.memory, base, size, 1,
> +                                           MEMBLOCK_KHO_SCRATCH);
> +#else
> +       return 0;
> +#endif
>  }
>  
>  /**
> @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>   */
>  __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
>  {
> -       return memblock_setclr_flag(&memblock.memory, base, size, 0,
> -                                   MEMBLOCK_KHO_SCRATCH);
> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
> +       if (is_kho_boot())
> +               return memblock_setclr_flag(&memblock.memory, base, size, 0,
> +                                           MEMBLOCK_KHO_SCRATCH);
> +#else

If nothing sets the flag, _clear is a nop anyway, but let's update it as
well for symmetry.
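
For reference, a rough userspace sketch of the shape the IS_ENABLED()
variant could take. Everything prefixed toy_/TOY_ is a hypothetical
stand-in: TOY_IS_ENABLED() is a simplified imitation of the kernel macro
that only handles options defined to 1, toy_is_kho_boot() stands in for
is_kho_boot(), and combining the two checks in one condition is an
assumption about the final patch, not something settled in this thread.
Unlike the #ifdef form, every path has an explicit return:

```c
#include <stdbool.h>

/* Simplified imitation of IS_ENABLED(): 1 if the option is defined to 1. */
#define TOY_ARG_PLACEHOLDER_1 0,
#define toy_take_second_arg(ignored, val, ...) val
#define toy_is_defined_3(arg1_or_junk) \
	toy_take_second_arg(arg1_or_junk 1, 0, 0)
#define toy_is_defined_2(val) toy_is_defined_3(TOY_ARG_PLACEHOLDER_##val)
#define TOY_IS_ENABLED(option) toy_is_defined_2(option)

/* Hypothetical config symbol; leave it undefined to model "disabled". */
#define TOY_CONFIG_KHO_SCRATCH 1

#define TOY_KHO_SCRATCH 0x40u	/* illustrative flag value */

/* Hypothetical stand-in for the state behind is_kho_boot(). */
static bool toy_kho_boot;

static bool toy_is_kho_boot(void)
{
	return toy_kho_boot;
}

/* Toy version of a set/clear helper operating on a single flags word. */
static int toy_setclr_flag(unsigned int *flags, int set, unsigned int flag)
{
	if (set)
		*flags |= flag;
	else
		*flags &= ~flag;
	return 0;
}

static int toy_mark_kho_scratch(unsigned int *flags)
{
	/* The condition folds to a constant when the option is off. */
	if (TOY_IS_ENABLED(TOY_CONFIG_KHO_SCRATCH) && toy_is_kho_boot())
		return toy_setclr_flag(flags, 1, TOY_KHO_SCRATCH);

	return 0;
}

static int toy_clear_kho_scratch(unsigned int *flags)
{
	/* Kept symmetric with the mark helper. */
	if (TOY_IS_ENABLED(TOY_CONFIG_KHO_SCRATCH) && toy_is_kho_boot())
		return toy_setclr_flag(flags, 0, TOY_KHO_SCRATCH);

	return 0;
}
```

Both branches stay visible to the compiler in this form, so the code is
still type-checked when the option is off.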

-- 
Sincerely yours,
Mike.