Message-ID: <aSaazgjKX8PfFDXf@kernel.org>
Date: Wed, 26 Nov 2025 08:14:38 +0200
From: Mike Rapoport <rppt@...nel.org>
To: Usama Arif <usamaarif642@...il.com>
Cc: Pratyush Yadav <pratyush@...nel.org>,
Changyuan Lyu <changyuanl@...gle.com>, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, anthony.yznaga@...cle.com,
arnd@...db.de, ashish.kalra@....com, benh@...nel.crashing.org,
bp@...en8.de, catalin.marinas@....com, corbet@....net,
dave.hansen@...ux.intel.com, devicetree@...r.kernel.org,
dwmw2@...radead.org, ebiederm@...ssion.com, graf@...zon.com,
hpa@...or.com, jgowans@...zon.com, kexec@...ts.infradead.org,
krzk@...nel.org, linux-arm-kernel@...ts.infradead.org,
linux-doc@...r.kernel.org, linux-mm@...ck.org, luto@...nel.org,
mark.rutland@....com, mingo@...hat.com, pasha.tatashin@...een.com,
pbonzini@...hat.com, peterz@...radead.org, robh@...nel.org,
rostedt@...dmis.org, saravanak@...gle.com,
skinsburskii@...ux.microsoft.com, tglx@...utronix.de,
thomas.lendacky@....com, will@...nel.org, x86@...nel.org,
Breno Leitao <leitao@...ian.org>, thevlad@...a.com
Subject: Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for
memory below 1M
On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote:
>
>
> On 25/11/2025 13:50, Mike Rapoport wrote:
> > Hi,
> >
> > On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote:
> >> On Mon, Nov 24 2025, Usama Arif wrote:
> >
> >>>> --- a/arch/x86/realmode/init.c
> >>>> +++ b/arch/x86/realmode/init.c
> >>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
> >>>> * setup_arch().
> >>>> */
> >>>> memblock_reserve(0, SZ_1M);
> >>>> +
> >>>> + memblock_clear_kho_scratch(0, SZ_1M);
> >>>> }
> >>>>
> >>>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
> >>>
> >>> Hello!
> >>>
> >>> I am working with Breno who reported that we are seeing the below warning at boot
> >>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host
> >>> manually but we are seeing this several times a day inside the fleet.
> >>>
> >>> 20:16:33 ------------[ cut here ]------------
> >>> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330
> >>> 20:16:33 Modules linked in:
> >>> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE
> >>> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC
> >>> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330
> >>> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc
> >>> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000
> >>> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002
> >>> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8
> >>> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101
> >>> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00
> >>> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000
> >>> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
> >>> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0
> >>> 20:16:33 Call Trace:
> >>> 20:16:33 <TASK>
> >>> 20:16:33 ? __memblock_reserve+0x75/0x80
> >
> > Do you have faddr2line for this?
> > >>> 20:16:33 ? setup_arch+0x30f/0xb10
> >
> > And this?
> >
>
>
> Thanks for this! I think it helped narrow down the problem.
>
> The stack is:
>
> 20:16:33 ? __memblock_reserve (mm/memblock.c:936)
> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956)
> 20:16:33 ? start_kernel (init/main.c:922)
> 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57)
> 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231)
> 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419)
>
> This is 6.16 kernel.
>
> 20:16:33 ? __memblock_reserve (mm/memblock.c:936)
> That's the memblock_add_range call in memblock_reserve
>
> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956)
> That is parse_setup_data -> add_early_ima_buffer -> memblock_reserve_kern
>
>
> I put a simple print like below:
>
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 680d1b6dfea41..cc97ffc0083c7 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
> }
>
> if (data->size) {
> + pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size);
> memblock_reserve_kern(data->addr, data->size);
> ima_kexec_buffer_phys = data->addr;
> ima_kexec_buffer_size = data->size;
>
>
> and I see (without replicating the warning):
>
> [ 0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000
> ....
So it looks like, in the cases where the warning reproduces, something
reserves memory overlapping the IMA buffer before
add_early_ima_buffer() runs.
>
> [ 0.000348] MEMBLOCK configuration:
> [ 0.000348] memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b
> [ 0.000350] memory.cnt = 0x5
> [ 0.000351] memory[0x0] [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40
> [ 0.000353] memory[0x1] [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0
> [ 0.000355] memory[0x2] [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0
> [ 0.000356] memory[0x3] [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0
> [ 0.000358] memory[0x4] [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0
> [ 0.000359] reserved.cnt = 0x7
>
>
> So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap..
It does not matter, they are set on different arrays. RSRV_KERN is set on
regions in memblock.reserved and KHO_SCRATCH is set on regions in
memblock.memory.
So dumping memblock.memory is completely irrelevant, you need to check
memblock.reserved for potential conflicts.
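To illustrate why the memblock.memory dump cannot explain the warning, here
is a minimal user-space sketch, not kernel code. It assumes the flag check
quoted elsewhere in this thread (flags != MEMBLOCK_NONE && flags != rgn->flags)
is applied only to regions of the array being modified. All toy_* names and
the flag values are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for memblock flag bits; values are illustrative. */
enum toy_flags {
	TOY_NONE        = 0x0,
	TOY_RSRV_KERN   = 0x20,
	TOY_KHO_SCRATCH = 0x40,
};

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned flags;
};

/* One array of regions, modelling either "memory" or "reserved". */
struct toy_type {
	struct toy_region regions[8];
	size_t cnt;
};

/* Append a region; no merging or sorting, this is only a model. */
int toy_add(struct toy_type *t, uint64_t base, uint64_t size, unsigned flags)
{
	assert(t->cnt < 8);
	t->regions[t->cnt++] = (struct toy_region){ base, size, flags };
	return 0;
}

/*
 * Return 1 if adding [base, base + size) with @flags to @t would overlap an
 * existing region of @t that carries different flags, 0 otherwise.
 * Only regions of @t are inspected, never those of another array.
 */
int toy_flag_conflict(const struct toy_type *t, uint64_t base, uint64_t size,
		      unsigned flags)
{
	uint64_t end = base + size;

	for (size_t i = 0; i < t->cnt; i++) {
		const struct toy_region *r = &t->regions[i];
		uint64_t rend = r->base + r->size;

		if (r->base >= end || rend <= base)
			continue;	/* no overlap with this region */
		if (flags != TOY_NONE && flags != r->flags)
			return 1;
	}
	return 0;
}
```

In this model, a KHO_SCRATCH region sitting in the "memory" array has no
effect on a RSRV_KERN reservation; only an earlier overlapping entry in the
"reserved" array with different flags makes toy_flag_conflict() return 1.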
> >>> 20:16:33 ? start_kernel+0x58/0x960
> >>> 20:16:33 ? x86_64_start_reservations+0x20/0x20
> >>> 20:16:33 ? x86_64_start_kernel+0x13d/0x140
> >>> 20:16:33 ? common_startup_64+0x13e/0x140
> >>> 20:16:33 </TASK>
> >>> 20:16:33 ---[ end trace 0000000000000000 ]---
> >>>
> >>>
> >>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the
> >>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see:
> >
> > Is it a problem to roll out a kernel that has additional debug printouts as
> > Breno suggested earlier? I.e.
> >
> > if (flags != MEMBLOCK_NONE && flags != rgn->flags) {
> > pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n",
> > &rgn->base, &rend);
> > pr_warn(" Existing region flags: %#x\n", rgn->flags);
> > pr_warn(" New range flags: %#x\n", flags);
> > pr_warn(" New range: [%pa-%pa]\n", &base, &end);
> > WARN_ON_ONCE(1);
> > }
> >
>
> I can add this, but the only thing is that it might be several weeks between me putting this in the
> kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding
> with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to
> fix that case and deploy it to see if the warnings still show up?
> I can add these prints as well in case it doesn't fix the problem.
I really don't think that effectively disabling memblock_mark_kho_scratch()
when KHO is disabled will solve the problem because, as I said, the flags it
sets are on a different structure than the flags set by
memblock_reserve_kern().
> > If you have the logs from failing boots up to the point where SLUB reports
> > about its initialization, e.g.
> >
> > [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
> >
> > something there may hint about what's the issue.
>
> So the boot doesn't fail, it's just giving warnings in the fleet.
> I have added the dmesg to the end of the mail.
Thanks, unfortunately nothing jumped out at me there.
> Does something like this look good? I can try deploying this (although it will take some time to find out).
> We can get it upstream as well as that makes backports easier.
>
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 154f1d73b61f2..257c6f0eee03d 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
> */
> __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
> {
> - return memblock_setclr_flag(&memblock.memory, base, size, 1,
> - MEMBLOCK_KHO_SCRATCH);
> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
> + if (is_kho_boot())
Please use
	if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH))
instead of ifdef.
If you send a formal patch with it, I'll take it.
I'd suggest still deploying additional debug printouts internally.
> + return memblock_setclr_flag(&memblock.memory, base, size, 1,
> + MEMBLOCK_KHO_SCRATCH);
> +#else
> + return 0;
> +#endif
> }
>
> /**
> @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
> */
> __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
> {
> - return memblock_setclr_flag(&memblock.memory, base, size, 0,
> - MEMBLOCK_KHO_SCRATCH);
> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
> + if (is_kho_boot())
> + return memblock_setclr_flag(&memblock.memory, base, size, 0,
> + MEMBLOCK_KHO_SCRATCH);
> +#else
If nothing sets the flag, _clear is a no-op anyway, but let's update it as
well for symmetry.
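For reference, a rough user-space sketch of the IS_ENABLED() style suggested
above, as opposed to #ifdef. This is not the formal patch: the TOY_* macros
are a simplified imitation of the kernel's IS_ENABLED() (built-in options
only), and toy_kho_boot, toy_mark_kho_scratch() and CONFIG_TOY_KHO_SCRATCH
are hypothetical names standing in for is_kho_boot(),
memblock_mark_kho_scratch() and the real config option.

```c
#include <assert.h>
#include <stddef.h>

/* Pretend the option is enabled, as a kernel build would via autoconf.h. */
#define CONFIG_TOY_KHO_SCRATCH 1

/*
 * Simplified imitation of IS_ENABLED(): expands to 1 if the option is
 * defined to 1 and to 0 if it is not defined at all.
 */
#define TOY_ARG_PLACEHOLDER_1 0,
#define TOY_TAKE_SECOND(ignored, val, ...) val
#define TOY_IS_DEFINED_3(arg1_or_junk) TOY_TAKE_SECOND(arg1_or_junk 1, 0, ~)
#define TOY_IS_DEFINED_2(val) TOY_IS_DEFINED_3(TOY_ARG_PLACEHOLDER_##val)
#define TOY_IS_DEFINED_1(x) TOY_IS_DEFINED_2(x)
#define TOY_IS_ENABLED(option) TOY_IS_DEFINED_1(option)

#define TOY_FLAG_KHO_SCRATCH 0x40u

/* Hypothetical stand-in for is_kho_boot(). */
int toy_kho_boot;

/*
 * Set the scratch flag only when the option is enabled and this is a KHO
 * boot. Both branches are always compiled, so there is no #else path to
 * forget, and every path returns a value.
 */
int toy_mark_kho_scratch(unsigned *region_flags)
{
	assert(region_flags != NULL);

	if (TOY_IS_ENABLED(CONFIG_TOY_KHO_SCRATCH) && toy_kho_boot)
		*region_flags |= TOY_FLAG_KHO_SCRATCH;

	return 0;
}
```

Since the condition is an ordinary C expression, the compiler still checks
the guarded code when the option is off and can drop it as dead code.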
--
Sincerely yours,
Mike.