[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <d38aeb5d-5a46-4e8d-b3c1-f2b4cdb15b04@gmail.com>
Date: Wed, 26 Nov 2025 07:25:48 +0000
From: Usama Arif <usamaarif642@...il.com>
To: Mike Rapoport <rppt@...nel.org>
Cc: Pratyush Yadav <pratyush@...nel.org>,
Changyuan Lyu <changyuanl@...gle.com>, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, anthony.yznaga@...cle.com, arnd@...db.de,
ashish.kalra@....com, benh@...nel.crashing.org, bp@...en8.de,
catalin.marinas@....com, corbet@....net, dave.hansen@...ux.intel.com,
devicetree@...r.kernel.org, dwmw2@...radead.org, ebiederm@...ssion.com,
graf@...zon.com, hpa@...or.com, jgowans@...zon.com,
kexec@...ts.infradead.org, krzk@...nel.org,
linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
linux-mm@...ck.org, luto@...nel.org, mark.rutland@....com, mingo@...hat.com,
pasha.tatashin@...een.com, pbonzini@...hat.com, peterz@...radead.org,
robh@...nel.org, rostedt@...dmis.org, saravanak@...gle.com,
skinsburskii@...ux.microsoft.com, tglx@...utronix.de,
thomas.lendacky@....com, will@...nel.org, x86@...nel.org,
Breno Leitao <leitao@...ian.org>, thevlad@...a.com
Subject: Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for
memory below 1M
On 26/11/2025 06:14, Mike Rapoport wrote:
> On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote:
>>
>>
>> On 25/11/2025 13:50, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote:
>>>> On Mon, Nov 24 2025, Usama Arif wrote:
>>>
>>>>>> --- a/arch/x86/realmode/init.c
>>>>>> +++ b/arch/x86/realmode/init.c
>>>>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
>>>>>> * setup_arch().
>>>>>> */
>>>>>> memblock_reserve(0, SZ_1M);
>>>>>> +
>>>>>> + memblock_clear_kho_scratch(0, SZ_1M);
>>>>>> }
>>>>>>
>>>>>> static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
>>>>>
>>>>> Hello!
>>>>>
>>>>> I am working with Breno who reported that we are seeing the below warning at boot
>>>>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host
>>>>> manually but we are seeing this several times a day inside the fleet.
>>>>>
>>>>> 20:16:33 ------------[ cut here ]------------
>>>>> 20:16:33 WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330
>>>>> 20:16:33 Modules linked in:
>>>>> 20:16:33 CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S 6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE
>>>>> 20:16:33 Tainted: [S]=CPU_OUT_OF_SPEC
>>>>> 20:16:33 RIP: 0010:memblock_add_range+0x316/0x330
>>>>> 20:16:33 Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc
>>>>> 20:16:33 RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000
>>>>> 20:16:33 RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002
>>>>> 20:16:33 RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8
>>>>> 20:16:33 RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101
>>>>> 20:16:33 R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00
>>>>> 20:16:33 R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000
>>>>> 20:16:33 FS: 0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
>>>>> 20:16:33 CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0
>>>>> 20:16:33 Call Trace:
>>>>> 20:16:33 <TASK>
>>>>> 20:16:33 ? __memblock_reserve+0x75/0x80
>>>
>>> Do you have faddr2line for this?
>>>>>> 20:16:33 ? setup_arch+0x30f/0xb10
>>>
>>> And this?
>>>
>>
>>
>> Thanks for this! I think it helped narrow down the problem.
>>
>> The stack is:
>>
>> 20:16:33 ? __memblock_reserve (mm/memblock.c:936)
>> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956)
>> 20:16:33 ? start_kernel (init/main.c:922)
>> 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57)
>> 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231)
>> 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419)
>>
>> This is 6.16 kernel.
>>
>> 20:16:33 ? __memblock_reserve (mm/memblock.c:936)
>> Thats memblock_add_range call in memblock_reserve
>>
>> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956)
>> That is parse_setup_data -> add_early_ima_buffer -> add_early_ima_buffer -> memblock_reserve_kern
>>
>>
>> I put a simple print like below:
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 680d1b6dfea41..cc97ffc0083c7 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
>> }
>>
>> if (data->size) {
>> + pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size);
>> memblock_reserve_kern(data->addr, data->size);
>> ima_kexec_buffer_phys = data->addr;
>> ima_kexec_buffer_size = data->size;
>>
>>
>> and I see (without replicating the warning):
>>
>> [ 0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000
>> ....
>
> So it looks like in cases when the warning reproduces there's something
> that reserves memory overlapping with IMA buffer before
> add_early_ima_buffer().
>
>>
>> [ 0.000348] MEMBLOCK configuration:
>> [ 0.000348] memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b
>> [ 0.000350] memory.cnt = 0x5
>> [ 0.000351] memory[0x0] [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40
>> [ 0.000353] memory[0x1] [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0
>> [ 0.000355] memory[0x2] [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0
>> [ 0.000356] memory[0x3] [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0
>> [ 0.000358] memory[0x4] [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0
>> [ 0.000359] reserved.cnt = 0x7
>>
>>
>> So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap..
>
> It does not matter, they are set on different arrays. RSRV_KERN is set on
> regions in memblock.reserved and KHO_SCRATCH is set on regions in
> memblock.memory.
>
> So dumping memblock.memory is completely irrelevant, you need to check
> memblock.reserved for potential conflicts.
>
>>>>> 20:16:33 ? start_kernel+0x58/0x960
>>>>> 20:16:33 ? x86_64_start_reservations+0x20/0x20
>>>>> 20:16:33 ? x86_64_start_kernel+0x13d/0x140
>>>>> 20:16:33 ? common_startup_64+0x13e/0x140
>>>>> 20:16:33 </TASK>
>>>>> 20:16:33 ---[ end trace 0000000000000000 ]---
>>>>>
>>>>>
>>>>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the
>>>>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see:
>>>
>>> Is it a problem to roll out a kernel that has additional debug printouts as
>>> Breno suggested earlier? I.e.
>>>
>>> if (flags != MEMBLOCK_NONE && flags != rgn->flags) {
>>> pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n",
>>> &rgn->base, &rend);
>>> pr_warn(" Existing region flags: %#x\n", rgn->flags);
>>> pr_warn(" New range flags: %#x\n", flags);
>>> pr_warn(" New range: [%pa-%pa]\n", &base, &end);
>>> WARN_ON_ONCE(1);
>>> }
>>>
>>
>> I can add this, but the only thing is that it might be several weeks between me putting this in the
>> kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding
>> with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to
>> fix that case and deploy it to see if the warnings still show up?
>> I can add these prints as well incase it doesnt fix the problem.
>
> I really don't think that effectively disabling memblock_mark_kho_scratch()
> when KHO is disabled will solve the problem because as I said the flags it
> sets are on different structure than the flags set by
> memblock_reserve_kern().
>
>>> If you have the logs from failing boots up to the point where SLUB reports
>>> about it's initialization, e.g.
>>>
>>> [ 0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
>>>
>>> something there may hint about what's the issue.
>>
>> So the boot doesnt fail, its just giving warnings in the fleet.
>> I have added the dmesg to the end of the mail.
>
> Thanks, unfortunately nothing jumped at me there.
>
>> Does something like this look good? I can try deploying this (although it will take sometime to find out).
>> We can get it upstream as well as that makes backports easier.
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 154f1d73b61f2..257c6f0eee03d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
>> */
>> __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>> {
>> - return memblock_setclr_flag(&memblock.memory, base, size, 1,
>> - MEMBLOCK_KHO_SCRATCH);
>> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
>> + if (is_kho_boot())
>
> Please use
>
> if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH)
>
> instead of indef.
>
> If you send a formal patch with it, I'll take it.
> I'd suggest still deploying additional debug printouts internally.
Thanks! I will add the additional debug prints and [1] in the next release.
It will be sometime before it makes it into production, so I will try to debug
this more using the information you provided above.
[1] https://lore.kernel.org/all/20251126072051.546700-1-usamaarif642@gmail.com/
>
>> + return memblock_setclr_flag(&memblock.memory, base, size, 1,
>> + MEMBLOCK_KHO_SCRATCH);
>> +#else
>> + return 0;
>> +#endif
>> }
>>
>> /**
>> @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>> */
>> __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
>> {
>> - return memblock_setclr_flag(&memblock.memory, base, size, 0,
>> - MEMBLOCK_KHO_SCRATCH);
>> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
>> + if (is_kho_boot())
>> + return memblock_setclr_flag(&memblock.memory, base, size, 0,
>> + MEMBLOCK_KHO_SCRATCH);
>> +#else
>
> If nothing sets the flag _clear is anyway nop, but let's update it as well
> for symmetry.
>
Powered by blists - more mailing lists