linux-kernel - Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for memory below 1M

Message-ID: <d38aeb5d-5a46-4e8d-b3c1-f2b4cdb15b04@gmail.com>
Date: Wed, 26 Nov 2025 07:25:48 +0000
From: Usama Arif <usamaarif642@...il.com>
To: Mike Rapoport <rppt@...nel.org>
Cc: Pratyush Yadav <pratyush@...nel.org>,
 Changyuan Lyu <changyuanl@...gle.com>, akpm@...ux-foundation.org,
 linux-kernel@...r.kernel.org, anthony.yznaga@...cle.com, arnd@...db.de,
 ashish.kalra@....com, benh@...nel.crashing.org, bp@...en8.de,
 catalin.marinas@....com, corbet@....net, dave.hansen@...ux.intel.com,
 devicetree@...r.kernel.org, dwmw2@...radead.org, ebiederm@...ssion.com,
 graf@...zon.com, hpa@...or.com, jgowans@...zon.com,
 kexec@...ts.infradead.org, krzk@...nel.org,
 linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
 linux-mm@...ck.org, luto@...nel.org, mark.rutland@....com, mingo@...hat.com,
 pasha.tatashin@...een.com, pbonzini@...hat.com, peterz@...radead.org,
 robh@...nel.org, rostedt@...dmis.org, saravanak@...gle.com,
 skinsburskii@...ux.microsoft.com, tglx@...utronix.de,
 thomas.lendacky@....com, will@...nel.org, x86@...nel.org,
 Breno Leitao <leitao@...ian.org>, thevlad@...a.com
Subject: Re: [PATCH v8 12/17] x86/e820: temporarily enable KHO scratch for
 memory below 1M



On 26/11/2025 06:14, Mike Rapoport wrote:
> On Tue, Nov 25, 2025 at 06:47:15PM +0000, Usama Arif wrote:
>>
>>
>> On 25/11/2025 13:50, Mike Rapoport wrote:
>>> Hi,
>>>
>>> On Tue, Nov 25, 2025 at 02:15:34PM +0100, Pratyush Yadav wrote:
>>>> On Mon, Nov 24 2025, Usama Arif wrote:
>>>
>>>>>> --- a/arch/x86/realmode/init.c
>>>>>> +++ b/arch/x86/realmode/init.c
>>>>>> @@ -65,6 +65,8 @@ void __init reserve_real_mode(void)
>>>>>>  	 * setup_arch().
>>>>>>  	 */
>>>>>>  	memblock_reserve(0, SZ_1M);
>>>>>> +
>>>>>> +	memblock_clear_kho_scratch(0, SZ_1M);
>>>>>>  }
>>>>>>  
>>>>>>  static void __init sme_sev_setup_real_mode(struct trampoline_header *th)
>>>>>
>>>>> Hello!
>>>>>
>>>>> I am working with Breno who reported that we are seeing the below warning at boot
>>>>> when rolling out 6.16 in Meta fleet. It is difficult to reproduce on a single host
>>>>> manually but we are seeing this several times a day inside the fleet.
>>>>>
>>>>>  20:16:33  ------------[ cut here ]------------
>>>>>  20:16:33  WARNING: CPU: 0 PID: 0 at mm/memblock.c:668 memblock_add_range+0x316/0x330
>>>>>  20:16:33  Modules linked in:
>>>>>  20:16:33  CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G S                  6.16.1-0_fbk0_0_gc0739ee5037a #1 NONE 
>>>>>  20:16:33  Tainted: [S]=CPU_OUT_OF_SPEC
>>>>>  20:16:33  RIP: 0010:memblock_add_range+0x316/0x330
>>>>>  20:16:33  Code: ff ff ff 89 5c 24 08 41 ff c5 44 89 6c 24 10 48 63 74 24 08 48 63 54 24 10 e8 26 0c 00 00 e9 41 ff ff ff 0f 0b e9 af fd ff ff <0f> 0b e9 b7 fd ff ff 0f 0b 0f 0b cc cc cc cc cc cc cc cc cc cc cc
>>>>>  20:16:33  RSP: 0000:ffffffff83403dd8 EFLAGS: 00010083 ORIG_RAX: 0000000000000000
>>>>>  20:16:33  RAX: ffffffff8476ff90 RBX: 0000000000001c00 RCX: 0000000000000002
>>>>>  20:16:33  RDX: 00000000ffffffff RSI: 0000000000000000 RDI: ffffffff83bad4d8
>>>>>  20:16:33  RBP: 000000000009f000 R08: 0000000000000020 R09: 8000000000097101
>>>>>  20:16:33  R10: ffffffffff2004b0 R11: 203a6d6f646e6172 R12: 000000000009ec00
>>>>>  20:16:33  R13: 0000000000000002 R14: 0000000000100000 R15: 000000000009d000
>>>>>  20:16:33  FS:  0000000000000000(0000) GS:0000000000000000(0000) knlGS:0000000000000000
>>>>>  20:16:33  CR2: ffff888065413ff8 CR3: 00000000663b7000 CR4: 00000000000000b0
>>>>>  20:16:33  Call Trace:
>>>>>  20:16:33   <TASK>
>>>>>  20:16:33   ? __memblock_reserve+0x75/0x80
>>>
>>> Do you have faddr2line for this?
>>>>>>  20:16:33   ? setup_arch+0x30f/0xb10
>>>
>>> And this?
>>>
>>
>>
>> Thanks for this! I think it helped narrow down the problem.
>>
>> The stack is:
>>
>> 20:16:33 ? __memblock_reserve (mm/memblock.c:936) 
>> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) 
>> 20:16:33 ? start_kernel (init/main.c:922) 
>> 20:16:33 ? x86_64_start_reservations (arch/x86/kernel/ebda.c:57) 
>> 20:16:33 ? x86_64_start_kernel (arch/x86/kernel/head64.c:231) 
>> 20:16:33 ? common_startup_64 (arch/x86/kernel/head_64.S:419) 
>>
>> This is 6.16 kernel.
>>
>> 20:16:33 ? __memblock_reserve (mm/memblock.c:936) 
>> That's the memblock_add_range call in memblock_reserve
>>
>> 20:16:33 ? setup_arch (arch/x86/kernel/setup.c:413 arch/x86/kernel/setup.c:499 arch/x86/kernel/setup.c:956) 
>> That is parse_setup_data -> add_early_ima_buffer -> memblock_reserve_kern
>>
>>
>> I put a simple print like below:
>>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 680d1b6dfea41..cc97ffc0083c7 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -409,6 +409,7 @@ static void __init add_early_ima_buffer(u64 phys_addr)
>>         }
>>  
>>         if (data->size) {
>> +               pr_err("PPP %s %s %d data->addr %llx, data->size %llx \n", __FILE__, __func__, __LINE__, data->addr, data->size);
>>                 memblock_reserve_kern(data->addr, data->size);
>>                 ima_kexec_buffer_phys = data->addr;
>>                 ima_kexec_buffer_size = data->size;
>>
>>
>> and I see (without replicating the warning):
>>
>> [    0.000000] PPP arch/x86/kernel/setup.c add_early_ima_buffer 412 data->addr 9e000, data->size 1000                                                                                          
>> ....
> 
> So it looks like in cases when the warning reproduces there's something
> that reserves memory overlapping with IMA buffer before
> add_early_ima_buffer().
> 
>>
>> [    0.000348] MEMBLOCK configuration:
>> [    0.000348]  memory size = 0x0000003fea329ff0 reserved size = 0x00000000050c969b
>> [    0.000350]  memory.cnt  = 0x5
>> [    0.000351]  memory[0x0]     [0x0000000000001000-0x000000000009ffff], 0x000000000009f000 bytes flags: 0x40
>> [    0.000353]  memory[0x1]     [0x0000000000100000-0x0000000067c65fff], 0x0000000067b66000 bytes flags: 0x0
>> [    0.000355]  memory[0x2]     [0x000000006d8db000-0x000000006fffffff], 0x0000000002725000 bytes flags: 0x0
>> [    0.000356]  memory[0x3]     [0x0000000100000000-0x000000407fff8fff], 0x0000003f7fff9000 bytes flags: 0x0
>> [    0.000358]  memory[0x4]     [0x000000407fffa000-0x000000407fffffff], 0x0000000000006000 bytes flags: 0x0
>> [    0.000359]  reserved.cnt  = 0x7
>>
>>
>> So MEMBLOCK_RSRV_KERN and MEMBLOCK_KHO_SCRATCH seem to overlap..
> 
> It does not matter, they are set on different arrays. RSRV_KERN is set on
> regions in memblock.reserved and KHO_SCRATCH is set on regions in
> memblock.memory.
> 
> So dumping memblock.memory is completely irrelevant, you need to check
> memblock.reserved for potential conflicts.
>  
>>>>>  20:16:33   ? start_kernel+0x58/0x960
>>>>>  20:16:33   ? x86_64_start_reservations+0x20/0x20
>>>>>  20:16:33   ? x86_64_start_kernel+0x13d/0x140
>>>>>  20:16:33   ? common_startup_64+0x13e/0x140
>>>>>  20:16:33   </TASK>
>>>>>  20:16:33  ---[ end trace 0000000000000000 ]--- 
>>>>>
>>>>>
>>>>> Rolling out with memblock=debug is not really an option in a large scale fleet due to the
>>>>> time added to boot. But I did try on one of the hosts (without reproducing the issue) and I see:
>>>
>>> Is it a problem to roll out a kernel that has additional debug printouts as
>>> Breno suggested earlier? I.e.
>>>
>>> 	if (flags != MEMBLOCK_NONE && flags != rgn->flags) {
>>> 		pr_warn("memblock: Flag mismatch at region [%pa-%pa]\n",
>>> 			&rgn->base, &rend);
>>> 		pr_warn("  Existing region flags: %#x\n", rgn->flags);
>>> 		pr_warn("  New range flags: %#x\n", flags);
>>> 		pr_warn("  New range: [%pa-%pa]\n", &base, &end);
>>> 		WARN_ON_ONCE(1);
>>> 	}
>>>
>>
>> I can add this, but the only thing is that it might be several weeks between me putting this in the
>> kernel and that kernel being deployed to enough machines that it starts to show up. I think the IMA coinciding
>> with memblock_mark_kho_scratch in e820__memblock_setup could be the reason for the warning. It might be better to
>> fix that case and deploy it to see if the warnings still show up?
>> I can add these prints as well in case it doesn't fix the problem.
>  
> I really don't think that effectively disabling memblock_mark_kho_scratch()
> when KHO is disabled will solve the problem because as I said the flags it
> sets are on a different structure than the flags set by
> memblock_reserve_kern().
> 
>>> If you have the logs from failing boots up to the point where SLUB reports
>>> about its initialization, e.g.
>>>
>>> [    0.134377] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
>>>
>>> something there may hint about what's the issue. 
>>
>> So the boot doesn't fail, it's just giving warnings in the fleet.
>> I have added the dmesg to the end of the mail.
>  
> Thanks, unfortunately nothing jumped at me there.
> 
>> Does something like this look good? I can try deploying this (although it will take some time to find out).
>> We can get it upstream as well as that makes backports easier.
>>
>> diff --git a/mm/memblock.c b/mm/memblock.c
>> index 154f1d73b61f2..257c6f0eee03d 100644
>> --- a/mm/memblock.c
>> +++ b/mm/memblock.c
>> @@ -1119,8 +1119,13 @@ int __init_memblock memblock_reserved_mark_noinit(phys_addr_t base, phys_addr_t
>>   */
>>  __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>>  {
>> -       return memblock_setclr_flag(&memblock.memory, base, size, 1,
>> -                                   MEMBLOCK_KHO_SCRATCH);
>> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
>> +       if (is_kho_boot())
> 
> Please use
> 
> 	if (IS_ENABLED(CONFIG_MEMBLOCK_KHO_SCRATCH))
> 
> instead of ifdef.
> 
> If you send a formal patch with it, I'll take it.
> I'd suggest still deploying additional debug printouts internally.


Thanks! I will add the additional debug prints and [1] in the next release.
It will be some time before it makes it into production, so I will try to debug
this more using the information you provided above.

[1] https://lore.kernel.org/all/20251126072051.546700-1-usamaarif642@gmail.com/
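
For debugging purposes, here is a minimal user-space sketch of the point made
above: memblock.memory and memblock.reserved are separate arrays with separate
flags, so a mismatch can only come from two overlapping ranges added to the
same array. It is not kernel code. Every model_* name is hypothetical, the flag
values are illustrative (0x40 matches the flags shown for memory[0x0] in the
dump), and the check only mirrors the condition quoted in the debug printout.
The real memblock_add_range() also merges and splits regions, which this model
skips.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative flag values, assumed for this model only. */
enum {
	MODEL_NONE        = 0x0,
	MODEL_RSRV_KERN   = 0x20,
	MODEL_KHO_SCRATCH = 0x40,
};

#define MODEL_MAX_REGIONS 8

struct model_region {
	unsigned long long base;
	unsigned long long size;
	unsigned int flags;
};

/* One instance per array, like memblock.memory and memblock.reserved. */
struct model_type {
	const char *name;
	struct model_region regions[MODEL_MAX_REGIONS];
	int cnt;
};

struct model_type model_memory = { .name = "memory" };
struct model_type model_reserved = { .name = "reserved" };

/*
 * Record [base, base + size) in one array and return how many existing
 * regions of that same array overlap it with different flags. Regions
 * in the other array are never looked at.
 */
int model_add_range(struct model_type *type, unsigned long long base,
		    unsigned long long size, unsigned int flags)
{
	unsigned long long end = base + size;
	int mismatches = 0;

	for (int i = 0; i < type->cnt; i++) {
		const struct model_region *rgn = &type->regions[i];
		unsigned long long rend = rgn->base + rgn->size;

		/* Skip regions that do not overlap the new range. */
		if (rgn->base >= end || base >= rend)
			continue;

		/* Same condition as the debug check quoted in the thread. */
		if (flags != MODEL_NONE && flags != rgn->flags) {
			printf("%s: flag mismatch at [%#llx-%#llx]: existing %#x, new %#x\n",
			       type->name, rgn->base, rend - 1, rgn->flags, flags);
			mismatches++;
		}
	}

	assert(type->cnt < MODEL_MAX_REGIONS);
	type->regions[type->cnt++] = (struct model_region){ base, size, flags };
	return mismatches;
}
```

With low memory marked MODEL_KHO_SCRATCH in model_memory, adding
[0x9e000-0x9efff] with MODEL_RSRV_KERN to an empty model_reserved reports no
mismatch. It reports one only when model_reserved already holds an overlapping
range with different flags, for example [0x9d000-0x9efff] added with MODEL_NONE.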

> 
>> +               return memblock_setclr_flag(&memblock.memory, base, size, 1,
>> +                                           MEMBLOCK_KHO_SCRATCH);
>> +#else
>> +       return 0;
>> +#endif
>>  }
>>  
>>  /**
>> @@ -1133,8 +1138,13 @@ __init int memblock_mark_kho_scratch(phys_addr_t base, phys_addr_t size)
>>   */
>>  __init int memblock_clear_kho_scratch(phys_addr_t base, phys_addr_t size)
>>  {
>> -       return memblock_setclr_flag(&memblock.memory, base, size, 0,
>> -                                   MEMBLOCK_KHO_SCRATCH);
>> +#ifdef CONFIG_MEMBLOCK_KHO_SCRATCH
>> +       if (is_kho_boot())
>> +               return memblock_setclr_flag(&memblock.memory, base, size, 0,
>> +                                           MEMBLOCK_KHO_SCRATCH);
>> +#else
> 
> If nothing sets the flag, _clear is a no-op anyway, but let's update it as well
> for symmetry.
>
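
A sketch of what the IS_ENABLED() form could look like for both helpers,
written as a user-space model so that it compiles on its own.
MODEL_IS_ENABLED() is a simplified stand-in that needs the option defined as 0
or 1 (the kernel macro also handles undefined options), and all model_* names
and the single flags word are hypothetical. The real helpers would keep calling
memblock_setclr_flag() on memblock.memory.

```c
#include <stdbool.h>

/* Stand-in for a Kconfig option; assumed to be defined as 0 or 1. */
#define MODEL_CONFIG_KHO_SCRATCH 1

/* Simplified stand-in for the kernel's IS_ENABLED(). */
#define MODEL_IS_ENABLED(option) (option)

#define MODEL_FLAG_KHO_SCRATCH 0x40u

/* Hypothetical state: whether this boot is a KHO boot, and one flags word. */
bool model_kho_boot;
unsigned int model_region_flags;

bool model_is_kho_boot(void)
{
	return model_kho_boot;
}

/* Set or clear a flag in the single modelled region; always succeeds. */
int model_setclr_flag(int set, unsigned int flag)
{
	if (set)
		model_region_flags |= flag;
	else
		model_region_flags &= ~flag;
	return 0;
}

/*
 * A plain C condition instead of #ifdef: both branches are always
 * compiled and type-checked, and the compiler can drop the call when
 * the option is 0.
 */
int model_mark_kho_scratch(void)
{
	if (MODEL_IS_ENABLED(MODEL_CONFIG_KHO_SCRATCH) && model_is_kho_boot())
		return model_setclr_flag(1, MODEL_FLAG_KHO_SCRATCH);
	return 0;
}

/* Kept symmetric with the mark helper. */
int model_clear_kho_scratch(void)
{
	if (MODEL_IS_ENABLED(MODEL_CONFIG_KHO_SCRATCH) && model_is_kho_boot())
		return model_setclr_flag(0, MODEL_FLAG_KHO_SCRATCH);
	return 0;
}
```

In this model, neither helper touches the flags word unless model_kho_boot is
true, which is the behaviour the proposed change is aiming for on non-KHO boots.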