linux-kernel - Re: [PATCH v4] kexec: Enable CMA based contiguous allocation


Message-ID: <e35d0f10-74d3-40c3-a43c-0e96edf3a121@amazon.com>
Date: Mon, 26 May 2025 07:48:58 +0200
From: Alexander Graf <graf@...zon.com>
To: Baoquan He <bhe@...hat.com>
CC: <kexec@...ts.infradead.org>, <linux-kernel@...r.kernel.org>, "Pasha
 Tatashin" <pasha.tatashin@...een.com>, <nh-open-source@...zon.com>, "Zhongkun
 He" <hezhongkun.hzk@...edance.com>
Subject: Re: [PATCH v4] kexec: Enable CMA based contiguous allocation


On 26.05.25 05:09, Baoquan He wrote:
> On 05/21/25 at 03:29pm, Alexander Graf wrote:
>> When booting a new kernel with kexec_file, the kernel picks a target
>> location that the kernel should live at, then allocates random pages,
>> checks whether any of those pages magically happen to coincide with
>> a target address range and if so, uses them for that range.
>>
>> For every page allocated this way, it then creates a page list that the
>> relocation code - code that executes while all CPUs are off and we are
>> just about to jump into the new kernel - copies to their final memory
>> location. We cannot put them there before, because chances are pretty
>> good that at least some page in the target range is already in use by
>> the currently running Linux environment. Copying is happening from a
>> single CPU at RAM rate, which takes around 4-50 ms per 100 MiB.
>>
>> All of this is inefficient and error prone.
>>
>> To successfully kexec, we need to quiesce all devices of the outgoing
>> kernel so they don't scribble over the new kernel's memory. We have seen
>> cases where that does not happen properly (*cough* GIC *cough*) and hence
>> the new kernel was corrupted. This started a month-long journey to root
>> cause failing kexecs to eventually see memory corruption, because the new
>> kernel was corrupted severely enough that it could not emit output to
>> tell us about the fact that it was corrupted. By allocating memory for the
>> next kernel from a memory range that is guaranteed scribbling-free, we can
>> boot the next kernel up to a point where it is at least able to detect
>> corruption and maybe even stop it before it becomes severe. This increases
>> the chance for successful kexecs.
>>
>> Since kexec got introduced, Linux has gained the CMA framework which
>> can perform physically contiguous memory mappings, while keeping that
>> memory available for movable memory when it is not needed for contiguous
>> allocations. The default CMA allocator is for DMA allocations.
>>
>> This patch adds logic to the kexec file loader to attempt to place the
>> target payload at a location allocated from CMA. If successful, it uses
>> that memory range directly instead of creating copy instructions during
>> the hot phase. To ensure that there is a safety net in case anything goes
>> wrong with the CMA allocation, it also adds a flag for user space to force
>> disable CMA allocations.
>>
>> Using CMA allocations has two advantages:
>>
>>    1) Faster by 4-50 ms per 100 MiB. There is no more need to copy in the
>>       hot phase.
> Wondering at what stage this 'faster by 4-50ms per 100MB' is obtained.
> Usually the kernel image + initrd won't be more than 100MB, and if the
> system is running and memory is heavily allocated, kexec loading could
> run into migration in the CMA area.


This patch optimizes the handover. Loading the kexec image is not really
time critical: your system still functions while you perform a kexec
load. It is at the time of the jump that you want to do as little work
as possible, to accelerate a kexec-based update flow.
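
To put a rough number on the copy work that disappears from the jump, here is a minimal Python sketch. It assumes only the 4-50 ms per 100 MiB figure quoted from the patch description and a linear scaling with payload size; the function names are hypothetical and this is not a measurement.

```python
MIB = 1024 * 1024

def copy_time_ms(payload_bytes, ms_per_100_mib):
    """Estimated relocation copy time, assuming it scales linearly with size."""
    return payload_bytes / (100 * MIB) * ms_per_100_mib

def copy_time_range_ms(payload_bytes, fast=4.0, slow=50.0):
    """Best and worst case, using the 4-50 ms per 100 MiB range from the patch."""
    return (copy_time_ms(payload_bytes, fast), copy_time_ms(payload_bytes, slow))

if __name__ == "__main__":
    for size_mib in (50, 100, 200):
        lo, hi = copy_time_range_ms(size_mib * MIB)
        print(f"{size_mib} MiB payload: {lo:.1f} to {hi:.1f} ms copied at jump time")
```

Under these assumptions, a payload placed in CMA saves that whole range at jump time, because nothing is left to copy in the hot phase.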


>
>>    2) More robust. Even if by accident some page is still in use for DMA,
>>       the new kernel image will be safe from that access because it resides
>>       in a memory region that is considered allocated in the old kernel and
>>       has a chance to reinitialize that component.
> Yeah, this is the significant benefit, given that a driver lacking
> .shutdown is likely to crash the kexec rebooted kernel. The thing is, a
> system that is heavily allocating memory could fail to allocate memory
> from CMA due to migration failure, and some systems may not even have a
> CMA area.


Correct. In those cases, the load falls back to the current scheme of
allocating random memory. All we're doing here is shifting the odds: both
of executing the kexec more quickly and of having less overlap with pages
that are potentially still in use.
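
The fallback behaviour can be modelled roughly as follows. This is a simplified Python sketch of the decision only, not the kernel code: `place_segment`, `cma_alloc`, `no_cma` and the `Segment` fields are hypothetical names, with `no_cma` standing in for the user space flag that force disables CMA allocations.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Segment:
    size: int
    placement: str            # "cma" or "page_list"
    needs_copy_at_jump: bool  # True if relocation code must copy pages

def place_segment(size: int,
                  cma_alloc: Callable[[int], Optional[int]],
                  no_cma: bool = False) -> Segment:
    """Try a contiguous CMA range first; otherwise use the page-list scheme."""
    if not no_cma:
        base = cma_alloc(size)  # hypothetical allocator: returns None on failure
        if base is not None:
            # Payload already sits at its final location, nothing to copy later.
            return Segment(size, "cma", needs_copy_at_jump=False)
    # No CMA area, migration failure, or CMA disabled: current behaviour.
    return Segment(size, "page_list", needs_copy_at_jump=True)
```

For example, `place_segment(4096, lambda n: None)` models a system without a usable CMA area and yields a `page_list` placement that still needs the copy at jump time.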


Alex





Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597