Message-ID: <e9c5c247-85fb-43f1-9aa8-47d62321f37b@redhat.com>
Date: Wed, 28 May 2025 23:01:04 +0200
From: David Hildenbrand <david@...hat.com>
To: Baoquan He <bhe@...hat.com>, Donald Dutile <ddutile@...hat.com>,
 Jiri Bohac <jbohac@...e.cz>
Cc: Vivek Goyal <vgoyal@...hat.com>, Dave Young <dyoung@...hat.com>,
 kexec@...ts.infradead.org, Philipp Rudo <prudo@...hat.com>,
 Pingfan Liu <piliu@...hat.com>, Tao Liu <ltao@...hat.com>,
 linux-kernel@...r.kernel.org, David Hildenbrand <dhildenb@...hat.com>,
 Michal Hocko <mhocko@...e.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA

On 04.03.25 05:20, Baoquan He wrote:
> On 03/03/25 at 09:17am, Donald Dutile wrote:
>>
>>
>> On 3/3/25 3:25 AM, David Hildenbrand wrote:
>>> On 20.02.25 17:48, Jiri Bohac wrote:
>>>> Hi,
>>>>
>>>> this series implements a way to reserve additional crash kernel
>>>> memory using CMA.
>>>>
>>>> Link to the v1 discussion:
>>>> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
>>>> See below for the changes since v1 and how concerns from the
>>>> discussion have been addressed.
>>>>
>>>> Currently, none of the memory reserved for the crash kernel is
>>>> usable by the 1st (production) kernel. It is also unmapped so
>>>> that it can't be corrupted by the fault that will eventually
>>>> trigger the crash.
>>>> This makes sense for the memory actually used by the kexec-loaded
>>>> crash kernel image and initrd and the data prepared during the
>>>> load (vmcoreinfo, ...). However, the reserved space needs to be
>>>> much larger than that to provide enough run-time memory for the
>>>> crash kernel and the kdump userspace. Estimating the amount of
>>>> memory to reserve is difficult. Being too careful makes kdump
>>>> likely to end in OOM, being too generous takes even more memory
>>>> from the production system. Also, only a single contiguous block
>>>> can be reserved (or two, with the "low" suffix). I've seen
>>>> systems where this fails because the physical
>>>> memory is fragmented.
>>>>
>>>> By reserving additional crashkernel memory from CMA, the main
>>>> crashkernel reservation can be just large enough to fit the
>>>> kernel and initrd image, minimizing the memory taken away from
>>>> the production system. Most of the run-time memory for the crash
>>>> kernel will be memory previously available to userspace in the
>>>> production system. As this memory is no longer wasted, the
>>>> reservation can be done with a generous margin, making kdump more
>>>> reliable. Kernel memory that we need to preserve for dumping is
>>>> never allocated from CMA. User data is typically not dumped by
>>>> makedumpfile. When dumping of user data is intended, this new
>>>> CMA reservation cannot be used.
>>>
>>>
>>> Hi,
>>>
>>> I'll note that your comment about "user space" currently holds, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation can end up in there.
>>>
>>> Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
>>> * memory balloon: pages we never want to dump either way
>>> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
>>> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>>>
>>> Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
>>>
>> IOMMUFD is a near-term candidate for user page tables with multi-stage IOMMU support; it is going through upstream review at the moment.
>> Just saying that David's case will become the norm in high-end VMs with performance-enhanced, guest-driven IOMMU support (for GPUs).
> 
> Thanks to both of you for the valuable input, David and Don. I agree: we
> can argue that not every system has a balloon driver or swap enabled
> today, but future extension of migration to other kernel allocations
> could become an obstacle we cannot work around.
> 
> If we knew for sure that this feature would turn out to be a bad idea,
> we would need to stop it in advance.

Sorry for the late reply.

I think we just have to be careful to document it properly -- especially 
the shortcomings and that this feature might become a problem in the 
future. Movable user-space page tables getting placed on CMA memory 
would probably not be a problem if we don't care about ... user-space 
data either way.
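
To make that concrete, the pattern referred to above (__GFP_MOVABLE plus 
__SetPageMovable()) looks roughly like the sketch below. This is only an 
illustration against the movable_operations API found in recent kernels; 
the names my_isolate/my_migrate/my_putback and my_track_page() are made 
up, and exact signatures vary between kernel versions. Pages allocated 
with __GFP_MOVABLE may be placed in MIGRATE_CMA pageblocks, and the 
registration is what lets compaction / CMA claiming migrate them away:

#include <linux/migrate.h>
#include <linux/mm.h>

/* Hypothetical non-LRU movable user of __GFP_MOVABLE pages. */
static bool my_isolate(struct page *page, isolate_mode_t mode)
{
	/* detach the page from our own tracking; pretend it always works */
	return true;
}

static int my_migrate(struct page *dst, struct page *src,
		      enum migrate_mode mode)
{
	/* a real user copies the contents and re-points its references */
	return MIGRATEPAGE_SUCCESS;
}

static void my_putback(struct page *page)
{
	/* undo my_isolate() when migration fails */
}

static const struct movable_operations my_mops = {
	.isolate_page = my_isolate,
	.migrate_page = my_migrate,
	.putback_page = my_putback,
};

static void my_track_page(struct page *page)
{
	/* mark the page non-LRU movable so it may sit in a CMA pageblock */
	__SetPageMovable(page, &my_mops);
}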

The whole "Direct I/O takes max 1s" part is a bit shaky. Maybe it could 
be configurable how long to wait? 10s is certainly "safer".

But maybe, in the target use case (VMs), direct I/O will not be that common.
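
Not something the series implements, just to illustrate the "make it 
configurable" idea: the wait could be exposed as a boot-time tunable 
instead of a hard-coded 1s. All names below (cma_dio_wait_ms, 
crashkernel_cma_dio_wait_ms, crash_cma_wait_for_dio) are hypothetical:

#include <linux/delay.h>
#include <linux/moduleparam.h>

/* default stays at 1s; "crashkernel_cma_dio_wait_ms=10000" on the
 * kernel command line would give the "safer" 10s */
static unsigned int cma_dio_wait_ms = 1000;
core_param(crashkernel_cma_dio_wait_ms, cma_dio_wait_ms, uint, 0444);

static void crash_cma_wait_for_dio(void)
{
	/*
	 * Give direct I/O that the first kernel started into the CMA
	 * region time to land before the kdump kernel reuses that memory.
	 */
	msleep(cma_dio_wait_ms);
}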

-- 
Cheers,

David / dhildenb

