Message-ID: <04904e86-5b5f-4aa1-a120-428dac119189@redhat.com>
Date: Mon, 3 Mar 2025 09:25:30 +0100
From: David Hildenbrand <david@...hat.com>
To: Jiri Bohac <jbohac@...e.cz>, Baoquan He <bhe@...hat.com>,
Vivek Goyal <vgoyal@...hat.com>, Dave Young <dyoung@...hat.com>,
kexec@...ts.infradead.org
Cc: Philipp Rudo <prudo@...hat.com>, Donald Dutile <ddutile@...hat.com>,
Pingfan Liu <piliu@...hat.com>, Tao Liu <ltao@...hat.com>,
linux-kernel@...r.kernel.org, David Hildenbrand <dhildenb@...hat.com>,
Michal Hocko <mhocko@...e.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
On 20.02.25 17:48, Jiri Bohac wrote:
> Hi,
>
> this series implements a way to reserve additional crash kernel
> memory using CMA.
>
> Link to the v1 discussion:
> https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
> See below for the changes since v1 and how concerns from the
> discussion have been addressed.
>
> Currently, none of the memory reserved for the crash kernel is usable
> by the 1st (production) kernel. It is also unmapped so that it can't
> be corrupted by the fault that will eventually trigger the crash.
> This makes sense for the memory actually used by the kexec-loaded
> crash kernel image and initrd and the data prepared during the
> load (vmcoreinfo, ...). However, the reserved space needs to be
> much larger than that to provide enough run-time memory for the
> crash kernel and the kdump userspace. Estimating the amount of
> memory to reserve is difficult. Being too careful makes kdump
> likely to end in OOM, being too generous takes even more memory
> from the production system. Also, the reservation only allows
> reserving a single contiguous block (or two with the ",low"
> suffix). I've seen systems where this fails because the physical
> memory is fragmented.
>
> By reserving additional crashkernel memory from CMA, the main
> crashkernel reservation can be just large enough to fit the
> kernel and initrd image, minimizing the memory taken away from
> the production system. Most of the run-time memory for the crash
> kernel will be memory previously available to userspace in the
> production system. As this memory is no longer wasted, the
> reservation can be done with a generous margin, making kdump more
> reliable. Kernel memory that we need to preserve for dumping is
> never allocated from CMA. User data is typically not dumped by
> makedumpfile. When dumping of user data is intended, this new CMA
> reservation cannot be used.
Hi,
I'll note that your comment about "user space" holds today, but will
likely not hold in the long run. The assumption you are making is that
only user-space memory will be allocated from MIGRATE_CMA pageblocks,
which is not necessarily true: any movable allocation can end up in there.
Besides LRU folios (user space memory and the pagecache), we already
support migration of some kernel allocations using the non-lru migration
framework. Such allocations (which use __GFP_MOVABLE, see
__SetPageMovable()) currently only include
* memory balloon: pages we never want to dump either way
* zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
* z3fold (->zpool): only used by zswap (-> compressed LRU pages)
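
For reference, opting into that framework boils down to something like
the following (purely illustrative stubs against the current mainline
API; none of this is from the series, and a real user would have actual
bookkeeping in the callbacks):

#include <linux/gfp.h>
#include <linux/migrate.h>
#include <linux/pagemap.h>

static bool demo_isolate(struct page *page, isolate_mode_t mode)
{
	/* a real driver would pin its internal metadata here */
	return true;
}

static int demo_migrate(struct page *dst, struct page *src,
			enum migrate_mode mode)
{
	/* copy the payload and redirect internal pointers from src to dst */
	return 0;
}

static void demo_putback(struct page *page)
{
	/* isolation was aborted; undo whatever demo_isolate() did */
}

static const struct movable_operations demo_mops = {
	.isolate_page	= demo_isolate,
	.migrate_page	= demo_migrate,
	.putback_page	= demo_putback,
};

static struct page *demo_alloc_movable(void)
{
	struct page *page = alloc_page(GFP_KERNEL | __GFP_MOVABLE);

	if (page) {
		lock_page(page);
		__SetPageMovable(page, &demo_mops);
		unlock_page(page);
	}
	return page;
}

Such an allocation is placed in movable/CMA pageblocks just like user
memory, even though it is kernel memory.
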
Just imagine if we support migration of other kernel allocations, such
as user page tables. The dump would be missing important information.
Once that happens, it will become a lot harder to judge whether CMA can
be used or not. At least, the kernel could bail out/warn for these
kernel configs.
>
> There are five patches in this series:
>
> The first adds a new ",cma" suffix to the recently introduced generic
> crashkernel parsing code. parse_crashkernel() takes one more
> argument to store the cma reservation size.
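
(Just to make sure I read this right: I assume the suffix is used like
the existing ",high"/",low" variants, i.e. something along the lines of

	crashkernel=192M crashkernel=1G,cma

where the sizes are arbitrary examples and the plain reservation only
needs to fit the kernel and initrd, while the ,cma part provides the
run-time memory.)
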
>
> The second patch implements reserve_crashkernel_cma() which
> performs the reservation. If the requested size is not available
> in a single range, multiple smaller ranges will be reserved.
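
For my own understanding, I'd expect the fallback to look roughly like
the sketch below (not the actual patch; it reuses the existing
cma_declare_contiguous() bootstrap interface, and the array/limit names
are invented here):

#include <linux/cma.h>
#include <linux/init.h>
#include <linux/printk.h>

#define CRASHK_CMA_RANGES_MAX	16

static struct cma *crashk_cma_ranges[CRASHK_CMA_RANGES_MAX];
static int crashk_cma_cnt;

void __init reserve_crashkernel_cma(unsigned long long cma_size)
{
	unsigned long long request = cma_size;

	while (cma_size && crashk_cma_cnt < CRASHK_CMA_RANGES_MAX) {
		struct cma *res;

		if (request > cma_size)
			request = cma_size;

		/* try to grab the largest chunk still outstanding ... */
		if (!cma_declare_contiguous(0, request, 0, 0, 0, false,
					    "crashkernel", &res)) {
			crashk_cma_ranges[crashk_cma_cnt++] = res;
			cma_size -= request;
			continue;
		}

		/* ... and retry with half the size when that fails */
		request /= 2;
		if (!request)
			break;
	}

	if (cma_size)
		pr_warn("crashkernel: %llu bytes of CMA not reserved\n",
			cma_size);
}
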
>
> The third patch updates Documentation/, explicitly mentioning the
> potential DMA corruption of the CMA-reserved memory.
>
> The fourth patch adds a short delay before booting the kdump
> kernel, allowing pending DMA transfers to finish.
What does "short" mean? At least in theory, long-term pinning is
forbidden for MIGRATE_CMA, so we should not have such pages mapped into
an IOMMU where DMA can happily keep going on for quite a while.
But that assumes that our old kernel is not buggy, and doesn't end up
mapping these pages into an IOMMU where DMA will just continue. I recall
that DRM might currently be a problem, described here [1].
If kdump stops working as expected precisely when the old kernel is buggy,
doesn't that partially defeat the purpose of kdump (-> debugging bugs in
the old kernel)?
[1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
--
Cheers,
David / dhildenb