Message-ID: <Z8Z/gnbtiXT9QAZr@MiWiFi-R3L-srv>
Date: Tue, 4 Mar 2025 12:20:18 +0800
From: Baoquan He <bhe@...hat.com>
To: Donald Dutile <ddutile@...hat.com>,
	David Hildenbrand <david@...hat.com>, Jiri Bohac <jbohac@...e.cz>
Cc: Vivek Goyal <vgoyal@...hat.com>, Dave Young <dyoung@...hat.com>,
	kexec@...ts.infradead.org, Philipp Rudo <prudo@...hat.com>,
	Pingfan Liu <piliu@...hat.com>, Tao Liu <ltao@...hat.com>,
	linux-kernel@...r.kernel.org,
	David Hildenbrand <dhildenb@...hat.com>,
	Michal Hocko <mhocko@...e.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA

On 03/03/25 at 09:17am, Donald Dutile wrote:
> 
> 
> On 3/3/25 3:25 AM, David Hildenbrand wrote:
> > On 20.02.25 17:48, Jiri Bohac wrote:
> > > Hi,
> > > 
> > > this series implements a way to reserve additional crash kernel
> > > memory using CMA.
> > > 
> > > Link to the v1 discussion:
> > > https://lore.kernel.org/lkml/ZWD_fAPqEWkFlEkM@dwarf.suse.cz/
> > > See below for the changes since v1 and how concerns from the
> > > discussion have been addressed.
> > > 
> > > Currently, none of the memory reserved for the crash kernel is
> > > usable by the 1st (production) kernel. It is also unmapped so that
> > > it can't be corrupted by the fault that will eventually trigger the
> > > crash.
> > > This makes sense for the memory actually used by the kexec-loaded
> > > crash kernel image and initrd and the data prepared during the
> > > load (vmcoreinfo, ...). However, the reserved space needs to be
> > > much larger than that to provide enough run-time memory for the
> > > crash kernel and the kdump userspace. Estimating the amount of
> > > memory to reserve is difficult. Being too conservative makes kdump
> > > likely to end in OOM, while being too generous takes even more memory
> > > from the production system. Also, the reservation only allows
> > > reserving a single contiguous block (or two with the "low"
> > > suffix). I've seen systems where this fails because the physical
> > > memory is fragmented.
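
For comparison, a conventional reservation today is a single contiguous
block, optionally paired with a "low" block (the sizes below are invented
examples, not recommendations):

	crashkernel=512M
	crashkernel=512M,high crashkernel=128M,low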
> > > 
> > > By reserving additional crashkernel memory from CMA, the main
> > > crashkernel reservation can be just large enough to fit the
> > > kernel and initrd image, minimizing the memory taken away from
> > > the production system. Most of the run-time memory for the crash
> > > kernel will be memory previously available to userspace in the
> > > production system. As this memory is no longer wasted, the
> > > reservation can be done with a generous margin, making kdump more
> > > reliable. Kernel memory that we need to preserve for dumping is
> > > never allocated from CMA. User data is typically not dumped by
> > > makedumpfile. When dumping of user data is intended, this new CMA
> > > reservation cannot be used.
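
With the series, the idea above could plausibly be expressed as a small
conventional reservation plus a generous CMA one, assuming the new ",cma"
suffix introduced by this series combines with the base reservation the
way the existing ",high"/",low" variants do (sizes invented):

	crashkernel=192M crashkernel=1G,cma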
> > 
> > 
> > Hi,
> > 
> > I'll note that your comment about "user space" holds today, but will likely not hold in the long run. The assumption you are making is that only user-space memory will be allocated from MIGRATE_CMA, which is not necessarily the case. Any movable allocation will end up in there.
> > 
> > Besides LRU folios (user space memory and the pagecache), we already support migration of some kernel allocations using the non-lru migration framework. Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently only include
> > * memory balloon: pages we never want to dump either way
> > * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> > * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
> > 
> > Just imagine if we support migration of other kernel allocations, such as user page tables. The dump would be missing important information.
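
To make the movability point concrete, here is a minimal sketch of how a
kernel user opts into the non-lru migration framework mentioned above.
struct movable_operations and __SetPageMovable() are the real interfaces
(see <linux/migrate.h>); the driver and its callbacks are invented:

#include <linux/migrate.h>

/* invented driver callbacks; only the registration pattern is real */
static bool my_isolate(struct page *page, isolate_mode_t mode)
{
	/* detach the page from the driver's own tracking */
	return true;
}

static int my_migrate(struct page *dst, struct page *src,
		      enum migrate_mode mode)
{
	/* move driver state from src to dst, then release src */
	return 0;
}

static void my_putback(struct page *page)
{
	/* isolation was cancelled; reattach the page */
}

static const struct movable_operations my_mops = {
	.isolate_page	= my_isolate,
	.migrate_page	= my_migrate,
	.putback_page	= my_putback,
};

/* any page marked like this may be placed in, and migrated out of,
 * MIGRATE_CMA pageblocks (caller must hold the page lock) */
static void my_mark_movable(struct page *page)
{
	__SetPageMovable(page, &my_mops);
}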
> > 
> IOMMUFD is a near-term candidate for user page tables, with multi-stage iommu support currently going through upstream review.
> Just saying that David's case will be the norm in high-end VMs with performance-enhanced, guest-driven iommu support (for GPUs).

Thanks to both of you, David and Don, for the valuable input. I agree that
one can argue not every system has a memory balloon or swap enabled today,
but future extension of migration to other kernel allocations could become
an obstacle we can't work around.

If we knew for sure this feature would turn out badly, we would need to
stop it in advance.

Thoughts, Jiri?

> 
> > Once that happens, it will become a lot harder to judge whether CMA can be used or not. At least, the kernel could bail out/warn for these kernel configs.
> > 
> I don't think the aforementioned work is focused on using CMA, but given CMA's performance benefits, it won't take long before that becomes the next perf-improvement step taken.
> 
> > > 
> > > There are five patches in this series:
> > > 
> > > The first adds a new ",cma" suffix to the recently introduced generic
> > > crashkernel parsing code. parse_crashkernel() takes one more
> > > argument to store the cma reservation size.
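
For reference, the generic parser in mainline looks roughly like this,
and the series presumably threads one more out-parameter through it (the
"after" prototype is a sketch, not copied from the patches):

	/* current (kernel/crash_reserve.c, roughly): */
	int __init parse_crashkernel(char *cmdline,
				     unsigned long long system_ram,
				     unsigned long long *crash_size,
				     unsigned long long *crash_base,
				     unsigned long long *low_size,
				     bool *high);

	/* sketch with the extra ",cma" out-parameter: */
	int __init parse_crashkernel(char *cmdline,
				     unsigned long long system_ram,
				     unsigned long long *crash_size,
				     unsigned long long *crash_base,
				     unsigned long long *low_size,
				     unsigned long long *cma_size,
				     bool *high);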
> > > 
> > > The second patch implements reserve_crashkernel_cma() which
> > > performs the reservation. If the requested size is not available
> > > in a single range, multiple smaller ranges will be reserved.
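
One plausible shape for that fallback, purely as an illustration (the
function and floor constant are invented; cma_declare_contiguous() is
the real early-boot CMA reservation interface):

#include <linux/cma.h>
#include <linux/sizes.h>

#define CMA_CHUNK_FLOOR	SZ_64M	/* invented lower bound for the sketch */

static void __init reserve_crashkernel_cma_sketch(unsigned long long request)
{
	unsigned long long chunk = request;
	struct cma *res;

	while (request && chunk >= CMA_CHUNK_FLOOR) {
		if (chunk > request)
			chunk = request;
		/* base/limit/alignment of 0 let CMA pick a suitable range */
		if (cma_declare_contiguous(0, chunk, 0, 0, 0, false,
					   "crashkernel", &res)) {
			chunk >>= 1;	/* no single range that big: split */
			continue;
		}
		request -= chunk;
	}
}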
> > > 
> > > The third patch updates Documentation/, explicitly mentioning the
> > > potential DMA corruption of the CMA-reserved memory.
> > > 
> > > The fourth patch adds a short delay before booting the kdump
> > > kernel, allowing pending DMA transfers to finish.
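
Presumably something as simple as this in the crash path (both names
below are invented; mdelay() from <linux/delay.h> is real):

	/* sketch: give in-flight DMA into the CMA ranges a chance to drain */
	if (crashk_cma_active)
		mdelay(CRASH_CMA_DMA_DELAY_MS);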
> > 
> > 
> > What does "short" mean? At least in theory, long-term pinning is forbidden for MIGRATE_CMA, so we should not have such pages mapped into an iommu where DMA can happily keep going on for quite a while.
> > 
> Hmmm, in the case I mentioned above, should there be a kexec hook in multi-stage IOMMU support for the hypervisor/VMM to invalidate/shut off stage-2 mappings asap (a multi-microsecond process) so
> that in-flight DMA from VMs is cut off quickly? Is that already done today (due to 'simple', single-stage device assignment in a VM)?
> 
> > But that assumes that our old kernel is not buggy, and doesn't end up mapping these pages into an IOMMU where DMA will just continue. I recall that DRM might currently be a problem, described here [1].
> > 
> > If kdump stops working as expected when the old kernel is buggy, doesn't that partially defeat the purpose of kdump (-> debugging bugs in the old kernel)?
> > 
> > 
> > [1] https://lore.kernel.org/all/Z6MV_Y9WRdlBYeRs@phenom.ffwll.local/T/#u
> > 
> 

