Message-ID: <Z9GqB1cRSD-IQM_s@dwarf.suse.cz>
Date: Wed, 12 Mar 2025 16:36:39 +0100
From: Jiri Bohac <jbohac@...e.cz>
To: David Hildenbrand <david@...hat.com>
Cc: Baoquan He <bhe@...hat.com>, Vivek Goyal <vgoyal@...hat.com>,
Dave Young <dyoung@...hat.com>, kexec@...ts.infradead.org,
Philipp Rudo <prudo@...hat.com>, Donald Dutile <ddutile@...hat.com>,
Pingfan Liu <piliu@...hat.com>, Tao Liu <ltao@...hat.com>,
linux-kernel@...r.kernel.org,
David Hildenbrand <dhildenb@...hat.com>,
Michal Hocko <mhocko@...e.cz>
Subject: Re: [PATCH v2 0/5] kdump: crashkernel reservation from CMA
On Mon, Mar 03, 2025 at 09:25:30AM +0100, David Hildenbrand wrote:
> On 20.02.25 17:48, Jiri Bohac wrote:
> >
> > By reserving additional crashkernel memory from CMA, the main
> > crashkernel reservation can be just large enough to fit the
> > kernel and initrd image, minimizing the memory taken away from
> > the production system. Most of the run-time memory for the crash
> > kernel will be memory previously available to userspace in the
> > production system. As this memory is no longer wasted, the
> > reservation can be done with a generous margin, making kdump more
> > reliable. Kernel memory that we need to preserve for dumping is
> > never allocated from CMA. User data is typically not dumped by
> > makedumpfile. When dumping of user data is intended this new CMA
> > reservation cannot be used.
>
> I'll note that your comment about "user space" is currently the case, but
> will likely not hold in the long run. The assumption you are making is that
> only user-space memory will be allocated from MIGRATE_CMA, which is not
> necessarily the case. Any movable allocation will end up in there.
>
> Besides LRU folios (user space memory and the pagecache), we already support
> migration of some kernel allocations using the non-lru migration framework.
> Such allocations (which use __GFP_MOVABLE, see __SetPageMovable()) currently
> only include
> * memory balloon: pages we never want to dump either way
> * zsmalloc (->zpool): only used by zswap (-> compressed LRU pages)
> * z3fold (->zpool): only used by zswap (-> compressed LRU pages)
>
> Just imagine if we support migration of other kernel allocations, such as
> user page tables. The dump would be missing important information.
>
> Once that happens, it will become a lot harder to judge whether CMA can be
> used or not. At least, the kernel could bail out/warn for these kernel
> configs.
Thanks for pointing this out. I still don't see this as a
roadblock for my primary use case of the CMA reservation:
getting at least some (less reliable and potentially
less useful) kdump where the user is not prepared to sacrifice
the memory needed for the standard reservation and where the
only other option is no kdump at all.
A lot can still be analyzed with a vmcore that is missing
those __GFP_MOVABLE pages, even if/when some user page tables
are missing.
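
(For context: a vmcore filtered with e.g. makedumpfile -d 31
already excludes user pages -- that's dump level bit 8 -- so
such dumps would not have contained the CMA-backed user memory
in the first place.)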
I'll send a v3 with the documentation part updated to better
describe this.
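
For illustration, the intended usage is along these lines
(the sizes are made up; the ,cma suffix is what this series
adds -- see patch 1/5 for the exact syntax):

  crashkernel=128M crashkernel=768M,cma

i.e. a small fixed reservation that only needs to fit the
kernel and initrd, plus a generous CMA part that userspace
keeps using until a crash.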
> > The fourth patch adds a short delay before booting the kdump
> > kernel, allowing pending DMA transfers to finish.
>
>
> What does "short" mean? At least in theory, long-term pinning is forbidden
> for MIGRATE_CMA, so we should not have such pages mapped into an iommu where
> DMA can happily keep going on for quite a while.
See patch 4/5 in the series:
I propose 1 second, which is a negligible time from the kdump
POV but should, I assume, be plenty for non-long-term pins in
MIGRATE_CMA.
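
In code the idea is nothing more than a busy wait on the crash
path (a sketch only, with a hypothetical function name, not the
literal patch):

	#include <linux/delay.h>

	/*
	 * Run on the crash path, after IRQs are disabled and
	 * before jumping to the kdump kernel: give short-lived
	 * DMA into MIGRATE_CMA pages a chance to finish.
	 */
	static void crash_cma_dma_settle(void)
	{
		mdelay(1000);	/* 1 s is negligible for kdump */
	}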
> But that assumes that our old kernel is not buggy, and doesn't end up
> mapping these pages into an IOMMU where DMA will just continue. I recall
> that DRM might currently be a problem, described here [1].
>
> If kdump starts not working as expected in case our old kernel is buggy,
> doesn't that partially destroy the purpose of kdump (-> debug bugs in the
> old kernel)?
Again, this is meant as a kind of "lightweight best-effort
kdump". If there is a bug that causes the crash _and_ a bug in
a driver that hogs MIGRATE_CMA and maps it into the IOMMU, then
this lightweight kdump may break. Then it's time to sacrifice
more memory and use a normal crashkernel reservation.
It's not that any bug in the old kernel will break it; only a
very specific kind of bug potentially can.
I see this whole thing as particularly useful for VMs. Unlike
big physical machines, where taking away a couple of hundred
MBs of memory for kdump does not really hurt, a VM can ideally
be given just enough memory for its particular task. This can
often be less than 1 GB. A proper kdump reservation needs a
couple of hundred MBs, i.e. a very large proportion of the VM's
memory. On a virtualization host running hundreds or thousands
of such VMs this means a huge waste of memory. And VMs often
don't use many drivers for real hardware, which decreases the
risk of hitting a buggy driver like this.
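
To put rough (made-up) numbers on it: a 1 GB VM with a
conventional 256 MB reservation gives up a quarter of its
memory; across 1000 such VMs that is ~250 GB of RAM sitting
idle. With, say, a 64 MB fixed reservation plus a CMA part, the
permanently lost memory drops to a quarter of that, while the
CMA part keeps serving userspace until a crash.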
Thanks,
--
Jiri Bohac <jbohac@...e.cz>
SUSE Labs, Prague, Czechia