Message-ID: <20250407141626.GB1557073@nvidia.com>
Date: Mon, 7 Apr 2025 11:16:26 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Mike Rapoport <rppt@...nel.org>
Cc: Pratyush Yadav <ptyadav@...zon.de>,
	Changyuan Lyu <changyuanl@...gle.com>, linux-kernel@...r.kernel.org,
	graf@...zon.com, akpm@...ux-foundation.org, luto@...nel.org,
	anthony.yznaga@...cle.com, arnd@...db.de, ashish.kalra@....com,
	benh@...nel.crashing.org, bp@...en8.de, catalin.marinas@....com,
	dave.hansen@...ux.intel.com, dwmw2@...radead.org,
	ebiederm@...ssion.com, mingo@...hat.com, jgowans@...zon.com,
	corbet@....net, krzk@...nel.org, mark.rutland@....com,
	pbonzini@...hat.com, pasha.tatashin@...een.com, hpa@...or.com,
	peterz@...radead.org, robh+dt@...nel.org, robh@...nel.org,
	saravanak@...gle.com, skinsburskii@...ux.microsoft.com,
	rostedt@...dmis.org, tglx@...utronix.de, thomas.lendacky@....com,
	usama.arif@...edance.com, will@...nel.org,
	devicetree@...r.kernel.org, kexec@...ts.infradead.org,
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
	linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory
 preservation

On Sun, Apr 06, 2025 at 07:11:14PM +0300, Mike Rapoport wrote:
> > > > We know what the future use case is for the folio preservation, all
> > > > the drivers and the iommu are going to rely on this.
> > > 
> > > We don't know how much of the preservation will be based on folios.
> > 
> > I think almost all of it. Where else does memory come from for drivers?
> 
> alloc_pages()? vmalloc()?

alloc_pages is a 0 order "folio". vmalloc is an array of 0 order
folios (?)

> These don't use struct folio unless there's __GFP_COMP in alloc_pages()
> call, and in my mind "folio" is memory described by struct folio.

I understand Matthew wants to get rid of non __GFP_COMP usage.

> > > Most drivers do not use folios
> > 
> > Yes they do, either through kmalloc or through alloc_page/etc. "folio"
> > here is just some generic word meaning memory from the buddy allocator.
> 
> How about we find some less ambiguous term? Using "folio" for memory
> returned from kmalloc is really confusing. And even alloc_pages() does not
> treat all memory it returns as folios.
> 
> How about we call them ranges? ;-)

memdescs if you want to be forward looking. They are not ranges.

The point very much is that they are well-defined allocations from the
buddy allocator that can be freed back to the buddy allocator. We
provide an API, sort of like alloc_pages/folio_alloc, to get the
pointer back out, and that is the only way to use it.
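To make that concrete, here is a minimal sketch of the pairing, using the
kho_preserve_folio()/kho_restore_folio() names already floated in this
thread; the helpers and exact signatures below are illustrative, not a
settled API:

        /* Sketch only: kho_* names are from this thread, signatures illustrative */
        static int preserve_one(unsigned int order, phys_addr_t *phys)
        {
                struct folio *folio = folio_alloc(GFP_KERNEL, order);

                if (!folio)
                        return -ENOMEM;
                *phys = virt_to_phys(folio_address(folio));
                return kho_preserve_folio(folio);       /* records pfn + order */
        }

        /* In the next kernel the same allocation comes back as a live folio */
        static void release_one(phys_addr_t phys)
        {
                struct folio *folio = kho_restore_folio(phys);  /* illustrative */

                if (folio)
                        folio_put(folio);       /* frees back to the buddy allocator */
        }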

> > > and for preserving memfd* and hugetlb we'd need to have some dance
> > > around that memory anyway.
> > 
> > memfd is all folios - what do you mean?
> 
> memfd is struct folios indeed, but some of them are hugetlb, and even for
> those that are not I'm not sure that kho_preserve_folio(struct folio *) /
> kho_restore_folio(some token?) will be enough. I totally might be wrong
> here.

Well, that is the point: we want it to be enough and we need to make
it work. Ranges are the wrong thing to fall back to if there are
problems.

> > hugetlb is moving toward folios.. eg guestmemfd is supposed to be
> > taking the hugetlb special stuff and turning it into folios.
> 
> At some point yes. But I really hope KHO can happen faster than hugetlb and
> guestmemfd convergence.

Regardless, it is still representable as a near-folio thing since
there are struct pages backing hugetlbfs.

> > > So I think kho_preserve_folio() would be a part of the fdbox or
> > > whatever that functionality will be called.
> > 
> > It is part of KHO. Preserving the folios has to be sequenced with
> > starting the buddy allocator, and that is KHO's entire responsibility.
> 
> So if you call "folio" any memory range that comes from the page allocator, I
> do agree.

Yes

> But since it's not necessarily struct folio, and struct folio is
> mostly used with file descriptors, the kho_preserve_folio(struct folio *)
> API can be a part of fdbox.

KHO needs to provide a way to give back an allocated struct page/folio
that can be freed back to the buddy allocator, of the proper
order. Whatever you call that function, it belongs to KHO, as it is
KHO's primary responsibility to manage the buddy allocator and the
struct pages.

Today initializing the folio is the work required to do that.
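And the per-allocation state that has to survive the handover for that is
tiny; roughly something like this, where the struct name is purely
illustrative:

        /* Illustrative only: the minimum KHO must carry per preserved folio */
        struct kho_folio_record {
                unsigned long pfn;      /* start pfn, aligned to 1 << order */
                unsigned int order;     /* needed to rebuild the compound folio
                                         * and free it back at the right size */
        };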
 
> Preserving struct folio is one of several cases where we'd want to preserve
> ranges. There's the simplistic memblock case that does not care about any
> memdesc, there's memory returned from alloc_pages() without __GFP_COMP,
> there's vmalloc(), and of course there's memory with struct folio.

Non-struct-page memory is fundamentally different from struct-page
memory: we don't even start up the buddy allocator on it, and we don't
allocate struct pages for it.

This should be a completely different flow.

Buddy allocator memory should start up in the next kernel as allocated
multi-order "folios", with allocated struct pages and a working
folio_put()/etc to free them.

> I can't say I understand what you mean by "neutral struct folio", but we
> can't really use struct folio for memory that wasn't struct folio in the
> first place. There's a ton of checks for flags etc in mm core that could
> blow up if we use a wrong memdesc.

For instance, go look at how slab gets memory from the allocator:

        folio = (struct folio *)alloc_frozen_pages(flags, order);
        slab = folio_slab(folio);
        __folio_set_slab(folio);

I know the naming is tortured, but this is how it works right now. You
allocate "neutral" folios, then you change them into your
memdesc-specific subtype. And somehow we simultaneously call this thing
page, folio and slab memdesc :\

So for preservation it makes complete sense that you'd have a
'kho_restore_frozen_folios/pages()' that returns a struct page/struct
folio in the exact same state as though it was newly allocated.
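A restored folio would then slot into the same pattern as the slab snippet
above; a sketch, with kho_restore_frozen_folio() standing in as a
hypothetical name for that restore call:

        /* Sketch: the restore hands back a "neutral" folio in freshly-allocated
         * state, and the owner converts it to its memdesc subtype, exactly as
         * the slab path does. kho_restore_frozen_folio() is hypothetical. */
        folio = kho_restore_frozen_folio(phys, order);
        if (folio) {
                slab = folio_slab(folio);
                __folio_set_slab(folio);
        }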

> "folio" as "some generic word meaning memory from the buddy allocator" and
> range are quite the same thing.

Not quite: folios are constrained to be aligned powers of two, and we
expect the order to round trip through the system.

'ranges' are just ranges: no implied alignment, no round tripping of
the order.
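Concretely, "folio-shaped" means something like the check below, where pfn,
order and nr_pages are just stand-ins describing one preserved region:

        /* An order-N folio is 1 << N contiguous pages whose start pfn is
         * aligned to 1 << N; a plain range promises neither. */
        bool folio_shaped = IS_ALIGNED(pfn, 1UL << order) &&
                            nr_pages == (1UL << order);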

> > > Preserving folio orders with it is really straightforward and until we see
> > > some real data on how the entire KHO machinery is used, I'd prefer simple
> > > over anything else.
> > 
> > maple tree may not even work, as it has a very high bound on memory
> > usage if the preservation workload is small and fragmented. This is
> > why I didn't want to use a list of ranges in the first place.
> 
> But aren't "vast, vast amounts of memory will be high order and
> contiguous."? ;-)

Yes, if you have a 500GB host most likely something like 480GB will be
high order contiguous, then you have 20GB that has to be random
layout. That is still a lot of memory to eat up in a discontinuous
maple tree.
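Back of the envelope, just to put rough numbers on that, assuming 4K pages
and worst-case order-0 fragmentation of the 20GB:

        20GB / 4K                 ~= 5.2 million order-0 pages
        worst case, non-adjacent  ~= 5.2 million separate maple tree ranges
        1 bit per 4K page         ~= 16MB of bitmap for the whole 500GB,
                                     regardless of how fragmented it is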

> > It also doesn't work so well if you need to preserve the order too :\
> 
> It does. In the example I've sent there's an unsigned long to store
> "kho_mem_info_t", which definitely can contain order.

It mucks up the combining logic, since you can't combine maple tree
nodes with different orders, and now you have defeated the main
argument for using ranges :\

> > Until we know the workload(s) and cost out how much memory the maple tree
> > version will use, I don't think it is a good general starting point.
>  
> I did an experiment with preserving 8G of memory allocated with randomly
> chosen orders. For each order (0 to 10) I've got roughly 1000 "folios". I
> measured the time kho_mem_deserialize() takes with xarrays + bitmaps vs a
> maple tree based implementation. The maple tree outperformed by a factor
> of 10 and its serialized data used 6 times less memory.

That seems like it means most of your memory ended up contiguous and
the maple tree didn't split nodes to preserve order. :\ Also the
bitmap scanning to optimize the memblock reserve isn't implemented for
the xarray version, so I don't think this is representative.

Jason
