Message-ID: <Z_KnovvW7F2ZyzhX@kernel.org>
Date: Sun, 6 Apr 2025 19:11:14 +0300
From: Mike Rapoport <rppt@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Pratyush Yadav <ptyadav@...zon.de>,
	Changyuan Lyu <changyuanl@...gle.com>, linux-kernel@...r.kernel.org,
	graf@...zon.com, akpm@...ux-foundation.org, luto@...nel.org,
	anthony.yznaga@...cle.com, arnd@...db.de, ashish.kalra@....com,
	benh@...nel.crashing.org, bp@...en8.de, catalin.marinas@....com,
	dave.hansen@...ux.intel.com, dwmw2@...radead.org,
	ebiederm@...ssion.com, mingo@...hat.com, jgowans@...zon.com,
	corbet@....net, krzk@...nel.org, mark.rutland@....com,
	pbonzini@...hat.com, pasha.tatashin@...een.com, hpa@...or.com,
	peterz@...radead.org, robh+dt@...nel.org, robh@...nel.org,
	saravanak@...gle.com, skinsburskii@...ux.microsoft.com,
	rostedt@...dmis.org, tglx@...utronix.de, thomas.lendacky@....com,
	usama.arif@...edance.com, will@...nel.org,
	devicetree@...r.kernel.org, kexec@...ts.infradead.org,
	linux-arm-kernel@...ts.infradead.org, linux-doc@...r.kernel.org,
	linux-mm@...ck.org, x86@...nel.org
Subject: Re: [PATCH v5 09/16] kexec: enable KHO support for memory
 preservation

On Fri, Apr 04, 2025 at 11:30:31AM -0300, Jason Gunthorpe wrote:
> On Fri, Apr 04, 2025 at 04:53:13PM +0300, Mike Rapoport wrote:
> > > Maybe change the reserved regions code to put the region list in a
> > > folio and preserve the folio instead of using FDT as a "demo" for the
> > > functionality.
> > 
> > Folios are not available when we restore reserved regions, so this
> > just won't work.
> 
> You don't need the folio at that point, you just need the data in the
> page.
> 
> The folio would be freed after starting up the buddy allocator.

Maybe, but it seems a bit far-fetched to me.
 
> > > We know what the future use case is for the folio preservation, all
> > > the drivers and the iommu are going to rely on this.
> > 
> > We don't know how much of the preservation will be based on folios.
> 
> I think almost all of it. Where else does memory come from for drivers?

alloc_pages()? vmalloc()?
These don't use struct folio unless __GFP_COMP is passed to the
alloc_pages() call, and in my mind "folio" means memory described by
struct folio.
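
To illustrate the distinction I'm making, a minimal sketch:

	/* not a folio: plain pages, no compound metadata */
	struct page *page = alloc_pages(GFP_KERNEL, 4);

	/* a folio: folio_alloc() sets __GFP_COMP under the hood */
	struct folio *folio = folio_alloc(GFP_KERNEL, 4);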
 
> > Most drivers do not use folios
> 
> Yes they do, either through kmalloc or through alloc_page/etc. "folio"
> here is just some generic word meaning memory from the buddy allocator.

How about we find a less ambiguous term? Using "folio" for memory
returned from kmalloc() is really confusing. And even alloc_pages() does
not treat all memory it returns as folios.

How about we call them ranges? ;-)
 
> The big question on my mind is if we need a way to preserve slab
> objects as well..
> 
> > and for preserving memfd* and hugetlb we'd need to have some dance
> > around that memory anyway.
> 
> memfd is all folios - what do you mean?

memfd is indeed struct folios, but some of them are hugetlb, and even for
those that are not I'm not sure that kho_preserve_folio(struct folio *) /
kho_restore_folio(some token?) will be enough. I might be totally wrong
here.
 
> hugetlb is moving toward folios.. eg guestmemfd is supposed to be
> taking the hugetlb special stuff and turning it into folios.

At some point, yes. But I really hope KHO can happen faster than the
hugetlb and guestmemfd convergence.
 
> > So I think kho_preserve_folio() would be a part of the fdbox or
> > whatever that functionality will be called.
> 
> It is part of KHO. Preserving the folios has to be sequenced with
> starting the buddy allocator, and that is KHO's entire responsibility.

So if you call "folio" any memory range that comes from the page
allocator, I do agree. But since it's not necessarily struct folio, and
struct folio is mostly used with file descriptors, the
kho_preserve_folio(struct folio *) API can be part of fdbox.

Preserving struct folio is one of several cases where we'd want to
preserve ranges. There's the simplistic memblock case that does not care
about any memdesc, there's memory returned from alloc_pages() without
__GFP_COMP, there's vmalloc(), and of course there's memory with struct
folio.

But the basic KHO primitive should preserve ranges because they are the
common denominator of alloc_pages(), folio_alloc(), vmalloc() and memblock.
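
Roughly what I have in mind, with made-up names just to illustrate the
layering (a sketch, not a proposal for the actual signatures):

	/* the basic KHO primitive: remember a physical range */
	int kho_preserve_range(phys_addr_t phys, size_t size);

	/* folio preservation layered on top of ranges */
	static inline int kho_preserve_folio(struct folio *folio)
	{
		return kho_preserve_range(PFN_PHYS(folio_pfn(folio)),
					  folio_size(folio));
	}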
 
> I could see something like preserving slab being in a different layer,
> built on preserving folios.

Maybe, on top of ranges. slab is yet another memdesc.
 
> > Are they? 
> > The purpose of basic KHO is to make sure the memory we want to preserve is
> > not trampled over. Preserving folios with their orders means we need to
> > make sure memory range of the folio is preserved and we carry additional
> > information to actually recreate the folio object, in case it is needed and
> in case it is possible. Hugetlb, for instance, has its own way of
> initializing folios and just keeping the order won't be enough for that.
> 
> I expect many things will need a side-car datastructure to record that
> additional meta-data. hugetlb can start with folios, then switch them
> over to its non-folio stuff based on its metadata.
> 
> The point is the basic low level KHO mechanism is simple folios -
> memory from the buddy allocator with a neutral struct folio that the
> caller can then customize to its own memory descriptor type on restore.

I can't say I understand what you mean by "neutral struct folio", but we
can't really use struct folio for memory that wasn't struct folio in the
first place. There's a ton of checks for flags etc. in the mm core that
could blow up if we use the wrong memdesc.

Hence the use of page->private for the folio order. It's stable (for now)
and can be used by any page owner.
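
Something along these lines (a sketch of the idea, not the actual patch):

	/* preserve side: stash the order in page->private */
	set_page_private(folio_page(folio, 0), folio_order(folio));

	/* restore side: read it back before any memdesc exists */
	unsigned int order = page_private(page);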
 
> Eventually restore would allocate a caller specific memdesc and it
> wouldn't be "folios" at all. We just don't have the right words yet to
> describe this.
> 
> > As for the optimization of the memblock reserve path, currently this
> > is what hurts the most in my and Pratyush's experiments. They are not
> > very representative, but still, preserving lots of pages/folios spread
> > all over would have its toll on the mm initialization.
> 
> > And I don't think invasive changes to how the buddy and memory map
> > initialization work are the best way to move forward and optimize
> > that.
> 
> I'm pretty sure this is going to be the best performance path, but I
> have no idea how invasive it would be to the buddy allocator to make
> it work.

I'm not sure about the best performance, but if we are to completely
bypass memblock_reserve() we'd need alternative memory map and free list
initialization for KHO. I believe it's premature to target that at this
point.
 
> > Quite possibly we'd want to be able to minimize the number of *ranges*
> > that we preserve.
> 
> I'm not sure, that seems backwards to me, we really don't want to have
> KHO mem zones! So I think optimizing for, and thinking about ranges
> doesn't make sense.

"folio" as "some generic word meaning memory from the buddy allocator" and
range are quite the same thing.
 
> The big ranges will arise naturally because things like hugetlb
> reservations should all be contiguous and the resulting folios should
> all be allocated for the VM and also all be contiguous. So vast, vast
> amounts of memory will be high order and contiguous.

So there won't be a problem with too many memblock_reserve() calls then.
 
> > Preserving folio orders with it is really straightforward and until we see
> > some real data of how the entire KHO machinery is used, I'd prefer simple
> > over anything else.
> 
> mapletree may not even work as it has a very high bound on memory
> usage if the preservation workload is small and fragmented. This is
> why I didn't want to use a list of ranges in the first place.

But aren't "vast, vast amounts of memory will be high order and
contiguous."? ;-)

For small and fragmented workload bitmaps become really sparse and we are
wasting memory for nothing. Maple tree only tracks memory that is actually
used and coalesces adjacent ranges so although it's unbound in theory, in
practice it may be not that bad. 

> It also doesn't work so well if you need to preserve the order too :\

It does. In the example I've sent there's an unsigned long that stores
"kho_mem_info_t", which can definitely contain the order.

> Until we know the workload(s) and how much memory the maple tree
> version will use, I don't think it is a good general starting point.
 
I did an experiment with preserving 8G of memory allocated with randomly
chosen orders. For each order (0 to 10) I got roughly 1000 "folios". I
measured the time kho_mem_deserialize() takes with xarrays + bitmaps vs a
maple tree based implementation. The maple tree outperformed by a factor
of 10 and its serialized data used 6 times less memory.

> Jason

-- 
Sincerely yours,
Mike.
