Message-ID: <CA+CK2bDxnTEe9Ohq5zLuyF-jqgD0DPhfdq6z=yztUsXU5p5fSQ@mail.gmail.com>
Date: Mon, 22 Dec 2025 10:55:34 -0500
From: Pasha Tatashin <pasha.tatashin@...een.com>
To: Pratyush Yadav <pratyush@...nel.org>
Cc: Mike Rapoport <rppt@...nel.org>, Evangelos Petrongonas <epetron@...zon.de>, Alexander Graf <graf@...zon.com>, 
	Andrew Morton <akpm@...ux-foundation.org>, Jason Miu <jasonmiu@...gle.com>, 
	linux-kernel@...r.kernel.org, kexec@...ts.infradead.org, linux-mm@...ck.org, 
	nh-open-source@...zon.com
Subject: Re: [PATCH] kho: add support for deferred struct page init

> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
> > larger than a MAX_PAGE_ORDER chunk, it is mathematically impossible for
> > a single chunk to span multiple nodes.
>
> For folios, yes. The whole folio should only be in a single node. But we
> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
> be used to preserve an arbitrary size of memory and _that_ doesn't have
> to be in the same section. And if the memory is properly aligned, then
> it will end up being just one higher-order preservation in KHO.

To restore both pages and folios we use kho_restore_page(), which has the
following:

/*
 * deserialize_bitmap() only sets the magic on the head page. This magic
 * check also implicitly makes sure phys is order-aligned since for
 * non-order-aligned phys addresses, magic will never be set.
 */
if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
	return NULL;

My understanding is that the head page order can never exceed MAX_PAGE_ORDER,
which is why I am saying the chunk will be smaller than SECTION_SIZE. With
HugeTLB the order can exceed MAX_PAGE_ORDER, but in that case the page still
has to be within a single NID, since a huge page cannot be split across
multiple nodes.
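
To put it more concretely, I mean something like the below (just an
illustrative sketch, not existing code; kho_chunk_in_one_section() is a
made-up helper and it assumes sparsemem naming):

static inline bool kho_chunk_in_one_section(unsigned long pfn,
					    unsigned int order)
{
	/* A MAX_PAGE_ORDER block must fit within a single memory section. */
	BUILD_BUG_ON(MAX_PAGE_ORDER + PAGE_SHIFT > SECTION_SIZE_BITS);

	/*
	 * An order-aligned chunk whose order does not exceed MAX_PAGE_ORDER
	 * can therefore never cross a section boundary, and since node
	 * boundaries are section-aligned, it cannot cross a node either.
	 */
	return pfn_to_section_nr(pfn) ==
	       pfn_to_section_nr(pfn + (1UL << order) - 1);
}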

> >> > This approach seems to give us the best of both worlds: It avoids the
> >> > memblock dependency during restoration. It keeps the serial work in
> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
> >> > heavy lifting of tail page initialization to be done later in the boot
> >> > process, potentially in parallel, as you suggested.
> >>
> >> Here's another idea I have been thinking about, but never dug deep
> >> enough to figure out if it actually works.
> >>
> >> __init_page_from_nid() loops through all the zones for the node to find
> >> the zone id for the page. We can flip it the other way round and loop
> >> through all zones (on all nodes) to find out if the PFN spans that zone.
> >> Once we find the zone, we can directly call __init_single_page() on it.
> >> If a contiguous chunk of preserved memory lands in one zone, we can
> >> batch the init to save some time.
> >>
> >> Something like the below (completely untested):
> >>
> >>
> >>         static void kho_init_page(struct page *page)
> >>         {
> >>                 unsigned long pfn = page_to_pfn(page);
> >>                 struct zone *zone;
> >>
> >>                 for_each_zone(zone) {
> >>                         if (zone_spans_pfn(zone, pfn))
> >>                                 break;
> >>                 }
> >>
> >>                 __init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
> >>         }
> >>
> >> It doesn't do the batching I mentioned, but I think it at least gets the
> >> point across. And I think even this simple version would be a good first
> >> step.
> >>
> >> This lets us initialize the page from kho_restore_folio() without having
> >> to rely on memblock being alive, and saves us from doing work during
> >> early boot. We should only have a handful of zones and nodes in
> >> practice, so I think it should perform fairly well too.
> >>
> >> We would of course need to see how it performs in practice. If it works,
> >> I think it would be cleaner and simpler than splitting the
> >> initialization into two separate parts.
> >
> > I think your idea is clever and would work. However, consider the
> > cache efficiency: in deserialize_bitmap(), we must write to the head
> > struct page anyway to preserve the order. Since we are already
> > bringing that 64-byte cacheline in and dirtying it, and since memblock
> > is available and fast at this stage, it makes sense to fully
> > initialize the head page right then.
>
> You will also bring in the cache line and dirty it during
> kho_restore_folio() since you need to write the page refcounts. So I
> don't think the cache efficiency makes any difference between the two
> approaches.
>
> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
> > overhead of iterating zones during the restore phase. We can then
> > simply inherit the nid from the head page when initializing the tail
> > pages later.
>
> To get the nid, you would need to call early_pfn_to_nid(). This takes a
> spinlock and searches through all memblock memory regions. I don't think
> it is too expensive, but it isn't free either. And all this would be
> done serially. With the zone search, you at least have some room for
> concurrency.
>
> I think either approach only makes a difference when we have a large
> number of low-order preservations. If we have a handful of high-order
> preservations, I suppose the overhead of nid search would be negligible.
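
Just to check that I am reading the batching idea right, I imagine it would
end up looking roughly like this (completely untested, and kho_init_pages()
is only a placeholder name):

static void kho_init_pages(struct page *page, unsigned long nr_pages)
{
	unsigned long pfn = page_to_pfn(page);
	unsigned long end_pfn = pfn + nr_pages;
	struct zone *zone = NULL;

	while (pfn < end_pfn) {
		/*
		 * Redo the zone walk only when we step outside the zone we
		 * found last time. Like your sketch above, this assumes
		 * every preserved pfn is spanned by some zone.
		 */
		if (!zone || !zone_spans_pfn(zone, pfn)) {
			for_each_zone(zone) {
				if (zone_spans_pfn(zone, pfn))
					break;
			}
		}

		__init_single_page(pfn_to_page(pfn), pfn, zone_idx(zone),
				   zone_to_nid(zone));
		pfn++;
	}
}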

We should be targeting a situation where the vast majority of the preserved
memory is HugeTLB, but I am still worried about the efficiency of lower-order
preservations for IOMMU page tables, etc.
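
For those lower-order preservations, the deferred tail-page init I have in
mind would be along these lines (again just a sketch; kho_init_tail_pages()
is a hypothetical helper, and it assumes the head page was fully initialized
in deserialize_bitmap() while memblock was still available):

static void kho_init_tail_pages(struct page *head, unsigned int order)
{
	unsigned long pfn = page_to_pfn(head);
	unsigned long zone = page_zonenum(head);
	int nid = page_to_nid(head);
	unsigned long i;

	/* Tail pages simply inherit the zone and nid of the head page. */
	for (i = 1; i < (1UL << order); i++)
		__init_single_page(pfn_to_page(pfn + i), pfn + i, zone, nid);
}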

> Long term, I think we should hook this into page_alloc_init_late() so
> that all the KHO pages also get initialized along with all the other
> pages. This will result in better integration of KHO with the rest of
> MM init, and also more consistent page restore performance.

But KHO memory is kept as reserved memory, and hooking it into
page_alloc_init_late() would make it very different, since that path deals
with memory that belongs to the buddy allocator...

> Jason's radix tree patches will make that a bit easier to do I think.
> The zone search will scale better I reckon.

It could. Perhaps early in boot we should reserve the radix tree and use it
as a source of truth for look-ups later in boot?
