Message-ID: <863452cwns.fsf@kernel.org>
Date: Mon, 22 Dec 2025 17:24:07 +0100
From: Pratyush Yadav <pratyush@...nel.org>
To: Pasha Tatashin <pasha.tatashin@...een.com>
Cc: Pratyush Yadav <pratyush@...nel.org>,  Mike Rapoport <rppt@...nel.org>,
  Evangelos Petrongonas <epetron@...zon.de>,  Alexander Graf
 <graf@...zon.com>,  Andrew Morton <akpm@...ux-foundation.org>,  Jason Miu
 <jasonmiu@...gle.com>,  linux-kernel@...r.kernel.org,
  kexec@...ts.infradead.org,  linux-mm@...ck.org,
  nh-open-source@...zon.com
Subject: Re: [PATCH] kho: add support for deferred struct page init

On Mon, Dec 22 2025, Pasha Tatashin wrote:

>> > NUMA node boundaries are SECTION_SIZE aligned. Since SECTION_SIZE is
>> > larger than MAX_PAGE_ORDER it is mathematically impossible for a
>> > single chunk to span multiple nodes.
>>
>> For folios, yes. The whole folio should only be in a single node. But we
>> also have kho_preserve_pages() (formerly kho_preserve_phys()) which can
>> be used to preserve an arbitrary size of memory and _that_ doesn't have
>> to be in the same section. And if the memory is properly aligned, then
>> it will end up being just one higher-order preservation in KHO.
>
> To restore both pages and folios we use kho_restore_page(), which has
> the following:
>
> 	/*
> 	 * deserialize_bitmap() only sets the magic on the head page. This magic
> 	 * check also implicitly makes sure phys is order-aligned since for
> 	 * non-order-aligned phys addresses, magic will never be set.
> 	 */
> 	if (WARN_ON_ONCE(info.magic != KHO_PAGE_MAGIC || info.order > MAX_PAGE_ORDER))
> 		return NULL;

See my patch that drops this restriction:
https://lore.kernel.org/linux-mm/20251206230222.853493-2-pratyush@kernel.org/

I think it was wrong to add it in the first place.

>
> My understanding is that the head page can never be more than
> MAX_PAGE_ORDER, hence why I am saying it will be less than SECTION_SIZE.
> With HugeTLB the order can be more than MAX_PAGE_ORDER, but in that case
> it still has to be within a single NID, since a huge page cannot be split
> across multiple nodes.

For a "proper" page/folio, one that either comes from the page allocator
or from HugeTLB, you are right. But see again how kho_preserve_pages()
works:

	while (pfn < end_pfn) {
		const unsigned int order =
			min(count_trailing_zeros(pfn), ilog2(end_pfn - pfn));
	
		err = __kho_preserve_order(track, pfn, order);
		[...]

It combines contiguous order-aligned pages into one KHO preservation.

So say I have two nodes, each 64G. If I call kho_preserve_pages() for
62G to 66G, I will get _one_ 4G preservation at 62G. kho_restore_page()
will split it into 0-order pages on restore.
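
In case it is useful, here is a rough userspace sketch of that coalescing
(preserve_chunks() and the numbers in main() are made up purely for
illustration; it just mirrors the min(alignment, remaining size) logic in
the loop above):

	#include <stdio.h>

	/* emulate kho_preserve_pages()'s splitting of [pfn, end_pfn) */
	static void preserve_chunks(unsigned long long pfn, unsigned long long end_pfn)
	{
		while (pfn < end_pfn) {
			/* largest order that is both aligned at pfn and fits the range */
			unsigned int align = pfn ? __builtin_ctzll(pfn) : 63;
			unsigned int fit = 63 - __builtin_clzll(end_pfn - pfn);
			unsigned int order = align < fit ? align : fit;

			printf("preserve pfn %#llx order %u\n", pfn, order);
			pfn += 1ULL << order;
		}
	}

	int main(void)
	{
		/* a 16K-pfn range whose start is only 8K-pfn aligned */
		preserve_chunks(0x3e000, 0x42000);
		return 0;
	}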

>
>> >> > This approach seems to give us the best of both worlds: It avoids the
>> >> > memblock dependency during restoration. It keeps the serial work in
>> >> > deserialize_bitmap() to a minimum (O(1) per region). It allows the
>> >> > heavy lifting of tail page initialization to be done later in the boot
>> >> > process, potentially in parallel, as you suggested.
>> >>
>> >> Here's another idea I have been thinking about, but never dug deep
>> >> enough to figure out if it actually works.
>> >>
>> >> __init_page_from_nid() loops through all the zones for the node to find
>> >> the zone id for the page. We can flip it the other way round and loop
>> >> through all zones (on all nodes) to find out if the PFN spans that zone.
>> >> Once we find the zone, we can directly call __init_single_page() on it.
>> >> If a contiguous chunk of preserved memory lands in one zone, we can
>> >> batch the init to save some time.
>> >>
>> >> Something like the below (completely untested):
>> >>
>> >>
>> >>         static void kho_init_page(struct page *page)
>> >>         {
>> >>                 unsigned long pfn = page_to_pfn(page);
>> >>                 struct zone *zone;
>> >>
>> >>                 for_each_zone(zone) {
>> >>                         if (zone_spans_pfn(zone, pfn))
>> >>                                 break;
>> >>                 }
>> >>
>> >>                 __init_single_page(page, pfn, zone_idx(zone), zone_to_nid(zone));
>> >>         }
>> >>
>> >> It doesn't do the batching I mentioned, but I think it at least gets the
>> >> point across. And I think even this simple version would be a good first
>> >> step.
>> >>
>> >> This lets us initialize the page from kho_restore_folio() without having
>> to rely on memblock being alive, and saves us from doing work during
>> >> early boot. We should only have a handful of zones and nodes in
>> >> practice, so I think it should perform fairly well too.
>> >>
>> >> We would of course need to see how it performs in practice. If it works,
>> >> I think it would be cleaner and simpler than splitting the
>> >> initialization into two separate parts.
>> >
>> > I think your idea is clever and would work. However, consider the
>> > cache efficiency: in deserialize_bitmap(), we must write to the head
>> > struct page anyway to preserve the order. Since we are already
>> > bringing that 64-byte cacheline in and dirtying it, and since memblock
>> > is available and fast at this stage, it makes sense to fully
>> > initialize the head page right then.
>>
>> You will also bring in the cache line and dirty it during
>> kho_restore_folio() since you need to write the page refcounts. So I
>> don't think the cache efficiency makes any difference between either
>> approach.
>>
>> > If we do that, we get the nid for "free" (cache-wise) and we avoid the
>> > overhead of iterating zones during the restore phase. We can then
>> > simply inherit the nid from the head page when initializing the tail
>> > pages later.
>>
>> To get the nid, you would need to call early_pfn_to_nid(). This takes a
>> spinlock and searches through all memblock memory regions. I don't think
>> it is too expensive, but it isn't free either. And all this would be
>> done serially. With the zone search, you at least have some room for
>> concurrency.
>>
>> I think either approach only makes a difference when we have a large
>> number of low-order preservations. If we have a handful of high-order
>> preservations, I suppose the overhead of nid search would be negligible.
>
> We should be targeting a situation where the vast majority of the
> preserved memory is HugeTLB, but I am still worried about lower-order
> preservation efficiency for IOMMU page tables, etc.

Yep. Plus we might get VMMs stashing some of their state in a memfd too.

>
>> Long term, I think we should hook this into page_alloc_init_late() so
>> that all the KHO pages also get initialized along with all the other
>> pages. This will result in better integration of KHO with the rest of MM
>> init, and also give more consistent page restore performance.
>
> But we keep KHO as reserved memory, and hooking it up into
> page_alloc_init_late() would make it very different, since that memory
> is part of the buddy allocator memory...

The idea I have is to have a separate call in page_alloc_init_late()
that initializes KHO pages. It would traverse the radix tree (probably in
parallel by distributing the address space across multiple threads?) and
initialize all the pages. Then kho_restore_page() would only have to
double-check the magic and directly return the page.

The radix tree makes parallelism easier than the linked lists we have now.
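
To make it a bit more concrete, something like the below (completely
untested; kho_for_each_preserved_range() is a hypothetical iterator
standing in for whatever the radix tree ends up providing):

	/* Called from page_alloc_init_late(), after deferred init has run. */
	static void __init kho_init_preserved_pages(void)
	{
		unsigned long pfn, end_pfn;

		kho_for_each_preserved_range(pfn, end_pfn) {
			struct zone *zone;

			/*
			 * Look the zone up once per range; a real version would
			 * have to cope with a range crossing zone boundaries.
			 */
			for_each_zone(zone)
				if (zone_spans_pfn(zone, pfn))
					break;

			for (; pfn < end_pfn; pfn++)
				__init_single_page(pfn_to_page(pfn), pfn,
						   zone_idx(zone), zone_to_nid(zone));
		}
	}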

>
>> Jason's radix tree patches will make that a bit easier to do I think.
>> The zone search will scale better I reckon.
>
> It could. Perhaps early in boot we should reserve the radix tree, and
> use it as a source of truth for look-ups later in boot?

Yep. I think the radix tree should mark its own pages as preserved too
so they stick around later in boot.

-- 
Regards,
Pratyush Yadav
