Message-ID: <diqz8qfh69uq.fsf@google.com>
Date: Thu, 04 Dec 2025 16:38:53 -0800
From: Ackerley Tng <ackerleytng@...gle.com>
To: Matthew Wilcox <willy@...radead.org>
Cc: akpm@...ux-foundation.org, linux-fsdevel@...r.kernel.org, 
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, david@...hat.com, 
	michael.roth@....com, vannapurve@...gle.com
Subject: Re: [RFC PATCH 0/4] Extend xas_split* to support splitting
 arbitrarily large entries

Ackerley Tng <ackerleytng@...gle.com> writes:

> Matthew Wilcox <willy@...radead.org> writes:
>
>> On Mon, Nov 17, 2025 at 02:46:57PM -0800, Ackerley Tng wrote:
>>> guest_memfd is planning to store huge pages in the filemap, and
>>> guest_memfd's use of huge pages involves splitting of huge pages into
>>> individual pages. Splitting of huge pages also involves splitting of
>>> the filemap entries for the pages being split.
>
>>
>> Hm, I'm not most concerned about the number of nodes you're allocating.
>
> Thanks for reminding me, I left this out of the original message.
>
> Splitting the xarray entry for a 1G folio (in a shift-18 node for
> order=18 on x86), assuming XA_CHUNK_SHIFT is 6, would involve
>
> + shift-18 node (the original node will be reused - no new allocations)
> + shift-12 node: 1 node allocated
> + shift-6 node : 64 nodes allocated
> + shift-0 node : 64 * 64 = 4096 nodes allocated
>
> This brings the total number of allocated nodes to 4161. struct
> xa_node is 576 bytes, so that's 2396736 bytes, or about 2.3 MB. In
> other words, splitting a 1G folio to 4K pages costs ~2.3 MB just in
> filemap (XArray) entry splitting. The other large memory cost would be
> from undoing HVO for the HugeTLB folio.
>

At the guest_memfd biweekly call this morning, we touched on this topic
again. David pointed out that the ~2MB overhead to store a 1G folio in
the filemap seems a little high.

IIUC the above is correct, so even putting splitting aside, without
multi-index XArrays, storing a 1G folio in the filemap would incur this
many nodes in overhead. (Hence multi-index XArrays are great :))

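For reference, here is the arithmetic above written out as a small
standalone sketch (not kernel code; it only reproduces the numbers,
assuming XA_CHUNK_SHIFT == 6, an order-18 entry for the 1G folio, and
sizeof(struct xa_node) == 576, as in the quoted message):

	#include <stdio.h>

	#define XA_CHUNK_SHIFT	6	/* 64 slots per xa_node */

	int main(void)
	{
		unsigned long order = 18, xa_node_size = 576;
		unsigned long nodes = 0, shift;

		/*
		 * Uniformly splitting one order-18 (1G) entry down to
		 * order 0: the original shift-18 node is reused, and at
		 * each lower level (shift 12, 6, 0) we need
		 * 2^(order - XA_CHUNK_SHIFT - shift) new nodes, i.e.
		 * 1, 64 and 4096 respectively.
		 */
		for (shift = order - XA_CHUNK_SHIFT; ; shift -= XA_CHUNK_SHIFT) {
			nodes += 1UL << (order - XA_CHUNK_SHIFT - shift);
			if (!shift)
				break;
		}

		/* 4161 nodes * 576 bytes = 2396736 bytes, ~2.3 MB */
		printf("%lu nodes, %lu bytes\n", nodes, nodes * xa_node_size);
		return 0;
	}
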
>> I'm most concerned that, once we have memdescs, splitting a 1GB page
>> into 512 * 512 4kB pages is going to involve allocating about 20MB
>> of memory (80 bytes * 512 * 512).
>
> I definitely need to catch up on memdescs. What's the best place for me
> to learn/get an overview of how memdescs will describe memory/replace
> struct folios?
>
> I think there might be a better way to solve the original problem of
> usage tracking with memdesc support, but this was intended to make
> progress before memdescs.
>
>> Is this necessary to do all at once?
>
> The plan for guest_memfd was to first split from 1G to 4K, then optimize
> on that by splitting in stages, from 1G to 2M as much as possible, then
> to 4K only for the page ranges that the guest shared with the host.

David asked if splitting from 1G to 2M would remove the need for this
extension patch series. On the call, I wrongly agreed - looking at the
code again, even though the existing code nominally takes the target
order of the split through the xas, it still does not split to the
requested order.

I think some workarounds could be possible, but for the introduction of
guest_memfd HugeTLB with folio restructuring, taking a dependency on
non-uniform splits (splitting 1G into 511 2M folios and 512 4K folios)
adds significant complexity to a single series: in addition to dealing
with non-uniform splits of the folios, we'd also have to deal with
non-uniform HugeTLB vmemmap optimization.

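Spelling out the non-uniform layout: assuming 4K base pages, so a 2M
folio is order 9 and the 1G folio is order 18 as above,

	511 * 2M + 512 * 4K = 1022 MiB + 2 MiB = 1 GiB
	511 * 512 + 512     = 262144 = 512 * 512 base pages

i.e. exactly one of the 512 2M-sized regions of the 1G folio is broken
all the way down to 4K pages while the other 511 remain 2M folios.
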
Hence I'm hoping I can get help reviewing these changes, so that
guest_memfd HugeTLB with non-uniform splits can be handled in a later
stage as an optimization. Besides, David says generalizing this could
help unblock other things (I forget the details, maybe David can chime
in here) :)

Thanks!
