linux-kernel - Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness tracking changes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <11280705-bcb1-4a5e-a689-b8a5f8a0a9a6@redhat.com>
Date: Fri, 20 Dec 2024 12:31:25 +0100
From: David Hildenbrand <david@...hat.com>
To: Michael Roth <michael.roth@....com>, kvm@...r.kernel.org
Cc: linux-coco@...ts.linux.dev, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org, jroedel@...e.de, thomas.lendacky@....com,
 pbonzini@...hat.com, seanjc@...gle.com, vbabka@...e.cz, amit.shah@....com,
 pratikrajesh.sampat@....com, ashish.kalra@....com, liam.merwick@...cle.com,
 vannapurve@...gle.com, ackerleytng@...gle.com, quic_eberman@...cinc.com
Subject: Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness
 tracking changes

On 12.12.24 07:36, Michael Roth wrote:
> This patchset is also available at:
> 
>    https://github.com/amdese/linux/commits/snp-prepare-thp-rfc1
> 
> and is based on top of Paolo's kvm-coco-queue-2024-11 tag which includes
> a snapshot of his patches[1] to provide tracking of whether or not
> sub-pages of a huge folio need to have kvm_arch_gmem_prepare() hooks issued
> before guest access:
> 
>    d55475f23cea KVM: gmem: track preparedness a page at a time
>    64b46ca6cd6d KVM: gmem: limit hole-punching to ranges within the file
>    17df70a5ea65 KVM: gmem: add a complete set of functions to query page preparedness
>    e3449f6841ef KVM: gmem: allocate private data for the gmem inode
> 
>    [1] https://lore.kernel.org/lkml/20241108155056.332412-1-pbonzini@redhat.com/
> 
> This series addresses some of the pending review comments for those patches
> (feel free to squash/rework as-needed), and implements a first real user in
> the form of a reworked version of Sean's original 2MB THP support for gmem.
> 
> It is still a bit up in the air as to whether or not gmem should support
> THP at all rather than moving straight to 2MB/1GB hugepages in the form of
> something like HugeTLB folios[2] or the lower-level PFN range allocator
> presented by Yu Zhao during the guest_memfd call last week. The main
> arguments against THP, as I understand it, is that THPs will become
> split over time due to hole-punching and rarely have an opportunity to get
> rebuilt due to lack of memory migration support for current CoCo hypervisor
> implementations like SNP (and adding the migration support to resolve that
> not necessarily resulting in a net-gain performance-wise). The current
> plan for SNP, as discussed during the first guest_memfd call, is to
> implement something similar to 2MB HugeTLB, and disallow hole-punching
> at sub-2MB granularity.
> 
> However, there have also been some discussions during recent PUCK calls
> where the KVM maintainers have some still expressed some interest in pulling
> in gmem THP support in a more official capacity. The thinking there is that
> hole-punching is a userspace policy, and that it could in theory avoid
> holepunching for sub-2MB GFN ranges to avoid degradation over time.
> And if there's a desire to enforce this from the kernel-side by blocking
> sub-2MB hole-punching from the host-side, this would provide similar
> semantics/behavior to the 2MB HugeTLB-like approach above.
> 
> So maybe there is still some room for discussion about these approaches.
> 
> Outside that, there are a number of other development areas where it would
> be useful to at least have some experimental 2MB support in place so that
> those efforts can be pursued in parallel, such as the preparedness
> tracking touched on here, and exploring how that will intersect with other
> development areas like using gmem for both shared and private memory, mmap
> support, guest_memfd library, etc., so my hopes are that this approach
> could be useful for that purpose at least, even if only as an out-of-tree
> stop-gap.
> 
> Thoughts/comments welcome!

Sorry for the late reply, it's been a couple of crazy weeks, and I'm 
trying to give at least some feedback on stuff in my inbox before even 
more will pile up over Christmas :) . Let me summarize my thoughts:

THPs in Linux rely on the following principle:

(1) We try allocating a THP, if that fails we rely on khugepaged to fix
     it up later (shmem+anon). So id we cannot grab a free THP, we
     deffer it to a later point.

(2) We try to be as transparent as possible: punching a hole will
     usually destroy the THP (either immediately for shmem/pagecache or
     deferred for anon memory) to free up the now-free pages. That's
     different to hugetlb, where partial hole-punching will always zero-
     out the memory only; the partial memory will not get freed up and
     will get reused later.

     Destroying a THP for shmem/pagecache only works if there are no
     unexpected page references, so there can be cases where we fail to
     free up memory. For the pagecache that's not really
     an issue, because memory reclaim will fix that up at some point. For
     shmem, there  were discussions to do scan for 0ed pages and free
     them up during memory reclaim, just like we do now for anon memory
      as well.

(3) Memory compaction is vital for guaranteeing that we will be able to
     create THPs the longer the system was running,


With guest_memfd we cannot rely on any daemon to fix it up as in (1) for 
us later (would require page memory migration support).

We use truncate_inode_pages_range(), which will split a THP into small 
pages if you partially punch-hole it, so (2) would apply; splitting 
might fail as well in some cases if there are unexpected references.

I wonder what would happen if user space would punch a hole in private 
memory, making truncate_inode_pages_range() overwrite it with 0s if 
splitting the THP failed (memory write to private pages under TDX?). 
Maybe something similar would happen if a private page would get 0-ed 
out when freeing+reallocating it, not sure how that is handled.


guest_memfd currently actively works against (3) as soon as we (A) 
fallback to allocating small pages or (B) split a THP due to hole 
punching, as the remaining fragments cannot get reassembled anymore.

I assume there is some truth to "hole-punching is a userspace policy", 
but this mechanism will actively work against itself as soon as you 
start falling back to small pages in any way.



So I'm wondering if a better start would be to (A) always allocate huge 
pages from the buddy (no fallback) and (B) partial punches are either 
disallowed or only zero-out the memory. But even a sequence of partial 
punches that cover the whole huge page will not end up freeing all parts 
if splitting failed at some point, which I quite dislike ...

But then we'd need memory preallocation, and I suspect to make this 
really useful -- just like with 2M/1G "hugetlb" support -- in-place 
shared<->private conversion will be a requirement. ... at which point 
we'd have reached the state where it's almost the 2M hugetlb support.


This is not a very strong push back, more a "this does not quite sound 
right to me" and I have the feeling that this might get in the way of 
in-place shared<->private conversion; I might be wrong about the latter 
though.

With memory compaction working for guest_memfd, it would all be easier.

Note that I'm not quite sure about the "2MB" interface, should it be a 
"PMD-size" interface?

-- 
Cheers,

David / dhildenb