Message-ID: <Z9jZRdFyyr1DFkvV@yzhao56-desk.sh.intel.com>
Date: Tue, 18 Mar 2025 10:24:05 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: David Hildenbrand <david@...hat.com>, "Shah, Amit" <Amit.Shah@....com>,
	"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "Roth, Michael"
	<Michael.Roth@....com>, "liam.merwick@...cle.com" <liam.merwick@...cle.com>,
	"seanjc@...gle.com" <seanjc@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "Sampat, Pratik Rajesh"
	<PratikRajesh.Sampat@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "Lendacky, Thomas" <Thomas.Lendacky@....com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "Kalra, Ashish" <Ashish.Kalra@....com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "vannapurve@...gle.com"
	<vannapurve@...gle.com>
Subject: Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness
 tracking changes

On Fri, Mar 14, 2025 at 07:19:33PM +0800, Yan Zhao wrote:
> On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
> > On 14.03.25 10:09, Yan Zhao wrote:
> > > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> > > > (split is possible if there are no unexpected folio references; private
> > > > pages cannot be GUP'ed, so it is feasible)
> > > ...
> > > > > > Note that I'm not quite sure about the "2MB" interface, should it
> > > > > > be a "PMD-size" interface?
> > > > > 
> > > > > I think Mike and I touched upon this aspect too - and I may be
> > > > > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
> > > > > in increments -- and then fitting in PMD sizes when we've had enough of
> > > > > those.  That is to say he didn't want to preclude it, or gate the PMD
> > > > > work on enabling all sizes first.
> > > > 
> > > > Starting with 2M is reasonable for now. The real question is how we want to
> > > > deal with
> > > Hi David,
> > > 
> > 
> > Hi!
> > 
> > > I'm just trying to understand the background of in-place conversion.
> > > 
> > > Regarding the two issues you mentioned with THP and non-in-place conversion,
> > > I have some questions (still based on starting with 2M):
> > > 
> > > > (a) Not being able to allocate a 2M folio reliably
> > > If we start by faulting in private pages from guest_memfd (not in a page
> > > pool way) and faulting in shared pages anonymously, is it correct to say
> > > that this is only a concern when memory is under pressure?
> > 
> > Usually, fragmentation starts being a problem under memory pressure, and
> > memory pressure can show up simply because the page cache makes use of as
> > much memory as it wants.
> > 
> > As soon as we start allocating a 2 MB page for guest_memfd, to then split it
> > up + free only some parts back to the buddy (on private->shared conversion),
> > we create fragmentation that cannot get resolved as long as the remaining
> > private pages are not freed. A new conversion from shared->private on the
> > previously freed parts will allocate other unmovable pages (not the freed
> > ones) and make fragmentation worse.
> Ah, I see. Fragmentation is a problem because memory allocated by guest_memfd
> is unmovable. So even after freeing part of a 2MB folio, the whole 2MB range
> is still unmovable.
> 
> I previously thought fragmentation would only impact the guest by providing
> no new huge pages: if a confidential VM does not support merging small PTEs
> into a huge PMD entry in its private page table, then even if the memory
> range is physically contiguous again after a private->shared->private
> conversion, the guest still cannot bring back huge pages.
> 
> > In-place conversion improves that quite a lot, because guest_memfd itself
> > will not cause unmovable fragmentation. Of course, under memory pressure,
> > when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
> > then, we already had fragmentation (and did not really cause any new one).
> > 
> > We discussed in the upstream call that if guest_memfd (primarily) only
> > allocates 2M pages and frees 2M pages, it will not cause fragmentation
> > itself, which is pretty nice.
> Makes sense.
> 
> > > 
> > > > (b) Partial discarding
> > > For shared pages, are page migration and folio splitting possible for shared THPs?
> > 
> > I assume by "shared" you mean "not guest_memfd, but some other memory we use
> Yes, not guest_memfd, in the case of non-in-place conversion.
> 
> > as an overlay" -- so no in-place conversion.
> > 
> > Yes, that should be possible as long as nothing else prevents
> > migration/split (e.g., long-term pinning).
> > 
> > > 
> > > For private pages, as you pointed out earlier, if we can ensure there are no
> > > unexpected folio references for private memory, splitting a private huge folio
> > > should succeed.
> > 
> > Yes, and maybe (hopefully) we'll reach a point where private parts will not
> > have a refcount at all (initially, frozen refcount, discussed during the
> > last upstream call).
> Yes, I also tested this in TDX by not acquiring the folio refcount in
> TDX-specific code and found that partial splitting could work.
> 
> > > Are you concerned about the memory fragmentation after repeated
> > > partial conversions of private pages to and from shared?
> > 
> > Not only repeated ones; even just a single partial conversion. But of
> > course, repeated partial conversions will make it worse (e.g., never
> > getting a private huge page back when there was a partial conversion).
> Thanks for the explanation!
> 
> Do you think there's any chance for guest_memfd to support non-in-place
> conversion first?
E.g., we can have private pages allocated from guest_memfd and allow those
private pages to be THPs.

Meanwhile, shared pages are not allocated from guest_memfd and are only
faulted in at 4K granularity (specified by a flag, perhaps?).
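
To make that concrete, the userspace side could look roughly like the sketch
below (untested; KVM_GUEST_MEMFD_FLAG_THP is a made-up flag name, and vm_fd
and mem_size are assumed to exist; only the KVM_CREATE_GUEST_MEMFD ioctl and
MADV_NOHUGEPAGE are real):

  #include <linux/kvm.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>

  /* Private memory: guest_memfd, hypothetically allowed to use 2M THPs. */
  struct kvm_create_guest_memfd gmem = {
          .size  = mem_size,
          .flags = KVM_GUEST_MEMFD_FLAG_THP,      /* made-up flag */
  };
  int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

  /* Shared memory: a plain anonymous mapping, kept at 4K faults. */
  void *shared = mmap(NULL, mem_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
  madvise(shared, mem_size, MADV_NOHUGEPAGE);     /* no THPs for shared */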

When we want to convert a 4K page within a 2M private folio to shared, we can
just split the 2M private folio, as there is no extra refcount on private
pages.

When we do a shared-to-private conversion, no split is required, as shared
pages are at 4K granularity. And even if the user fails to restrict shared
pages to small pages only, the worst case is that a 2M shared folio cannot be
split and more memory is consumed.
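
In kernel-side pseudo-code, the two directions would then be asymmetric,
roughly like below (hand-wavy sketch; only split_folio() and
folio_test_large() are existing helpers, the flow around them is made up):

  /* private -> shared: the 4K range may sit inside a 2M private folio */
  if (folio_test_large(private_folio)) {
          /* expected to succeed: private pages hold no unexpected refs */
          if (split_folio(private_folio))
                  return -EAGAIN;
  }
  /* then free the now-4K private subpages; the shared copies are
   * faulted in later as 4K anonymous pages */

  /* shared -> private: shared pages are already 4K anonymous, so there
   * is nothing to split; discard them and fault the range back in from
   * guest_memfd as private */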

Of course, memory fragmentation is still an issue, as the private pages are
allocated as unmovable. But do you think this is a good, simpler starting
point before in-place conversion is ready?

Thanks
Yan
