linux-kernel - Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness tracking changes

Message-ID: <Z9QQxd2TfpupOzAk@yzhao56-desk.sh.intel.com>
Date: Fri, 14 Mar 2025 19:19:33 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: David Hildenbrand <david@...hat.com>
CC: "Shah, Amit" <Amit.Shah@....com>, "kvm@...r.kernel.org"
	<kvm@...r.kernel.org>, "Roth, Michael" <Michael.Roth@....com>,
	"liam.merwick@...cle.com" <liam.merwick@...cle.com>, "seanjc@...gle.com"
	<seanjc@...gle.com>, "jroedel@...e.de" <jroedel@...e.de>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>, "Sampat, Pratik Rajesh"
	<PratikRajesh.Sampat@....com>, "linux-kernel@...r.kernel.org"
	<linux-kernel@...r.kernel.org>, "Lendacky, Thomas" <Thomas.Lendacky@....com>,
	"vbabka@...e.cz" <vbabka@...e.cz>, "pbonzini@...hat.com"
	<pbonzini@...hat.com>, "linux-coco@...ts.linux.dev"
	<linux-coco@...ts.linux.dev>, "quic_eberman@...cinc.com"
	<quic_eberman@...cinc.com>, "Kalra, Ashish" <Ashish.Kalra@....com>,
	"ackerleytng@...gle.com" <ackerleytng@...gle.com>, "vannapurve@...gle.com"
	<vannapurve@...gle.com>
Subject: Re: [PATCH RFC v1 0/5] KVM: gmem: 2MB THP support and preparedness
 tracking changes

On Fri, Mar 14, 2025 at 10:33:07AM +0100, David Hildenbrand wrote:
> On 14.03.25 10:09, Yan Zhao wrote:
> > On Wed, Jan 22, 2025 at 03:25:29PM +0100, David Hildenbrand wrote:
> > > (split is possible if there are no unexpected folio references; private
> > > pages cannot be GUP'ed, so it is feasible)
> > ...
> > > > > Note that I'm not quite sure about the "2MB" interface, should it be
> > > > > a
> > > > > "PMD-size" interface?
> > > > 
> > > > I think Mike and I touched upon this aspect too - and I may be
> > > > misremembering - Mike suggested getting 1M, 2M, and bigger page sizes
> > > > in increments -- and then fitting in PMD sizes when we've had enough of
> > > > those.  That is to say he didn't want to preclude it, or gate the PMD
> > > > work on enabling all sizes first.
> > > 
> > > Starting with 2M is reasonable for now. The real question is how we want to
> > > deal with
> > Hi David,
> > 
> 
> Hi!
> 
> > I'm just trying to understand the background of in-place conversion.
> > 
> > Regarding the two issues you mentioned with THP and non-in-place-conversion,
> > I have some questions (still based on starting with 2M):
> > 
> > > (a) Not being able to allocate a 2M folio reliably
> > If we start by faulting in private pages from guest_memfd (not from a page pool)
> > and shared pages anonymously, is it correct to say that this is only a concern
> > when memory is under pressure?
> 
> Usually, fragmentation starts being a problem under memory pressure, and
> memory pressure can show up simply because the page cache makes use of as
> much memory as it wants.
> 
> As soon as we start allocating a 2 MB page for guest_memfd, to then split it
> up + free only some parts back to the buddy (on private->shared conversion),
> we create fragmentation that cannot get resolved as long as the remaining
> private pages are not freed. A new conversion from shared->private on the
> previously freed parts will allocate other unmovable pages (not the freed
> ones) and make fragmentation worse.
Ah, I see. Fragmentation is a problem because memory allocated by guest_memfd
is unmovable. So even after part of a 2MB folio is freed, the 2MB range as a
whole is still unmovable.
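
A toy sketch of this effect, not kernel code. It assumes 4K base pages, models
physical memory as two 2MB blocks, and takes as given the statement above that
a later shared->private conversion allocates other unmovable pages rather than
the freed ones. All names here (free_2m_blocks, convert_round_trip, the "gmem"
owner tag) are made up for illustration.

```python
PAGES_PER_2M = 512  # number of 4 KiB pages in one 2 MiB block


def free_2m_blocks(owner):
    """Count 2 MiB-aligned blocks in which every 4 KiB page is free (None)."""
    return sum(
        all(page is None for page in owner[start:start + PAGES_PER_2M])
        for start in range(0, len(owner), PAGES_PER_2M)
    )


def convert_round_trip(in_place, npages=256):
    """Convert npages private->shared->private; return free 2 MiB blocks left.

    Block 0 starts as one unmovable guest_memfd ("gmem") 2 MiB folio,
    block 1 starts completely free.
    """
    owner = ["gmem"] * PAGES_PER_2M + [None] * PAGES_PER_2M
    if in_place:
        # In-place conversion: only the private/shared state flips, the
        # backing pages stay owned by guest_memfd the whole time.
        state = ["private"] * PAGES_PER_2M
        for i in range(npages):
            state[i] = "shared"
        for i in range(npages):
            state[i] = "private"
    else:
        # private->shared: split the folio and free the converted pages
        # back to the allocator (the shared overlay pages are movable and
        # are not modelled here).
        for i in range(npages):
            owner[i] = None
        # shared->private: assumption from the discussion above, the new
        # unmovable pages are not the ones that were freed, so they land
        # in the other block.
        for i in range(PAGES_PER_2M, PAGES_PER_2M + npages):
            owner[i] = "gmem"
    return free_2m_blocks(owner)


print("non-in-place:", convert_round_trip(in_place=False))
print("in-place:    ", convert_round_trip(in_place=True))
```

In this model the non-in-place round trip leaves unmovable pages in both
blocks, so no free 2MB block remains, while the in-place round trip leaves the
second block untouched.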

I previously thought fragmentation would only impact the guest by providing no
new huge pages. So if a confidential VM does not support merging small PTEs into
a huge PMD entry in its private page table, even if the new huge memory range is
physically contiguous after a private->shared->private conversion, the guest
still cannot bring back huge pages.

> In-place conversion improves that quite a lot, because guest_memfd itself
> will not cause unmovable fragmentation. Of course, under memory pressure,
> when we cannot allocate a 2M page for guest_memfd, it's unavoidable. But
> then, we already had fragmentation (and did not really cause any new one).
> 
> We discussed in the upstream call, that if guest_memfd (primarily) only
> allocates 2M pages and frees 2M pages, it will not cause fragmentation
> itself, which is pretty nice.
Makes sense.

> > 
> > > (b) Partial discarding
> > For shared pages, are page migration and folio split possible for shared THP?
> 
> I assume by "shared" you mean "not guest_memfd, but some other memory we use
Yes, not guest_memfd, in the case of non-in-place conversion.

> as an overlay" -- so no in-place conversion.
> 
> Yes, that should be possible as long as nothing else prevents
> migration/split (e.g., longterm pinning)
> 
> > 
> > For private pages, as you pointed out earlier, if we can ensure there are no
> > unexpected folio references for private memory, splitting a private huge folio
> > should succeed.
> 
> Yes, and maybe (hopefully) we'll reach a point where private parts will not
> have a refcount at all (initially, frozen refcount, discussed during the
> last upstream call).
Yes, I also tested this on TDX: without acquiring a folio refcount in the
TDX-specific code, partial splitting worked.

> > Are you concerned about the memory fragmentation after repeated
> > partial conversions of private pages to and from shared?
> 
> Not only repeated, even just a single partial conversion. But of course,
> repeated partial conversions will make it worse (e.g., never getting a
> private huge page back when there was a partial conversion).
Thanks for the explanation!

Do you think there's any chance for guest_memfd to support non-in-place
conversion first?