linux-kernel - Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251201234447.5x62vhlg3d5mmtpw@amd.com>
Date: Mon, 1 Dec 2025 17:44:47 -0600
From: Michael Roth <michael.roth@....com>
To: Yan Zhao <yan.y.zhao@...el.com>
CC: <kvm@...r.kernel.org>, <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
	<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
	<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
	<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
	<ira.weiny@...el.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking

On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote:
> On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > >  {
> > > >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > >  	struct folio *folio;
> > > > -	bool is_prepared = false;
> > > >  	int r = 0;
> > > >  
> > > >  	CLASS(gmem_get_file, file)(slot);
> > > >  	if (!file)
> > > >  		return -EFAULT;
> > > >  
> > > > -	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > > +	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > >  	if (IS_ERR(folio))
> > > >  		return PTR_ERR(folio);
> > > >  
> > > > -	if (!is_prepared)
> > > > -		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > > +	if (!folio_test_uptodate(folio)) {
> > > > +		unsigned long i, nr_pages = folio_nr_pages(folio);
> > > > +
> > > > +		for (i = 0; i < nr_pages; i++)
> > > > +			clear_highpage(folio_page(folio, i));
> > > > +		folio_mark_uptodate(folio);
> > > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > > Then, please check my questions at the bottom
> > 
> > Yes, in this patch at least where I tried to mirror the current logic. I
> > would not be surprised if we need to rework things for inplace/hugepage
> > support though, but decoupling 'preparation' from the uptodate flag is
> > the main goal here.
> Could you elaborate a little why the decoupling is needed if it's not for
> hugepage?

For instance, for in-place conversion:

  1. initial allocation: clear, set uptodate, fault in as private
  2. private->shared: call invalidate hook, fault in as shared
  3. shared->private: call prep hook, fault in as private

Here, 2/3 need to track where the current state is shared/private in
order to make appropriate architecture-specific changes (e.g. RMP table
updates). But we want to allow for non-destructive in-place conversion,
where a page is 'uptodate', but not in the desired shared/private state.
So 'uptodate' becomes a separate piece of state, which is still
reasonable for gmem to track in the current 4K-only implementation, and
provides for a reasonable approach to upstreaming in-place conversion,
which isn't far off for either SNP or TDX.

For hugepages, we'll have other things to consider, but those things are
probably still somewhat far off, and so we shouldn't block steps toward
in-place conversion based on uncertainty around hugepages. I think it's
gotten enough attention at least that we know it *can* work, e.g. even
if we take the inefficient/easy route of zero'ing the whole folio on
initial access, setting it uptodate, and never doing anything with 
uptodate again, it's still a usable implementation.

> 
> 
> > > > +	}
> > > > +
> > > > +	r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > >  
> > > >  	folio_unlock(folio);
> > > >  
> > > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > >  		struct folio *folio;
> > > >  		gfn_t gfn = start_gfn + i;
> > > >  		pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > > -		bool is_prepared = false;
> > > >  		kvm_pfn_t pfn;
> > > >  
> > > >  		if (signal_pending(current)) {
> > > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > > +		folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
> > > >  		if (IS_ERR(folio)) {
> > > >  			ret = PTR_ERR(folio);
> > > >  			break;
> > > >  		}
> > > >  
> > > > -		if (is_prepared) {
> > > > -			folio_unlock(folio);
> > > > -			folio_put(folio);
> > > > -			ret = -EEXIST;
> > > > -			break;
> > > > -		}
> > > > -
> > > >  		folio_unlock(folio);
> > > >  		WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
> > > >  			(npages - i) < (1 << max_order));
> > > TDX could hit this warning easily when npages == 1, max_order == 9.
> > 
> > Yes, this will need to change to handle that. I don't think I had to
> > change this for previous iterations of SNP hugepage support, but
> > there are definitely cases where a sub-2M range might get populated 
> > even though it's backed by a 2M folio, so I'm not sure why I didn't
> > hit it there.
> > 
> > But I'm taking Sean's cue on touching as little of the existing
> > hugepage logic as possible in this particular series so we can revisit
> > the remaining changes with some better context.
> Frankly, I don't understand why this patch 1 is required if we only want "moving
> GUP out of post_populate()" to work for 4KB folios.

Above I outlined one of the use-cases for in-place conversion.

During the 2 PUCK sessions prior to this RFC, Sean also mentioned some
potential that other deadlocks might exist in current code due to
how the locking is currently handled, and that we should consider this
as a general cleanup against current kvm/next, but I leave that to Sean
to elaborate on.

Personally I think this series makes sense against kvm/next regardless:
tracking preparation in gmem is basically already broken: everyone ignores
it except SNP, so it was never performing that duty as-designed. So we
are now simplying uptodate flag to no longer include this extra
state-tracking, and leaving it for architecture-specific tracking. I
can't see that be anything but beneficial to future gmem changes.

> 
> > > 
> > > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > >  		p = src ? src + i * PAGE_SIZE : NULL;
> > > >  		ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
> > > >  		if (!ret)
> > > > -			kvm_gmem_mark_prepared(folio);
> > > > +			folio_mark_uptodate(folio);
> > > As also asked in [1], why is the entire folio marked as uptodate here? Why does
> > > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked
> > > uptodate?
> > 
> > Quoting your example from[1] for more context:
> > 
> > > I also have a question about this patch:
> > > 
> > > Suppose there's a 2MB huge folio A, where
> > > A1 and A2 are 4KB pages belonging to folio A.
> > > 
> > > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A.
> > >     It adds page A1 and invokes folio_mark_uptodate() on folio A.
> > 
> > In SNP hugepage patchset you responded to, it would only mark A1 as
> You mean code in
> https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ?

No, sorry, I got my references mixed up. The only publically-posted
version of SNP hugepage support is the THP series that does not involve
in-place conversion, and that's what I was referencing. It's there where
per-4K bitmap was added to track preparation, and in that series
page-clearing/preparation are still coupled to some degree so per-4K
tracking of page-clearing was still possible and that's how it was
handled:

  https://github.com/AMDESE/linux/blob/snp-prepare-thp-rfc1/virt/kvm/guest_memfd.c#L992

but that can be considered an abandoned approach so I wouldn't spend
much time referencing that.

> 
> > prepared/cleared. There was 4K-granularity tracking added to handle this.
> I don't find the code that marks only A1 as "prepared/cleared".
> Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate()
> to mark the entire folio A as uptodate.
> 
> However, according to your statement below that "uptodate flag only tracks
> whether a folio has been cleared", I don't follow why and where the entire folio
> A would be cleared if kvm_gmem_populate() only adds page A1.
> 
> > There was an odd subtlety in that series though: it was defaulting to the
> > folio_order() for the prep-tracking/post-populate, but it would then clamp
> > it down based on the max order possible according whether that particular
> > order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not
> > a great way to handle things, and I don't remember if I'd actually intended
> > to implement it that way or not... that's probably why I never tripped over
> > the WARN_ON() above, now that I think of it.
> > 
> > But neither of these these apply to any current plans for hugepage support
> > that I'm aware of, so probably not worth working through what that series
> > did and look at this from a fresh perspective.
> > 
> > > 
> > > (2) kvm_gmem_get_pfn() later faults in page A2.
> > >     As folio A is uptodate, clear_highpage() is not invoked on page A2.
> > >     kvm_gmem_prepare_folio() is invoked on the whole folio A.
> > > 
> > > (2) could occur at least in TDX when only a part the 2MB page is added as guest
> > > initial memory.
> > > 
> > > My questions:
> > > - Would (2) occur on SEV?
> > > - If it does, is the lack of clear_highpage() on A2 a problem ?
> > > - Is invoking gmem_prepare on page A1 a problem?
> > 
> > Assuming this patch goes upstream in some form, we will now have the
> > following major differences versus previous code:
> > 
> >   1) uptodate flag only tracks whether a folio has been cleared
> >   2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and
> >      the architecture can handle it's own tracking at whatever granularity
> >      it likes.
> 2) looks good to me.
> 
> > My hope is that 1) can similarly be done in such a way that gmem does not
> > need to track things at sub-hugepage granularity and necessitate the need
> > for some new data structure/state/flag to track sub-page status.
> I actually don't understand what uptodate flag helps gmem to track.
> Why can't clear_highpage() be done inside arch specific code? TDX doesn't need
> this clearing after all.

It could. E.g. via the kernel-internal gmem flag that I mentioned in my
earlier reply, or some alternative. 

In the context of this series, uptodate flag continues to instruct
kvm_gmem_get_pfn() that it doesn't not need to re-clear pages, because
a prior kvm_gmem_get_pfn() or kvm_gmem_populate() already initialized
the folio, and it is no longer tied to any notion of
preparedness-tracking.

What use uptodate will have in the context of hugepages: I'm not sure.
For non-in-place conversion, it's tempting to just let it continue to be
per-folio and require clearing the whole folio on initial access, but
it's not efficient. It may make sense to farm it out to
post-populate/prep hooks instead, as you're suggesting for TDX.

But then, for in-place conversion, you have to deal with pages initially
faulted in as shared. They might be split prior to initial access as a
private page, where we can't assume TDX will have scrubbed things. So in
that case it might still make sense to rely on it.

Definitely things that require some more thought. But having it inextricably
tied to preparedness just makes preparation tracking similarly more
complicated as it pulls it back into gmem when that does not seem to be
the direction any architectures other SNP have/want to go.

> 
> > My understanding based on prior discussion in guest_memfd calls was that
> > it would be okay to go ahead and clear the entire folio at initial allocation
> > time, and basically never mess with it again. It was also my understanding
> That's where I don't follow in this patch.
> I don't see where the entire folio A is cleared if it's only partially mapped by
> kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to
> kvm_gmem_populate() has set the uptodate flag.
> 
> > that for TDX it might even be optimal to completely skip clearing the folio
> > if it is getting mapped into SecureEPT as a hugepage since the TDX module
> > would handle that, but that maybe conversely after private->shared there
> > would be some need to reclear... I'll try to find that discussion and
> > refresh. Vishal I believe suggested some flags to provide more control over
> > this behavior.
> > 
> > > 
> > > It's possible (at least for TDX) that a huge folio is only partially populated
> > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the
> > > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region,
> > > while GFN 0x820 is faulted after TD is running. However, these two GFNs can
> > > belong to the same folio of order 9.
> > 
> > Would the above scheme of clearing the entire folio up front and not re-clearing
> > at fault time work for this case?
> This case doesn't affect TDX, because TDX clearing private pages internally in
> SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage()
> after making a folio private, it works fine for TDX.
> 
> I was just trying to understand why SNP needs the clearing of entire folio in
> kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's
> partially mapped in kvm_gmem_populate().
> Also, I'm wondering if it would be better if SNP could move the clearing of
> folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare()
> which is always invoked by kvm_gmem_get_pfn() and the architecture can handle
> it's own tracking at whatever granularity.

Possibly, but I touched elsewhere on where in-place conversion might
trip up this approach. At least decoupling them allows for the prep side
of things to be moved to architecture-specific tracking. We can deal
with uptodate separately I think.

-Mike

> 
>  
> > > Note: the current code should not impact TDX. I'm just asking out of curiosity:)
> > > 
> > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/
> > > 
> > >