Message-ID: <aS6uoFyqF+SdGWlI@yzhao56-desk.sh.intel.com>
Date: Tue, 2 Dec 2025 17:17:20 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Michael Roth <michael.roth@....com>
CC: <kvm@...r.kernel.org>, <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
<ira.weiny@...el.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking
On Mon, Dec 01, 2025 at 05:44:47PM -0600, Michael Roth wrote:
> On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote:
> > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > > {
> > > > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > > > struct folio *folio;
> > > > > - bool is_prepared = false;
> > > > > int r = 0;
> > > > >
> > > > > CLASS(gmem_get_file, file)(slot);
> > > > > if (!file)
> > > > > return -EFAULT;
> > > > >
> > > > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > > > if (IS_ERR(folio))
> > > > > return PTR_ERR(folio);
> > > > >
> > > > > - if (!is_prepared)
> > > > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > > > + if (!folio_test_uptodate(folio)) {
> > > > > + unsigned long i, nr_pages = folio_nr_pages(folio);
> > > > > +
> > > > > + for (i = 0; i < nr_pages; i++)
> > > > > + clear_highpage(folio_page(folio, i));
> > > > > + folio_mark_uptodate(folio);
> > > > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > > > Then, please check my questions at the bottom.
> > >
> > > Yes, in this patch at least where I tried to mirror the current logic. I
> > > would not be surprised if we need to rework things for in-place/hugepage
> > > support though, but decoupling 'preparation' from the uptodate flag is
> > > the main goal here.
> > Could you elaborate a little on why the decoupling is needed if it's not for
> > hugepages?
>
> For instance, for in-place conversion:
>
> 1. initial allocation: clear, set uptodate, fault in as private
> 2. private->shared: call invalidate hook, fault in as shared
> 3. shared->private: call prep hook, fault in as private
>
> Here, 2/3 need to track whether the current state is shared/private in
> order to make appropriate architecture-specific changes (e.g. RMP table
> updates). But we want to allow for non-destructive in-place conversion,
> where a page is 'uptodate', but not in the desired shared/private state.
> So 'uptodate' becomes a separate piece of state, which is still
> reasonable for gmem to track in the current 4K-only implementation, and
> provides for a reasonable approach to upstreaming in-place conversion,
> which isn't far off for either SNP or TDX.
To me, "1. initial allocation: clear, set uptodate" is more appropriate to
be done in kvm_gmem_get_folio(), instead of in kvm_gmem_get_pfn().
With it, below looks reasonable to me.
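
Something roughly like the below is what I have in mind -- just an untested
sketch against the current guest_memfd code to show where the clearing could
live, with the exact placement and details only being my assumption:

	static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
	{
		struct folio *folio;

		folio = filemap_grab_folio(inode->i_mapping, index);
		if (IS_ERR(folio))
			return folio;

		/* Clear once at allocation time and record that via uptodate. */
		if (!folio_test_uptodate(folio)) {
			unsigned long i, nr_pages = folio_nr_pages(folio);

			for (i = 0; i < nr_pages; i++)
				clear_highpage(folio_page(folio, i));
			folio_mark_uptodate(folio);
		}

		return folio;
	}

kvm_gmem_get_pfn() would then only need to invoke kvm_arch_gmem_prepare().
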
> For hugepages, we'll have other things to consider, but those things are
> probably still somewhat far off, and so we shouldn't block steps toward
> in-place conversion based on uncertainty around hugepages. I think it's
> gotten enough attention at least that we know it *can* work, e.g. even
> if we take the inefficient/easy route of zeroing the whole folio on
> initial access, setting it uptodate, and never doing anything with
> uptodate again, it's still a usable implementation.
<...>
> > > Assuming this patch goes upstream in some form, we will now have the
> > > following major differences versus previous code:
> > >
> > > 1) uptodate flag only tracks whether a folio has been cleared
> > > 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and
> > > the architecture can handle its own tracking at whatever granularity
> > > it likes.
> > 2) looks good to me.
> >
> > > My hope is that 1) can similarly be done in such a way that gmem does not
> > > need to track things at sub-hugepage granularity or necessitate some new
> > > data structure/state/flag to track sub-page status.
> > I actually don't understand what the uptodate flag helps gmem to track.
> > Why can't clear_highpage() be done inside arch-specific code? TDX doesn't need
> > this clearing after all.
>
> It could. E.g. via the kernel-internal gmem flag that I mentioned in my
> earlier reply, or some alternative.
>
> In the context of this series, uptodate flag continues to instruct
> kvm_gmem_get_pfn() that it doesn't need to re-clear pages, because
> a prior kvm_gmem_get_pfn() or kvm_gmem_populate() already initialized
> the folio, and it is no longer tied to any notion of
> preparedness-tracking.
>
> What use uptodate will have in the context of hugepages, I'm not sure.
> For non-in-place conversion, it's tempting to just let it continue to be
> per-folio and require clearing the whole folio on initial access, but
> it's not efficient. It may make sense to farm it out to
> post-populate/prep hooks instead, as you're suggesting for TDX.
>
> But then, for in-place conversion, you have to deal with pages initially
> faulted in as shared. They might be split prior to initial access as a
> private page, where we can't assume TDX will have scrubbed things. So in
> that case it might still make sense to rely on it.
>
> Definitely things that require some more thought. But having it inextricably
> tied to preparedness just makes preparation tracking similarly more
> complicated, since it pulls it back into gmem, when that does not seem to be
> the direction any architectures other than SNP have/want to go.
>
> >
> > > My understanding based on prior discussion in guest_memfd calls was that
> > > it would be okay to go ahead and clear the entire folio at initial allocation
> > > time, and basically never mess with it again. It was also my understanding
> > That's the part I don't follow in this patch.
> > I don't see where the entire folio A is cleared if it's only partially mapped by
> > kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either, because
> > kvm_gmem_populate() has already set the uptodate flag.
> >
> > > that for TDX it might even be optimal to completely skip clearing the folio
> > > if it is getting mapped into Secure EPT as a hugepage, since the TDX module
> > > would handle that, but that conversely, after private->shared, there may
> > > be some need to re-clear... I'll try to find that discussion and
> > > refresh. Vishal I believe suggested some flags to provide more control over
> > > this behavior.
> > >
> > > >
> > > > It's possible (at least for TDX) that a huge folio is only partially populated
> > > > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the
> > > > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region,
> > > > while GFN 0x820 is faulted in after the TD is running. However, these two GFNs can
> > > > belong to the same folio of order 9.
> > >
> > > Would the above scheme of clearing the entire folio up front and not re-clearing
> > > at fault time work for this case?
> > This case doesn't affect TDX, because TDX clears private pages internally in
> > the SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage()
> > after making a folio private, it works fine for TDX.
> >
> > I was just trying to understand why SNP needs the entire folio to be cleared in
> > kvm_gmem_get_pfn(), while I don't see how the entire folio is cleared when it's
> > only partially mapped by kvm_gmem_populate().
> > Also, I'm wondering if it would be better if SNP could move the clearing of the
> > folio into something like kvm_arch_gmem_clear(), just like kvm_arch_gmem_prepare(),
> > which is always invoked by kvm_gmem_get_pfn() so that the architecture can handle
> > its own tracking at whatever granularity.
>
> Possibly, but I touched elsewhere on where in-place conversion might
> trip up this approach. At least decoupling them allows for the prep side
> of things to be moved to architecture-specific tracking. We can deal
> with uptodate separately I think.
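
For what it's worth, the hook I had in mind is roughly the below. This is only
a hypothetical sketch; the name kvm_arch_gmem_clear() and the wiring are my
assumption, mirroring how kvm_arch_gmem_prepare() is invoked:

	/*
	 * Default implementation clears the folio; an arch (e.g. TDX, where the
	 * SEAM APIs already scrub private pages) could make this a no-op.
	 */
	int __weak kvm_arch_gmem_clear(struct kvm *kvm, gfn_t gfn, struct folio *folio)
	{
		unsigned long i, nr_pages = folio_nr_pages(folio);

		for (i = 0; i < nr_pages; i++)
			clear_highpage(folio_page(folio, i));
		return 0;
	}

	/* kvm_gmem_get_pfn() would call it unconditionally, like the prepare hook: */
	r = kvm_arch_gmem_clear(kvm, gfn, folio);
	if (!r)
		r = kvm_arch_gmem_prepare(kvm, gfn, *pfn, folio_order(folio));
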
>
> -Mike
>
> >
> >
> > > > Note: the current code should not impact TDX. I'm just asking out of curiosity:)
> > > >
> > > > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/
> > > >
> > > >