[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aSUe1UfD3hXg2iMZ@yzhao56-desk.sh.intel.com>
Date: Tue, 25 Nov 2025 11:13:25 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Michael Roth <michael.roth@....com>
CC: <kvm@...r.kernel.org>, <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
<ira.weiny@...el.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking
On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > {
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > struct folio *folio;
> > > - bool is_prepared = false;
> > > int r = 0;
> > >
> > > CLASS(gmem_get_file, file)(slot);
> > > if (!file)
> > > return -EFAULT;
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > if (IS_ERR(folio))
> > > return PTR_ERR(folio);
> > >
> > > - if (!is_prepared)
> > > - r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > + if (!folio_test_uptodate(folio)) {
> > > + unsigned long i, nr_pages = folio_nr_pages(folio);
> > > +
> > > + for (i = 0; i < nr_pages; i++)
> > > + clear_highpage(folio_page(folio, i));
> > > + folio_mark_uptodate(folio);
> > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > Then, please check my questions at the bottom
>
> Yes, in this patch at least where I tried to mirror the current logic. I
> would not be surprised if we need to rework things for inplace/hugepage
> support though, but decoupling 'preparation' from the uptodate flag is
> the main goal here.
Could you elaborate a little why the decoupling is needed if it's not for
hugepage?
> > > + }
> > > +
> > > + r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > >
> > > folio_unlock(folio);
> > >
> > > @@ -852,7 +843,6 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > struct folio *folio;
> > > gfn_t gfn = start_gfn + i;
> > > pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > - bool is_prepared = false;
> > > kvm_pfn_t pfn;
> > >
> > > if (signal_pending(current)) {
> > > @@ -860,19 +850,12 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > break;
> > > }
> > >
> > > - folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &is_prepared, &max_order);
> > > + folio = __kvm_gmem_get_pfn(file, slot, index, &pfn, &max_order);
> > > if (IS_ERR(folio)) {
> > > ret = PTR_ERR(folio);
> > > break;
> > > }
> > >
> > > - if (is_prepared) {
> > > - folio_unlock(folio);
> > > - folio_put(folio);
> > > - ret = -EEXIST;
> > > - break;
> > > - }
> > > -
> > > folio_unlock(folio);
> > > WARN_ON(!IS_ALIGNED(gfn, 1 << max_order) ||
> > > (npages - i) < (1 << max_order));
> > TDX could hit this warning easily when npages == 1, max_order == 9.
>
> Yes, this will need to change to handle that. I don't think I had to
> change this for previous iterations of SNP hugepage support, but
> there are definitely cases where a sub-2M range might get populated
> even though it's backed by a 2M folio, so I'm not sure why I didn't
> hit it there.
>
> But I'm taking Sean's cue on touching as little of the existing
> hugepage logic as possible in this particular series so we can revisit
> the remaining changes with some better context.
Frankly, I don't understand why this patch 1 is required if we only want "moving
GUP out of post_populate()" to work for 4KB folios.
> >
> > > @@ -889,7 +872,7 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
> > > p = src ? src + i * PAGE_SIZE : NULL;
> > > ret = post_populate(kvm, gfn, pfn, p, max_order, opaque);
> > > if (!ret)
> > > - kvm_gmem_mark_prepared(folio);
> > > + folio_mark_uptodate(folio);
> > As also asked in [1], why is the entire folio marked as uptodate here? Why does
> > kvm_gmem_get_pfn() clear all pages of a huge folio when the folio isn't marked
> > uptodate?
>
> Quoting your example from[1] for more context:
>
> > I also have a question about this patch:
> >
> > Suppose there's a 2MB huge folio A, where
> > A1 and A2 are 4KB pages belonging to folio A.
> >
> > (1) kvm_gmem_populate() invokes __kvm_gmem_get_pfn() and gets folio A.
> > It adds page A1 and invokes folio_mark_uptodate() on folio A.
>
> In SNP hugepage patchset you responded to, it would only mark A1 as
You mean code in
https://github.com/amdese/linux/commits/snp-inplace-conversion-rfc1 ?
> prepared/cleared. There was 4K-granularity tracking added to handle this.
I don't find the code that marks only A1 as "prepared/cleared".
Instead, I just found folio_mark_uptodate() is invoked by kvm_gmem_populate()
to mark the entire folio A as uptodate.
However, according to your statement below that "uptodate flag only tracks
whether a folio has been cleared", I don't follow why and where the entire folio
A would be cleared if kvm_gmem_populate() only adds page A1.
> There was an odd subtlety in that series though: it was defaulting to the
> folio_order() for the prep-tracking/post-populate, but it would then clamp
> it down based on the max order possible according whether that particular
> order was a homogenous range of KVM_MEMORY_ATTRIBUTE_PRIVATE. Which is not
> a great way to handle things, and I don't remember if I'd actually intended
> to implement it that way or not... that's probably why I never tripped over
> the WARN_ON() above, now that I think of it.
>
> But neither of these these apply to any current plans for hugepage support
> that I'm aware of, so probably not worth working through what that series
> did and look at this from a fresh perspective.
>
> >
> > (2) kvm_gmem_get_pfn() later faults in page A2.
> > As folio A is uptodate, clear_highpage() is not invoked on page A2.
> > kvm_gmem_prepare_folio() is invoked on the whole folio A.
> >
> > (2) could occur at least in TDX when only a part the 2MB page is added as guest
> > initial memory.
> >
> > My questions:
> > - Would (2) occur on SEV?
> > - If it does, is the lack of clear_highpage() on A2 a problem ?
> > - Is invoking gmem_prepare on page A1 a problem?
>
> Assuming this patch goes upstream in some form, we will now have the
> following major differences versus previous code:
>
> 1) uptodate flag only tracks whether a folio has been cleared
> 2) gmem always calls kvm_arch_gmem_prepare() via kvm_gmem_get_pfn() and
> the architecture can handle it's own tracking at whatever granularity
> it likes.
2) looks good to me.
> My hope is that 1) can similarly be done in such a way that gmem does not
> need to track things at sub-hugepage granularity and necessitate the need
> for some new data structure/state/flag to track sub-page status.
I actually don't understand what uptodate flag helps gmem to track.
Why can't clear_highpage() be done inside arch specific code? TDX doesn't need
this clearing after all.
> My understanding based on prior discussion in guest_memfd calls was that
> it would be okay to go ahead and clear the entire folio at initial allocation
> time, and basically never mess with it again. It was also my understanding
That's where I don't follow in this patch.
I don't see where the entire folio A is cleared if it's only partially mapped by
kvm_gmem_populate(). kvm_gmem_get_pfn() won't clear folio A either due to
kvm_gmem_populate() has set the uptodate flag.
> that for TDX it might even be optimal to completely skip clearing the folio
> if it is getting mapped into SecureEPT as a hugepage since the TDX module
> would handle that, but that maybe conversely after private->shared there
> would be some need to reclear... I'll try to find that discussion and
> refresh. Vishal I believe suggested some flags to provide more control over
> this behavior.
>
> >
> > It's possible (at least for TDX) that a huge folio is only partially populated
> > by kvm_gmem_populate(). Then kvm_gmem_get_pfn() faults in another part of the
> > huge folio. For example, in TDX, GFN 0x81f belongs to the init memory region,
> > while GFN 0x820 is faulted after TD is running. However, these two GFNs can
> > belong to the same folio of order 9.
>
> Would the above scheme of clearing the entire folio up front and not re-clearing
> at fault time work for this case?
This case doesn't affect TDX, because TDX clearing private pages internally in
SEAM APIs. So, as long as kvm_gmem_get_pfn() does not invoke clear_highpage()
after making a folio private, it works fine for TDX.
I was just trying to understand why SNP needs the clearing of entire folio in
kvm_gmem_get_pfn() while I don't see how the entire folio is cleared when it's
partially mapped in kvm_gmem_populate().
Also, I'm wondering if it would be better if SNP could move the clearing of
folio into something like kvm_arch_gmem_clear(), just as kvm_arch_gmem_prepare()
which is always invoked by kvm_gmem_get_pfn() and the architecture can handle
it's own tracking at whatever granularity.
> > Note: the current code should not impact TDX. I'm just asking out of curiosity:)
> >
> > [1] https://lore.kernel.org/all/aQ3uj4BZL6uFQzrD@yzhao56-desk.sh.intel.com/
> >
> >
Powered by blists - more mailing lists