Message-ID: <aTJXXMKtfcud7ctn@yzhao56-desk.sh.intel.com>
Date: Fri, 5 Dec 2025 11:54:04 +0800
From: Yan Zhao <yan.y.zhao@...el.com>
To: Michael Roth <michael.roth@....com>
CC: <kvm@...r.kernel.org>, <linux-coco@...ts.linux.dev>, <linux-mm@...ck.org>,
	<linux-kernel@...r.kernel.org>, <thomas.lendacky@....com>,
	<pbonzini@...hat.com>, <seanjc@...gle.com>, <vbabka@...e.cz>,
	<ashish.kalra@....com>, <liam.merwick@...cle.com>, <david@...hat.com>,
	<vannapurve@...gle.com>, <ackerleytng@...gle.com>, <aik@....com>,
	<ira.weiny@...el.com>
Subject: Re: [PATCH 1/3] KVM: guest_memfd: Remove preparation tracking

On Wed, Dec 03, 2025 at 07:47:17AM -0600, Michael Roth wrote:
> On Tue, Dec 02, 2025 at 05:17:20PM +0800, Yan Zhao wrote:
> > On Mon, Dec 01, 2025 at 05:44:47PM -0600, Michael Roth wrote:
> > > On Tue, Nov 25, 2025 at 11:13:25AM +0800, Yan Zhao wrote:
> > > > On Fri, Nov 21, 2025 at 06:43:14AM -0600, Michael Roth wrote:
> > > > > On Thu, Nov 20, 2025 at 05:12:55PM +0800, Yan Zhao wrote:
> > > > > > On Thu, Nov 13, 2025 at 05:07:57PM -0600, Michael Roth wrote:
> > > > > > > @@ -797,19 +782,25 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
> > > > > > >  {
> > > > > > >  	pgoff_t index = kvm_gmem_get_index(slot, gfn);
> > > > > > >  	struct folio *folio;
> > > > > > > -	bool is_prepared = false;
> > > > > > >  	int r = 0;
> > > > > > >  
> > > > > > >  	CLASS(gmem_get_file, file)(slot);
> > > > > > >  	if (!file)
> > > > > > >  		return -EFAULT;
> > > > > > >  
> > > > > > > -	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, &is_prepared, max_order);
> > > > > > > +	folio = __kvm_gmem_get_pfn(file, slot, index, pfn, max_order);
> > > > > > >  	if (IS_ERR(folio))
> > > > > > >  		return PTR_ERR(folio);
> > > > > > >  
> > > > > > > -	if (!is_prepared)
> > > > > > > -		r = kvm_gmem_prepare_folio(kvm, slot, gfn, folio);
> > > > > > > +	if (!folio_test_uptodate(folio)) {
> > > > > > > +		unsigned long i, nr_pages = folio_nr_pages(folio);
> > > > > > > +
> > > > > > > +		for (i = 0; i < nr_pages; i++)
> > > > > > > +			clear_highpage(folio_page(folio, i));
> > > > > > > +		folio_mark_uptodate(folio);
> > > > > > Here, the entire folio is cleared only when the folio is not marked uptodate.
> > > > > > Given that, please check my questions at the bottom.
> > > > > 
> > > > > Yes, at least in this patch, where I tried to mirror the current logic. I
> > > > > would not be surprised if we need to rework things for in-place/hugepage
> > > > > support, but decoupling 'preparation' from the uptodate flag is
> > > > > the main goal here.
> > > > Could you elaborate a little on why the decoupling is needed if it's not
> > > > for hugepages?
> > > 
> > > For instance, for in-place conversion:
> > > 
> > >   1. initial allocation: clear, set uptodate, fault in as private
> > >   2. private->shared: call invalidate hook, fault in as shared
> > >   3. shared->private: call prep hook, fault in as private
> > > 
> > > Here, 2/3 need to track whether the current state is shared/private in
> > > order to make appropriate architecture-specific changes (e.g. RMP table
> > > updates). But we want to allow for non-destructive in-place conversion,
> > > where a page is 'uptodate', but not in the desired shared/private state.
> > > So 'uptodate' becomes a separate piece of state, which is still
> > > reasonable for gmem to track in the current 4K-only implementation, and
> > > provides for a reasonable approach to upstreaming in-place conversion,
> > > which isn't far off for either SNP or TDX.
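(Just to check that I'm reading the decoupling correctly, here is a rough
sketch of how I understand the two pieces of state interacting. The helper
name gmem_convert_range() is mine, and I'm only assuming the arch hooks look
roughly like the current SNP ones; none of this is code from this series.)

/*
 * Sketch: 'uptodate' only means "contents have been zeroed once".
 * The shared/private state is tracked separately and drives the
 * arch-specific hooks, so a conversion never re-zeroes the folio.
 */
static int gmem_convert_range(struct kvm *kvm, struct folio *folio,
			      gfn_t gfn, bool to_private)
{
	kvm_pfn_t pfn = folio_pfn(folio);

	if (to_private)
		/* shared -> private: arch prep (e.g. RMP update), no re-zero */
		return kvm_arch_gmem_prepare(kvm, gfn, pfn, folio_order(folio));

	/* private -> shared: tear down the private mapping, contents kept */
	kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
	return 0;
}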
> > To me, "1. initial allocation: clear, set uptodate" would more appropriately
> > be done in kvm_gmem_get_folio() than in kvm_gmem_get_pfn().
> 
> The downside is that preallocation originally involved just
> preallocating, with zero'ing happening lazily at fault time (or upfront
> via KVM_PRE_FAULT_MEMORY).
> 
> But in the context of the hugetlb RFC, it certainly looks cleaner to do
> it there, since it could be done before any potential splitting activity,
> so that all the tail pages can inherit that initial uptodate flag.
> 
> We'd probably want to weigh these trade-offs based on how they will
> affect hugepages, but that would be clearer in the context of a new
> posting of hugepage support on top of these changes. So I think it's
> better to address that change as a follow-up so we can consider it with
> more context.
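(To make sure I follow the alternative being weighed: something along these
lines, where kvm_gmem_get_folio() clears and marks the folio uptodate at
allocation time. This is only a sketch against my reading of kvm/next, not a
proposal.)

static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
{
	struct folio *folio;

	folio = filemap_grab_folio(inode->i_mapping, index);
	if (IS_ERR(folio))
		return folio;

	if (!folio_test_uptodate(folio)) {
		unsigned long i, nr_pages = folio_nr_pages(folio);

		/* zero up front, before any split, so tail pages inherit uptodate */
		for (i = 0; i < nr_pages; i++)
			clear_highpage(folio_page(folio, i));
		folio_mark_uptodate(folio);
	}

	return folio;
}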
> 
> > 
> > With it, below looks reasonable to me.
> > > For hugepages, we'll have other things to consider, but those things are
> > > probably still somewhat far off, and so we shouldn't block steps toward
> > > in-place conversion based on uncertainty around hugepages. I think it's
> > > gotten enough attention at least that we know it *can* work, e.g. even
> > > if we take the inefficient/easy route of zero'ing the whole folio on
> > > initial access, setting it uptodate, and never doing anything with 
> > > uptodate again, it's still a usable implementation.
> 
> To me as well. But in the context of this series, against kvm/next, it
> creates userspace-visible changes to pre-allocation behavior that we
> can't justify on the basis of this series alone, so as above I think
> that's something to save for the hugepage follow-up.
> 
> FWIW though, I ended up taking this approach for the hugetlb-based test
> branch, to address the fact (after you reminded me) that I wasn't fully
> zero'ing folios in the kvm_gmem_populate() path:
> 
>   https://github.com/AMDESE/linux/commit/253fb30b076d81232deba0b02db070d5bc2b90b5
> 
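(For anyone else following along, my understanding of the idea, purely as a
sketch and not the code in the commit above: zero the whole folio, including
pages the source data won't overwrite, before populating. The helper name
gmem_clear_folio() is hypothetical.)

static void gmem_clear_folio(struct folio *folio)
{
	unsigned long i, nr_pages = folio_nr_pages(folio);

	/* clear every page, including those not overwritten by source data */
	for (i = 0; i < nr_pages; i++)
		clear_highpage(folio_page(folio, i));
	folio_mark_uptodate(folio);
}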
> So maybe for your testing you could do similar. For newer hugepage
> support I'll likely do similar as well, but I don't think this reasoning
> or these changes make sense to reviewers of this series who aren't
> already aware of the hugepage plans/development. That's why I'm trying
> to keep this series focused on in-place conversion enablement: the
> hugepage plans might be massively reworked for the next posting based on
> the LPC talks and the changes in direction mentioned in recent
> guest_memfd calls, and at this point we'd basically just be hand-waving
> about what they will look like while blocking other efforts.
> 
Got it. Thanks for the explanation!
