Message-ID: <CAGtprH9ELoYmwA+brSx-kWH5qSK==u8huW=4otEZ5evu_GTvtQ@mail.gmail.com>
Date: Mon, 4 Aug 2025 18:20:22 -0700
From: Vishal Annapurve <vannapurve@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Ira Weiny <ira.weiny@...el.com>, Yan Zhao <yan.y.zhao@...el.com>,
Michael Roth <michael.roth@....com>, pbonzini@...hat.com, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org, rick.p.edgecombe@...el.com, kai.huang@...el.com,
adrian.hunter@...el.com, reinette.chatre@...el.com, xiaoyao.li@...el.com,
tony.lindgren@...el.com, binbin.wu@...ux.intel.com, dmatlack@...gle.com,
isaku.yamahata@...el.com, david@...hat.com, ackerleytng@...gle.com,
tabba@...gle.com, chao.p.peng@...el.com
Subject: Re: [RFC PATCH] KVM: TDX: Decouple TDX init mem region from kvm_gmem_populate()
On Mon, Aug 4, 2025 at 5:22 PM Sean Christopherson <seanjc@...gle.com> wrote:
> > > > > IIUC, the suggestion in the link is to abandon kvm_gmem_populate().
> > > > > For TDX, it means adopting the approach in this RFC patch, right?
> > > > Yes, IMO this RFC is following the right approach as posted.
>
> I don't think we want to abandon kvm_gmem_populate(). Unless I'm missing something,
> SNP has the same AB-BA problem as TDX. The copy_from_user() on @src can trigger
> a page fault, and resolving the page fault may require taking mm->mmap_lock.
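To make sure I follow, the inversion looks roughly like this (a rough
sketch based on my reading of the code, so take the exact call chains
with a grain of salt):

    kvm_gmem_populate():              fault path on a gmem mapping:
      filemap_invalidate_lock()         mmap_read_lock()
      copy_from_user(@src)              ...
        -> fault on @src                  filemap_invalidate_lock()
          -> mmap_read_lock()

i.e. invalidate_lock -> mmap_lock on one side vs. mmap_lock ->
invalidate_lock on the other.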
>
> Fundamentally, TDX and SNP are doing the same thing: copying from source to guest
> memory. The only differences are in the mechanics of the copy+encrypt, everything
> else is the same. I.e. I don't expect that we'll find a magic solution that works
> well for one and not the other.
>
> I also don't want to end up with wildly different ABI for SNP vs. everything else.
> E.g. cond_resched() needs to be called if the to-be-initialized range is large,
> which means dropping mmu_lock between pages, whereas kvm_gmem_populate() can
> yield without dropping invalidate_lock, which means that the behavior of populating
> guest_memfd memory will be quite different with respect to guest_memfd operations.
I would think that TDX/CCA VMs [1] will run into similar behavior of
needing to simulate stage2 faults, i.e. KVM will end up taking and
dropping mmu_lock for each page anyway, at least for these two
platforms.
[1] https://lore.kernel.org/kvm/20250611104844.245235-5-steven.price@arm.com/
(rmi_rtt_create())
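In other words, something like the following loop shape seems hard to
avoid for large ranges (a hand-wavy sketch; arch_populate_one_page()
is a placeholder, not a real KVM helper):

	/* Take and drop mmu_lock per page so that cond_resched() can
	 * be honored when populating large ranges.
	 */
	for (gfn = start; gfn < end; gfn++, src += PAGE_SIZE) {
		read_lock(&kvm->mmu_lock);
		r = arch_populate_one_page(kvm, gfn, src); /* hypothetical */
		read_unlock(&kvm->mmu_lock);
		if (r)
			return r;
		cond_resched();
	}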
>
> Pulling in the RFC text:
>
> : I think the only different scenario is SNP, where the host must write
> : initial contents to guest memory.
> :
> : Will this work for all cases CCA/SNP/TDX during initial memory
> : population from within KVM:
> : 1) Simulate stage2 fault
> : 2) Take a KVM mmu read lock
>
> Doing all of this under mmu_lock is pretty much a non-starter.
>
> : 3) Check that the needed gpa is mapped in EPT/NPT entries
>
> No, KVM's page tables are not the source of truth. S-EPT is a special snowflake,
> and I'd like to avoid foisting the same requirements on NPT.
I agree this would be a new requirement.
>
> : 4) For SNP, if src != null, convert the target pfn to shared, copy
> : the contents, and then convert the target pfn back to private.
>
> Copying from userspace under spinlock (rwlock) is illegal, as accessing userspace
> memory might_fault() and thus might_sleep().
I would think that a combination of get_user_pages() and
kmap_local_pfn() would avoid the might_fault() problem, since the
source pages would be pinned and mapped before taking the lock.
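Roughly (a sketch only, with error handling trimmed; the copy/encrypt
step is a placeholder, and I've used the "fast" GUP variant since it
doesn't require mmap_lock to be held by the caller):

	struct page *src_page;
	void *src_va;
	int r;

	/* Pin the source page up front so that nothing can fault (and
	 * thus nothing can take mmap_lock) once mmu_lock is held.
	 */
	r = get_user_pages_fast(src_addr & PAGE_MASK, 1, 0, &src_page);
	if (r != 1)
		return r < 0 ? r : -EFAULT;

	read_lock(&kvm->mmu_lock);
	src_va = kmap_local_pfn(page_to_pfn(src_page));
	/* ... copy/encrypt from src_va into the target private pfn ... */
	kunmap_local(src_va);
	read_unlock(&kvm->mmu_lock);

	put_page(src_page);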
>
> : 5) For TDX, if src != null, pass the same address for source and
> : target (likely this works for CCA too)
> : 6) Invoke appropriate memory encryption operations
> : 7) measure contents
> : 8) release the KVM mmu read lock
> :
> : If this scheme works, ideally we should also not call RMP table
> : population logic from guest_memfd, but from KVM NPT fault handling
> : logic directly (a bit of cosmetic change).
>
> LOL, that's not a cosmetic change. It would be a non-trivial ABI change as KVM's
> ABI (ignoring S-EPT) is that userspace can delete and recreate memslots at will.
Ack, this is not a cosmetic change once we start thinking about how
memory ownership should be tied to memslots/NPT operations.
>
> : Ideally any outgoing interaction from guest_memfd to KVM should be only via
> : invalidation notifiers.
>
> Why? It's all KVM code. I don't see how this is any different than e.g. common
> code, arch code, and vendor code all calling into one another. Artificially
> limiting guest_memfd to a super generic interface pretty much defeats the whole
> purpose of having KVM provide a backing store.
In line with what we discussed on another thread, it makes sense to
think of guest_memfd not as a super generic interface, but as one
with a well-defined role: supplying memory to KVM guests (as well as
to userspace and the IOMMU) and taking that memory away when needed.
Memory population, in my opinion, is best solved either by users
asserting ownership of the memory and writing to it directly, or by
using guest_memfd's (to-be-)exposed APIs to populate memory ranges
from a source buffer. IMO kvm_gmem_populate() is doing something
different from both of these options.
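By the latter I mean something shaped roughly like this (purely
hypothetical, nothing like it exists today):

	/* Hypothetical guest_memfd-owned population API. */
	long kvm_gmem_populate_range(struct file *file, pgoff_t start,
				     unsigned long npages,
				     const void __user *src);

where guest_memfd itself owns the locking and the per-page copy, and
the arch hook only handles the encrypt/measure step.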