[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <qlqbmmf2kfk5i7slue42xrb2f7il2scrqgx3xlsvxf4ujuqujb@v454fgzwhn7l>
Date: Wed, 17 Jul 2024 01:42:35 -0500
From: Michael Roth <michael.roth@....com>
To: "Edgecombe, Rick P" <rick.p.edgecombe@...el.com>
CC: "pbonzini@...hat.com" <pbonzini@...hat.com>, "kvm@...r.kernel.org"
<kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "seanjc@...gle.com" <seanjc@...gle.com>
Subject: Re: [PATCH 09/12] KVM: guest_memfd: move check for already-populated
page to common code
On Mon, Jul 15, 2024 at 10:57:35PM +0000, Edgecombe, Rick P wrote:
> On Mon, 2024-07-15 at 16:47 -0500, Michael Roth wrote:
> > Makes sense.
> >
> > If we document mutual exclusion between ranges touched by
> > gmem_populate() vs ranges touched by actual userspace issuance of
> > KVM_PRE_FAULT_MEMORY (and make sure nothing too crazy happens if users
> > don't abide by the documentation), then I think most problems go away...
> >
> > But there is still at least one awkward constraint for SNP:
> > KVM_PRE_FAULT_MEMORY cannot be called for private GPA ranges until after
> > SNP_LAUNCH_START is called. This is true even if the GPA range is not
> > one of the ranges that will get passed to
> > gmem_populate()/SNP_LAUNCH_UPDATE. The reason for this is that when
> > binding the ASID to the SNP context as part of SNP_LAUNCH_START, firmware
> > will perform checks to make sure that ASID is not already being used in
> > the RMP table, and that check will fail if KVM_PRE_FAULT_MEMORY triggered
> > for a private page before calling SNP_LAUNCH_START.
> >
> > So we have this constraint that KVM_PRE_FAULT_MEMORY can't be issued
> > before SNP_LAUNCH_START. So it makes me wonder if we should just broaden
> > that constraint and for now just disallow KVM_PRE_FAULT_MEMORY prior to
> > finalizing a guest, since it'll be easier to lift that restriction later
> > versus discovering some other sort of edge case and need to
> > retroactively place restrictions.
> >
> > I've taken Isaku's original pre_fault_memory_test and added a new
> > x86-specific coco_pre_fault_memory_test to try to better document and
> > exercise these corner cases for SEV and SNP, but I'm hoping it could
> > also be useful for TDX (hence the generic name). These are based on
> > Pratik's initial SNP selftests (which are in turn based on kvm/queue +
> > these patches):
> >
> > https://github.com/mdroth/linux/commits/snp-uptodate0-kst/
> >
> > https://github.com/mdroth/linux/commit/dd7d4b1983ceeb653132cfd54ad63f597db85f74
> >
> >
>
> From the TDX side it wouldn't be horrible to not have to worry about userspace
> mucking around with the mirrored page tables in unexpected ways during the
> special period. TDX already has its own "finalized" state in kvm_tdx, is there
> something similar on the SEV side we could unify with?
While trying to doing pre-mapping in sev_gmem_post_populate() like you've
done below for TDX, I hit another issue that I think would be avoided by
enforcing finalization (or otherwise needs some alternative fix within
this series).
I'd already mentioned the issue with gmem_prepare() putting pages in an
unexpected RMP state if called prior to initializing the same GPA range
in gmem_populate(). This get handled gracefully however when the
firmware call is issued to encrypt/measure the pages and it re-checks
the page's RMP state.
However, if another thread is issuing KVM_PRE_FAULT_MEMORY and triggers
a gmem_prepare() just after gmem_populate() places the pages in a
private RMP state, then gmem_prepare()->kvm_gmem_get_pfn() will trigger
the clear_highpage() call in kvm_gmem_prepare_folio() because the
uptodate flag isn't set until after sev_gmem_post_populate() returns.
So I ended up just calling kvm_arch_vcpu_pre_fault_memory() on the full
GPA range after the post-populate callback returns:
https://github.com/mdroth/linux/commit/da4e7465ced1a708ff1c5e9ab27c570de7e8974e
...
But a much larger concern is that even without this series, but just
plain kvm/queue, KVM_PRE_FAULT_MEMORY can still race with
sev_gmem_post_populate() in bad ways, so I think for 6.11 we should
consider locking this finalization requirement in and applying
something like the below fix:
https://github.com/mdroth/linux/commit/7095ba6ee8f050af11620daaf6c2219fd0bbb1c3
Or if that's potentially too big a chance to decide right now, we could
just make KVM_PRE_FAULT_MEMORY return EOPNOTSUPP for
kvm_arch_has_private_memory() until we've had more time to work out the
missing bits for CoCo there.
-Mike
>
> I looked at moving from kvm_arch_vcpu_pre_fault_memory() to directly calling
> kvm_tdp_map_page(), so we could potentially put whatever check in
> kvm_arch_vcpu_pre_fault_memory(). It required a couple exports:
>
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 03737f3aaeeb..9004ac597a85 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -277,6 +277,7 @@ extern bool tdp_mmu_enabled;
> #endif
>
> int kvm_tdp_mmu_get_walk_mirror_pfn(struct kvm_vcpu *vcpu, u64 gpa, kvm_pfn_t
> *pfn);
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8
> *level);
>
> static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
> {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7bb6b17b455f..4a3e471ec9fe 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4721,8 +4721,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct
> kvm_page_fault *fault)
> return direct_page_fault(vcpu, fault);
> }
>
> -static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code,
> - u8 *level)
> +int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t gpa, u64 error_code, u8
> *level)
> {
> int r;
>
> @@ -4759,6 +4758,7 @@ static int kvm_tdp_map_page(struct kvm_vcpu *vcpu, gpa_t
> gpa, u64 error_code,
> return -EIO;
> }
> }
> +EXPORT_SYMBOL_GPL(kvm_tdp_map_page);
>
> long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> struct kvm_pre_fault_memory *range)
> @@ -5770,6 +5770,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
> out:
> return r;
> }
> +EXPORT_SYMBOL_GPL(kvm_mmu_load);
>
> void kvm_mmu_unload(struct kvm_vcpu *vcpu)
> {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 9ac0821eb44b..7161ef68f3da 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -2809,11 +2809,13 @@ struct tdx_gmem_post_populate_arg {
> static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
> void __user *src, int order, void *_arg)
> {
> + u64 error_code = PFERR_GUEST_FINAL_MASK | PFERR_PRIVATE_ACCESS;
> struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> struct tdx_gmem_post_populate_arg *arg = _arg;
> struct kvm_vcpu *vcpu = arg->vcpu;
> struct kvm_memory_slot *slot;
> gpa_t gpa = gfn_to_gpa(gfn);
> + u8 level = PG_LEVEL_4K;
> struct page *page;
> kvm_pfn_t mmu_pfn;
> int ret, i;
> @@ -2832,6 +2834,10 @@ static int tdx_gmem_post_populate(struct kvm *kvm, gfn_t
> gfn, kvm_pfn_t pfn,
> goto out_put_page;
> }
>
> + ret = kvm_tdp_map_page(vcpu, gpa, error_code, &level);
> + if (ret < 0)
> + goto out_put_page;
> +
> read_lock(&kvm->mmu_lock);
>
> ret = kvm_tdp_mmu_get_walk_mirror_pfn(vcpu, gpa, &mmu_pfn);
> @@ -2910,6 +2916,7 @@ static int tdx_vcpu_init_mem_region(struct kvm_vcpu *vcpu,
> struct kvm_tdx_cmd *c
> mutex_lock(&kvm->slots_lock);
> idx = srcu_read_lock(&kvm->srcu);
>
> + kvm_mmu_reload(vcpu);
> ret = 0;
> while (region.nr_pages) {
> if (signal_pending(current)) {
>
Powered by blists - more mailing lists