Message-ID: <5e2d0b4c-fd1b-41f0-a099-64c897b7ec6b@amd.com>
Date: Tue, 15 Jul 2025 12:07:13 +0530
From: "Aithal, Srikanth" <sraithal@....com>
To: Sean Christopherson <seanjc@...gle.com>, Zheyun Shen <szy0127@...u.edu.cn>
Cc: linux-next@...r.kernel.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM
SEV migration kselftest on AMD platform
On 7/15/2025 3:47 AM, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Sean Christopherson wrote:
>> So as much as I want to avoid allocating another cpumask (ugh), it's the right
>> thing to do. And practically speaking, I doubt many real world users of SEV will
>> be using MAXSMP, i.e. the allocations don't exist anyways.
>>
>> Unless someone objects and/or has a better idea, I'll squash this:
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 95668e84ab86..e39726d258b8 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2072,6 +2072,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>> if (ret)
>> goto out_source_vcpu;
>>
>> + /*
>> + * Allocate a new have_run_cpus for the destination, i.e. don't copy
>> + * the set of CPUs from the source. If a CPU was used to run a vCPU in
>> + * the source VM but is never used for the destination VM, then the CPU
>> + * can only have cached memory that was accessible to the source VM.
>> + */
>> + if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> + ret = -ENOMEM;
>> + goto out_source_vcpu;
>> + }
>> +
>> sev_migrate_from(kvm, source_kvm);
>> kvm_vm_dead(source_kvm);
>> cg_cleanup_sev = src_sev;
>> @@ -2771,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>> goto e_unlock;
>> }
>>
>> + mirror_sev = to_kvm_sev_info(kvm);
>> + if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> + ret = -ENOMEM;
>> + goto e_unlock;
>> + }
>> +
>> /*
>> * The mirror kvm holds an enc_context_owner ref so its asid can't
>> * disappear until we're done with it
>> */
>> source_sev = to_kvm_sev_info(source_kvm);
>> kvm_get_kvm(source_kvm);
>> - mirror_sev = to_kvm_sev_info(kvm);
>> list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>>
>> /* Set enc_context_owner and copy its encryption context over */
>
> This isn't quite right either, because sev_vm_destroy() won't free the cpumask
> for mirror VMs.
>
> Aha! And KVM will also unnecessarily leak have_run_cpus if SNP decommission
> fails (though that should be an extremely rare error scenario).
>
> KVM is guaranteed to have blasted WBINVD before reaching sev_vm_destroy() (see
> commit 7e00013bd339 "KVM: SVM: Remove wbinvd in sev_vm_destroy()"), so unless I'm
> missing something, KVM can simply free have_run_cpus at the start of sev_vm_destroy().
>
> Ooh, side topic! The fact that sev_vm_destroy() wasn't blasting WBINVD would
> have been a bug if not for kvm_arch_guest_memory_reclaimed() and
> kvm_arch_gmem_invalidate() taking care of mirror VMs.
>
> New hash for the patch:
>
> KVM: SVM: Flush cache only on CPUs running SEV guest
> https://github.com/kvm-x86/linux/commit/6f38f8c57464
The kselftest sev_migrate_tests passes with the current
https://github.com/kvm-x86/linux/tree/next (head 2a046f6), which has
commit 6f38f8c.
Reported-by: Srikanth Aithal <sraithal@....com>
Tested-by: Srikanth Aithal <sraithal@....com>
>
> And the full contexts of what I force-pushed:
>
> --
> From: Zheyun Shen <szy0127@...u.edu.cn>
> Date: Thu, 22 May 2025 16:37:32 -0700
> Subject: [PATCH] KVM: SVM: Flush cache only on CPUs running SEV guest
>
> On AMD CPUs that do not ensure cache coherency for encrypted memory, each
> memory page reclamation in an SEV guest triggers WBNOINVD/WBINVD on all
> CPUs, thereby affecting the performance of other programs on the host.
>
> Typically, an AMD server may have 128 cores or more, while the SEV guest
> might only utilize 8 of these cores. Meanwhile, the host can use qemu-affinity
> to bind these 8 vCPUs to specific physical CPUs.
>
> Therefore, keeping a record of the physical core numbers each time a vCPU
> runs can help avoid flushing the cache for all CPUs every time.
>
> Take care to allocate the cpumask used to track which CPUs have run a
> vCPU when copying or moving an "encryption context", as nothing guarantees
> memory in a mirror VM is a strict subset of the ASID owner, and the
> destination VM for intrahost migration needs to maintain its own set of
> CPUs. E.g. for intrahost migration, if a CPU was used for the source VM
> but not the destination VM, then it can only have cached memory that was
> accessible to the source VM. And a CPU that was run in the source and is
> also used by the destination is no different than a CPU that was run in
> the destination only.
>
> Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(),
> thanks to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and
> kvm_arch_gmem_invalidate() for SEV-SNP. I.e. it's safe to free the
> cpumask prior to unregistering encrypted regions and freeing the ASID.
>
> Cc: Srikanth Aithal <sraithal@....com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
> Signed-off-by: Zheyun Shen <szy0127@...u.edu.cn>
> Co-developed-by: Sean Christopherson <seanjc@...gle.com>
> Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
> Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/svm/sev.c | 71 ++++++++++++++++++++++++++++++++++++------
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 63 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index ed39f8a4d9df..a62cd27a4f45 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -447,7 +447,12 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
> init_args.probe = false;
> ret = sev_platform_init(&init_args);
> if (ret)
> - goto e_free;
> + goto e_free_asid;
> +
> + if (!zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto e_free_asid;
> + }
>
> /* This needs to happen after SEV/SNP firmware initialization. */
> if (vm_type == KVM_X86_SNP_VM) {
> @@ -465,6 +470,8 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
> return 0;
>
> e_free:
> + free_cpumask_var(sev->have_run_cpus);
> +e_free_asid:
> argp->error = init_args.error;
> sev_asid_free(sev);
> sev->asid = 0;
> @@ -709,16 +716,31 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
> }
> }
>
> -static void sev_writeback_caches(void)
> +static void sev_writeback_caches(struct kvm *kvm)
> {
> + /*
> + * Note, the caller is responsible for ensuring correctness if the mask
> + * can be modified, e.g. if a CPU could be doing VMRUN.
> + */
> + if (cpumask_empty(to_kvm_sev_info(kvm)->have_run_cpus))
> + return;
> +
> /*
> * Ensure that all dirty guest tagged cache entries are written back
> * before releasing the pages back to the system for use. CLFLUSH will
> * not do this without SME_COHERENT, and flushing many cache lines
> * individually is slower than blasting WBINVD for large VMs, so issue
> - * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported).
> + * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported)
> + * on CPUs that have done VMRUN, i.e. may have dirtied data using the
> + * VM's ASID.
> + *
> + * For simplicity, never remove CPUs from the bitmap. Ideally, KVM
> + * would clear the mask when flushing caches, but doing so requires
> + * serializing multiple calls and having responding CPUs (to the IPI)
> + * mark themselves as still running if they are running (or about to
> + * run) a vCPU for the VM.
> */
> - wbnoinvd_on_all_cpus();
> + wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
> }
>
> static unsigned long get_num_contig_pages(unsigned long idx,
> @@ -2046,6 +2068,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> if (ret)
> goto out_source_vcpu;
>
> + /*
> + * Allocate a new have_run_cpus for the destination, i.e. don't copy
> + * the set of CPUs from the source. If a CPU was used to run a vCPU in
> + * the source VM but is never used for the destination VM, then the CPU
> + * can only have cached memory that was accessible to the source VM.
> + */
> + if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto out_source_vcpu;
> + }
> +
> sev_migrate_from(kvm, source_kvm);
> kvm_vm_dead(source_kvm);
> cg_cleanup_sev = src_sev;
> @@ -2707,7 +2740,7 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
> goto failed;
> }
>
> - sev_writeback_caches();
> + sev_writeback_caches(kvm);
>
> __unregister_enc_region_locked(kvm, region);
>
> @@ -2749,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> goto e_unlock;
> }
>
> + mirror_sev = to_kvm_sev_info(kvm);
> + if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto e_unlock;
> + }
> +
> /*
> * The mirror kvm holds an enc_context_owner ref so its asid can't
> * disappear until we're done with it
> */
> source_sev = to_kvm_sev_info(source_kvm);
> kvm_get_kvm(source_kvm);
> - mirror_sev = to_kvm_sev_info(kvm);
> list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>
> /* Set enc_context_owner and copy its encryption context over */
> @@ -2817,7 +2855,13 @@ void sev_vm_destroy(struct kvm *kvm)
>
> WARN_ON(!list_empty(&sev->mirror_vms));
>
> - /* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
> + free_cpumask_var(sev->have_run_cpus);
> +
> + /*
> +	 * If this is a mirror VM, remove it from the owner's list of mirrors
> + * and skip ASID cleanup (the ASID is tied to the lifetime of the owner).
> + * Note, mirror VMs don't support registering encrypted regions.
> + */
> if (is_mirroring_enc_context(kvm)) {
> struct kvm *owner_kvm = sev->enc_context_owner;
>
> @@ -3106,7 +3150,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
> return;
>
> do_sev_writeback_caches:
> - sev_writeback_caches();
> + sev_writeback_caches(vcpu->kvm);
> }
>
> void sev_guest_memory_reclaimed(struct kvm *kvm)
> @@ -3119,7 +3163,7 @@ void sev_guest_memory_reclaimed(struct kvm *kvm)
> if (!sev_guest(kvm) || sev_snp_guest(kvm))
> return;
>
> - sev_writeback_caches();
> + sev_writeback_caches(kvm);
> }
>
> void sev_free_vcpu(struct kvm_vcpu *vcpu)
> @@ -3451,6 +3495,15 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
> if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa))
> return -EINVAL;
>
> + /*
> + * To optimize cache flushes when memory is reclaimed from an SEV VM,
> + * track physical CPUs that enter the guest for SEV VMs and thus can
> + * have encrypted, dirty data in the cache, and flush caches only for
> + * CPUs that have entered the guest.
> + */
> + if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
> + cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
> +
> /* Assign the asid allocated with this SEV guest */
> svm->asid = asid;
>
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index e6f3c6a153a0..a7c6f07260cf 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -113,6 +113,7 @@ struct kvm_sev_info {
> void *guest_req_buf; /* Bounce buffer for SNP Guest Request input */
> void *guest_resp_buf; /* Bounce buffer for SNP Guest Request output */
> struct mutex guest_req_mutex; /* Must acquire before using bounce buffers */
> + cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */
> };
>
> #define SEV_POLICY_NODBG BIT_ULL(0)
>
> base-commit: a77896eea33db6fe393d1db1380e2e52f74546a2
> --