Message-ID: <5e2d0b4c-fd1b-41f0-a099-64c897b7ec6b@amd.com>
Date: Tue, 15 Jul 2025 12:07:13 +0530
From: "Aithal, Srikanth" <sraithal@....com>
To: Sean Christopherson <seanjc@...gle.com>, Zheyun Shen <szy0127@...u.edu.cn>
Cc: linux-next@...r.kernel.org, kvm@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [BUG] NULL pointer dereference in sev_writeback_caches during KVM
SEV migration kselftest on AMD platform
On 7/15/2025 3:47 AM, Sean Christopherson wrote:
> On Mon, Jul 14, 2025, Sean Christopherson wrote:
>> So as much as I want to avoid allocating another cpumask (ugh), it's the right
>> thing to do. And practically speaking, I doubt many real world users of SEV will
>> be using MAXSMP, i.e. the allocations don't exist anyways.
>>
>> Unless someone objects and/or has a better idea, I'll squash this:
>>
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 95668e84ab86..e39726d258b8 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -2072,6 +2072,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>> if (ret)
>> goto out_source_vcpu;
>>
>> + /*
>> + * Allocate a new have_run_cpus for the destination, i.e. don't copy
>> + * the set of CPUs from the source. If a CPU was used to run a vCPU in
>> + * the source VM but is never used for the destination VM, then the CPU
>> + * can only have cached memory that was accessible to the source VM.
>> + */
>> + if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> + ret = -ENOMEM;
>> + goto out_source_vcpu;
>> + }
>> +
>> sev_migrate_from(kvm, source_kvm);
>> kvm_vm_dead(source_kvm);
>> cg_cleanup_sev = src_sev;
>> @@ -2771,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
>> goto e_unlock;
>> }
>>
>> + mirror_sev = to_kvm_sev_info(kvm);
>> + if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
>> + ret = -ENOMEM;
>> + goto e_unlock;
>> + }
>> +
>> /*
>> * The mirror kvm holds an enc_context_owner ref so its asid can't
>> * disappear until we're done with it
>> */
>> source_sev = to_kvm_sev_info(source_kvm);
>> kvm_get_kvm(source_kvm);
>> - mirror_sev = to_kvm_sev_info(kvm);
>> list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>>
>> /* Set enc_context_owner and copy its encryption context over */
>
> This isn't quite right either, because sev_vm_destroy() won't free the cpumask
> for mirror VMs.
>
> Aha! And KVM will also unnecessarily leak have_run_cpus if SNP decommission
> fails (though that should be an extremely rare error scenario).
>
> KVM is guaranteed to have blasted WBINVD before reaching sev_vm_destroy() (see
> commit 7e00013bd339 "KVM: SVM: Remove wbinvd in sev_vm_destroy()"), so unless I'm
> missing something, KVM can simply free have_run_cpus at the start of sev_vm_destroy().
>
> Ooh, side topic! The fact that sev_vm_destroy() wasn't blasting WBINVD would
> have been a bug if not for kvm_arch_guest_memory_reclaimed() and
> kvm_arch_gmem_invalidate() taking care of mirror VMs.
>
> New hash for the patch:
>
> KVM: SVM: Flush cache only on CPUs running SEV guest
> https://github.com/kvm-x86/linux/commit/6f38f8c57464
The kselftest sev_migrate_tests passes with the current
https://github.com/kvm-x86/linux/tree/next (head 2a046f6), which has
commit 6f38f8c.
Reported-by: Srikanth Aithal <sraithal@....com>
Tested-by: Srikanth Aithal <sraithal@....com>
>
> And the full contexts of what I force-pushed:
>
> --
> From: Zheyun Shen <szy0127@...u.edu.cn>
> Date: Thu, 22 May 2025 16:37:32 -0700
> Subject: [PATCH] KVM: SVM: Flush cache only on CPUs running SEV guest
>
> On AMD CPUs that do not ensure cache coherency for encrypted memory, each
> memory page reclamation in an SEV guest triggers WBNOINVD/WBINVD on all
> CPUs, thereby affecting the performance of other programs on the host.
>
> Typically, an AMD server may have 128 cores or more, while the SEV guest
> might only utilize 8 of these cores. Meanwhile, the host can use qemu-affinity
> to bind these 8 vCPUs to specific physical CPUs.
>
> Therefore, keeping a record of the physical core numbers each time a vCPU
> runs can help avoid flushing the cache for all CPUs every time.
>
> Take care to allocate the cpumask used to track which CPUs have run a
> vCPU when copying or moving an "encryption context", as nothing guarantees
> memory in a mirror VM is a strict subset of the ASID owner, and the
> destination VM for intrahost migration needs to maintain its own set of
> CPUs. E.g. for intrahost migration, if a CPU was used for the source VM
> but not the destination VM, then it can only have cached memory that was
> accessible to the source VM. And a CPU that was run in the source and is
> also used by the destination is no different than a CPU that was run in
> the destination only.
>
> Note, KVM is guaranteed to flush caches prior to sev_vm_destroy(),
> thanks to kvm_arch_guest_memory_reclaimed() for SEV and SEV-ES, and
> kvm_arch_gmem_invalidate() for SEV-SNP. I.e. it's safe to free the
> cpumask prior to unregistering encrypted regions and freeing the ASID.
>
> Cc: Srikanth Aithal <sraithal@....com>
> Reviewed-by: Tom Lendacky <thomas.lendacky@....com>
> Signed-off-by: Zheyun Shen <szy0127@...u.edu.cn>
> Co-developed-by: Sean Christopherson <seanjc@...gle.com>
> Link: https://lore.kernel.org/r/20250522233733.3176144-9-seanjc@google.com
> Link: https://lore.kernel.org/all/935a82e3-f7ad-47d7-aaaf-f3d2b62ed768@amd.com
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
> arch/x86/kvm/svm/sev.c | 71 ++++++++++++++++++++++++++++++++++++------
> arch/x86/kvm/svm/svm.h | 1 +
> 2 files changed, 63 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index ed39f8a4d9df..a62cd27a4f45 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -447,7 +447,12 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
> init_args.probe = false;
> ret = sev_platform_init(&init_args);
> if (ret)
> - goto e_free;
> + goto e_free_asid;
> +
> + if (!zalloc_cpumask_var(&sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto e_free_asid;
> + }
>
> /* This needs to happen after SEV/SNP firmware initialization. */
> if (vm_type == KVM_X86_SNP_VM) {
> @@ -465,6 +470,8 @@ static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp,
> return 0;
>
> e_free:
> + free_cpumask_var(sev->have_run_cpus);
> +e_free_asid:
> argp->error = init_args.error;
> sev_asid_free(sev);
> sev->asid = 0;
> @@ -709,16 +716,31 @@ static void sev_clflush_pages(struct page *pages[], unsigned long npages)
> }
> }
>
> -static void sev_writeback_caches(void)
> +static void sev_writeback_caches(struct kvm *kvm)
> {
> + /*
> + * Note, the caller is responsible for ensuring correctness if the mask
> + * can be modified, e.g. if a CPU could be doing VMRUN.
> + */
> + if (cpumask_empty(to_kvm_sev_info(kvm)->have_run_cpus))
> + return;
> +
> /*
> * Ensure that all dirty guest tagged cache entries are written back
> * before releasing the pages back to the system for use. CLFLUSH will
> * not do this without SME_COHERENT, and flushing many cache lines
> * individually is slower than blasting WBINVD for large VMs, so issue
> - * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported).
> + * WBNOINVD (or WBINVD if the "no invalidate" variant is unsupported)
> + * on CPUs that have done VMRUN, i.e. may have dirtied data using the
> + * VM's ASID.
> + *
> + * For simplicity, never remove CPUs from the bitmap. Ideally, KVM
> + * would clear the mask when flushing caches, but doing so requires
> + * serializing multiple calls and having responding CPUs (to the IPI)
> + * mark themselves as still running if they are running (or about to
> + * run) a vCPU for the VM.
> */
> - wbnoinvd_on_all_cpus();
> + wbnoinvd_on_cpus_mask(to_kvm_sev_info(kvm)->have_run_cpus);
> }
>
> static unsigned long get_num_contig_pages(unsigned long idx,
> @@ -2046,6 +2068,17 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> if (ret)
> goto out_source_vcpu;
>
> + /*
> + * Allocate a new have_run_cpus for the destination, i.e. don't copy
> + * the set of CPUs from the source. If a CPU was used to run a vCPU in
> + * the source VM but is never used for the destination VM, then the CPU
> + * can only have cached memory that was accessible to the source VM.
> + */
> + if (!zalloc_cpumask_var(&dst_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto out_source_vcpu;
> + }
> +
> sev_migrate_from(kvm, source_kvm);
> kvm_vm_dead(source_kvm);
> cg_cleanup_sev = src_sev;
> @@ -2707,7 +2740,7 @@ int sev_mem_enc_unregister_region(struct kvm *kvm,
> goto failed;
> }
>
> - sev_writeback_caches();
> + sev_writeback_caches(kvm);
>
> __unregister_enc_region_locked(kvm, region);
>
> @@ -2749,13 +2782,18 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd)
> goto e_unlock;
> }
>
> + mirror_sev = to_kvm_sev_info(kvm);
> + if (!zalloc_cpumask_var(&mirror_sev->have_run_cpus, GFP_KERNEL_ACCOUNT)) {
> + ret = -ENOMEM;
> + goto e_unlock;
> + }
> +
> /*
> * The mirror kvm holds an enc_context_owner ref so its asid can't
> * disappear until we're done with it
> */
> source_sev = to_kvm_sev_info(source_kvm);
> kvm_get_kvm(source_kvm);
> - mirror_sev = to_kvm_sev_info(kvm);
> list_add_tail(&mirror_sev->mirror_entry, &source_sev->mirror_vms);
>
> /* Set enc_context_owner and copy its encryption context over */
> @@ -2817,7 +2855,13 @@ void sev_vm_destroy(struct kvm *kvm)
>
> WARN_ON(!list_empty(&sev->mirror_vms));
>
> - /* If this is a mirror_kvm release the enc_context_owner and skip sev cleanup */
> + free_cpumask_var(sev->have_run_cpus);
> +
> + /*
> +	 * If this is a mirror VM, remove it from the owner's list of mirrors
> + * and skip ASID cleanup (the ASID is tied to the lifetime of the owner).
> + * Note, mirror VMs don't support registering encrypted regions.
> + */
> if (is_mirroring_enc_context(kvm)) {
> struct kvm *owner_kvm = sev->enc_context_owner;
>
> @@ -3106,7 +3150,7 @@ static void sev_flush_encrypted_page(struct kvm_vcpu *vcpu, void *va)
> return;
>
> do_sev_writeback_caches:
> - sev_writeback_caches();
> + sev_writeback_caches(vcpu->kvm);
> }
>
> void sev_guest_memory_reclaimed(struct kvm *kvm)
> @@ -3119,7 +3163,7 @@ void sev_guest_memory_reclaimed(struct kvm *kvm)
> if (!sev_guest(kvm) || sev_snp_guest(kvm))
> return;
>
> - sev_writeback_caches();
> + sev_writeback_caches(kvm);
> }
>
> void sev_free_vcpu(struct kvm_vcpu *vcpu)
> @@ -3451,6 +3495,15 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
> if (sev_es_guest(kvm) && !VALID_PAGE(svm->vmcb->control.vmsa_pa))
> return -EINVAL;
>
> + /*
> + * To optimize cache flushes when memory is reclaimed from an SEV VM,
> + * track physical CPUs that enter the guest for SEV VMs and thus can
> + * have encrypted, dirty data in the cache, and flush caches only for
> + * CPUs that have entered the guest.
> + */
> + if (!cpumask_test_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus))
> + cpumask_set_cpu(cpu, to_kvm_sev_info(kvm)->have_run_cpus);
> +
> /* Assign the asid allocated with this SEV guest */
> svm->asid = asid;
>
> diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
> index e6f3c6a153a0..a7c6f07260cf 100644
> --- a/arch/x86/kvm/svm/svm.h
> +++ b/arch/x86/kvm/svm/svm.h
> @@ -113,6 +113,7 @@ struct kvm_sev_info {
> void *guest_req_buf; /* Bounce buffer for SNP Guest Request input */
> void *guest_resp_buf; /* Bounce buffer for SNP Guest Request output */
> struct mutex guest_req_mutex; /* Must acquire before using bounce buffers */
> + cpumask_var_t have_run_cpus; /* CPUs that have done VMRUN for this VM. */
> };
>
> #define SEV_POLICY_NODBG BIT_ULL(0)
>
> base-commit: a77896eea33db6fe393d1db1380e2e52f74546a2
> --