Message-ID: <aFXrFKvZcJ3dN4k_@google.com>
Date: Fri, 20 Jun 2025 16:13:24 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Jim Mattson <jmattson@...gle.com>,
Maxim Levitsky <mlevitsk@...hat.com>, Vitaly Kuznetsov <vkuznets@...hat.com>,
Rik van Riel <riel@...riel.com>, Tom Lendacky <thomas.lendacky@....com>, x86@...nel.org,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH 06/24] KVM: SEV: Track ASID->vCPU instead of ASID->VMCB

On Wed, Mar 26, 2025, Yosry Ahmed wrote:
> SEV currently tracks the ASID to VMCB mapping for each physical CPU.
> This is required to flush the ASID when a new VMCB using the same ASID
> is run on the same CPU. Practically, there is a single VMCB for each
> vCPU using SEV. Furthermore, TLB flushes on nested transitions between
> VMCB01 and VMCB02 are handled separately (see
> nested_svm_transition_tlb_flush()).
>
> In preparation for generalizing the tracking and making the tracking
> more expensive, start tracking the ASID to vCPU mapping instead. This
> will allow for the tracking to be moved to a cheaper code path when
> vCPUs are switched.
>
> Signed-off-by: Yosry Ahmed <yosry.ahmed@...ux.dev>
> ---
> arch/x86/kvm/svm/sev.c | 12 ++++++------
> arch/x86/kvm/svm/svm.c | 2 +-
> arch/x86/kvm/svm/svm.h | 4 ++--
> 3 files changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index d613f81addf1c..ddb4d5b211ed7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -240,7 +240,7 @@ static void sev_asid_free(struct kvm_sev_info *sev)
>
> for_each_possible_cpu(cpu) {
> sd = per_cpu_ptr(&svm_data, cpu);
> - sd->sev_vmcbs[sev->asid] = NULL;
> + sd->sev_vcpus[sev->asid] = NULL;
> }
>
> mutex_unlock(&sev_bitmap_lock);
> @@ -3081,8 +3081,8 @@ int sev_cpu_init(struct svm_cpu_data *sd)
> if (!sev_enabled)
> return 0;
>
> - sd->sev_vmcbs = kcalloc(nr_asids, sizeof(void *), GFP_KERNEL);
> - if (!sd->sev_vmcbs)
> + sd->sev_vcpus = kcalloc(nr_asids, sizeof(void *), GFP_KERNEL);
> + if (!sd->sev_vcpus)
> return -ENOMEM;
>
> return 0;
> @@ -3471,14 +3471,14 @@ int pre_sev_run(struct vcpu_svm *svm, int cpu)
> /*
> * Flush guest TLB:
> *
> - * 1) when different VMCB for the same ASID is to be run on the same host CPU.
> + * 1) when different vCPU for the same ASID is to be run on the same host CPU.
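
For context, the tracking change boils down to the per-CPU "ASID owner" check in
pre_sev_run() ending up roughly like the sketch below (reconstructed from the
changelog for illustration, not quoted from the patch):

        struct svm_cpu_data *sd = per_cpu_ptr(&svm_data, cpu);
        unsigned int asid = sev_get_asid(svm->vcpu.kvm);

        /*
         * Skip the TLB flush only if this vCPU was the last vCPU to run
         * this ASID on this pCPU, and the vCPU hasn't run elsewhere since.
         */
        if (sd->sev_vcpus[asid] == &svm->vcpu &&
            svm->vcpu.arch.last_vmentry_cpu == cpu)
                return 0;

        sd->sev_vcpus[asid] = &svm->vcpu;
        svm->vmcb->control.tlb_ctl = TLB_CONTROL_FLUSH_ASID;
        vmcb_mark_dirty(svm->vmcb, VMCB_ASID);
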
Tom, can you clarify what an ASID actually tags when NPT is in use?

The more I think about all of this, the less it makes sense. The *entire* point
of an ASID is to tag TLB entries so that a flush isn't required when running code
for the same address space.

The main problem I'm struggling with is that, as usual, the APM doesn't properly
document anything, and just gives "suggestions" for the VMM. *sigh*

As I read it, these snippets from the APM are saying ASIDs tag only GPA=>PA entries
when NPT is in use.

  TLB entries are tagged with Address Space Identifier (ASID) bits to distinguish
  different guest virtual address spaces when shadow page tables are used, or
  different guest physical address spaces when nested page tables are used. The
  VMM can choose a software strategy in which it keeps multiple shadow page tables,
  and/or multiple nested page tables in processors that support nested paging,
  up-to-date; the VMM can allocate a different ASID for each shadow or nested
  page table. This allows switching to a new process in a guest under shadow
  paging (changing CR3 contents), or to a new guest under nested paging (changing
  nCR3 contents), without flushing the TLBs.

  Note that because an ASID is associated with the guest's physical address
  space, it is common across all of the guest's virtual address spaces within a
  processor. This differs from shadow page tables where ASIDs tag individual
  guest virtual address spaces. Note also that the same ASID may or may not be
  associated with the same address space across all processors in a
  multiprocessor system, for either nested tables or shadow tables; this depends
  on how the VMM manages ASID assignment.

But then the "15.16.1 TLB Flush" section says this, without any qualification
whatsoever that it applies only to shadow paging.

  A MOV-to-CR3, a task switch that changes CR3, or clearing or setting CR0.PG or
  bits PGE, PAE, PSE of CR4 affects only the TLB entries belonging to the current
  ASID, regardless of whether the operation occurred in host or guest mode. The
  current ASID is 0 when the CPU is not inside a guest context.

And honestly, tagging only GPA=>PA entries doesn't make any sense, because
GVA=>GPA needs to be tagged with *something*. And the APM doesn't say anything
about caching GPA=>PA translations, only about caching VA=>PA.

The thing that doesn't fit is that SEV+ uses ASIDs on a per-VM basis. I suggested
per-VM ASIDs for all VM types based solely on that fact, but now I'm wondering if
it's SEV+ that's crazy and broken. Because if ASIDs also tag GVA=>GPA, then SEV has
a massive architectural security hole, e.g. a malicious hypervisor can coerce the
CPU into using a stale GVA=>GPA TLB entry by switching vCPUs and letting a guest
process with CR3=x access memory belonging to a guest process with CR3=y. But
again, if ASIDs don't tag GVA=>GPA, then what provides isolation between vCPUs!?!?!

Assuming ASIDs tag VA=>PA (i.e. the combined GVA=>GPA=>PA translation), then I
take back my suggestion to use per-VM ASIDs, and instead propose we treat ASIDs
like VPIDs, i.e. use per-vCPU ASIDs (well, technically per-VMCB, because nested).

All server SKUs I've checked (Rome, Milan, Genoa, and Turin) support 0x8000 ASIDs.
That makes the ~512 ASIDs reserved for SEV+ a drop in the bucket, i.e. still leaves
~32k ASIDs up for grabs, which means KVM can concurrently run ~16k vCPUs *with*
nested VMs (two ASIDs per vCPU, one each for VMCB01 and VMCB02) without having to
fall back to flushing when scheduling in a new vCPU.
That way we don't need the complexity of the xarray ASID=>vCPU tracking for common
code. And as a bonus, the logic for VMX vs. SVM is very, very similar.
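
To make that concrete, the allocation could mirror VMX's allocate_vpid()/free_vpid()
bitmap approach, e.g. something like the below (purely a sketch; the names and the
exact carve-out for the SEV+ range are made up, not actual KVM code):

        /* Dynamically-assigned ASIDs live above the SEV+ range. */
        #define SVM_NR_ASIDS            0x8000
        static DECLARE_BITMAP(svm_asid_bitmap, SVM_NR_ASIDS);
        static DEFINE_SPINLOCK(svm_asid_lock);

        static unsigned int svm_alloc_asid(void)
        {
                unsigned int asid;

                spin_lock(&svm_asid_lock);
                asid = find_next_zero_bit(svm_asid_bitmap, SVM_NR_ASIDS,
                                          max_sev_asid + 1);
                if (asid < SVM_NR_ASIDS)
                        __set_bit(asid, svm_asid_bitmap);
                else
                        asid = 0;       /* exhausted; caller needs a flush-based fallback */
                spin_unlock(&svm_asid_lock);

                return asid;
        }

        static void svm_free_asid(unsigned int asid)
        {
                if (!asid)
                        return;

                spin_lock(&svm_asid_lock);
                __clear_bit(asid, svm_asid_bitmap);
                spin_unlock(&svm_asid_lock);
        }
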
Given that SEV+ uses the ASID as the handle for the guest's encryption key,
changing that behavior isn't an option. Though I still don't see how that isn't
a security flaw. Regardless, I definitely don't think it's something we should
follow.