linux-kernel - Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP MMU under MMU read lock

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ZsPE56MnelsV490m@google.com>
Date: Mon, 19 Aug 2024 15:19:19 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Vipin Sharma <vipinsh@...gle.com>
Cc: pbonzini@...hat.com, dmatlack@...gle.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] KVM: x86/mmu: Recover NX Huge pages belonging to TDP
 MMU under MMU read lock

On Mon, Aug 12, 2024, Vipin Sharma wrote:
> Use MMU read lock to recover NX huge pages belonging to TDP MMU. Iterate
> through kvm->arch.possible_nx_huge_pages while holding
> kvm->arch.tdp_mmu_pages_lock. Rename kvm_tdp_mmu_zap_sp() to
> tdp_mmu_zap_sp() and make it static as there are no callers outside of
> TDP MMU.
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 933bb8b11c9f..7c7d207ee590 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -817,9 +817,11 @@ static void tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  	rcu_read_unlock();
>  }
>  
> -bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> +static bool tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)

At this point, I think we should rename this to tdp_mmu_zap_possible_nx_huge_page(),
as I can't imagine there's another use case where we'll zap a SP starting from the
SP itself, i.e. without first walking from the root.

>  {
> -	u64 old_spte;
> +	struct tdp_iter iter = {};

Rather than initializes the on-stack structure, I think it makes sense to directly
initialize the whole thing and then WARN after, e.g. so that its easier to see
that "iter" is simply being filled from @sp.

	struct tdp_iter iter = {
		.old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
		.sptep = sp->ptep,
		.level = sp->role.level + 1,
		.gfn = sp->gfn,
		.as_id = kvm_mmu_page_as_id(sp),
	};

	lockdep_assert_held_read(&kvm->mmu_lock);

	/*
	 * Root shadow pages don't a parent page table and thus no associated
	 * entry, but they can never be possible NX huge pages.
	 */
	if (WARN_ON_ONCE(!sp->ptep))
		return false;

> +
> +	lockdep_assert_held_read(&kvm->mmu_lock);
>  
>  	/*
>  	 * This helper intentionally doesn't allow zapping a root shadow page,
> @@ -828,12 +830,25 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	if (WARN_ON_ONCE(!sp->ptep))
>  		return false;
>  
> -	old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> -	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
> +	iter.old_spte = kvm_tdp_mmu_read_spte(sp->ptep);
> +	iter.sptep = sp->ptep;
> +	iter.level = sp->role.level + 1;
> +	iter.gfn = sp->gfn;
> +	iter.as_id = kvm_mmu_page_as_id(sp);
> +
> +retry:
> +	/*
> +	 * Since mmu_lock is held in read mode, it's possible to race with
> +	 * another CPU which can remove sp from the page table hierarchy.
> +	 *
> +	 * No need to re-read iter.old_spte as tdp_mmu_set_spte_atomic() will
> +	 * update it in the case of failure.
> +	 */
> +	if (sp->spt != spte_to_child_pt(iter.old_spte, iter.level))
>  		return false;
>  
> -	tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> -			 SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1);
> +	if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
> +		goto retry;

I'm pretty sure there's no need to retry.  Non-leaf SPTEs don't have Dirty bits,
and KVM always sets the Accessed bit (and never clears it) for non-leaf SPTEs.
Ditty for the Writable bit.

So unless I'm missing something, the only way for the CMPXCHG to fail is if the
SPTE was zapped or replaced with something else, in which case the sp->spt will
fail.  I would much prefer to WARN on that logic failing than have what appears
to be a potential infinite loop.