linux-kernel - Re: [PATCH v2] KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock

Message-ID: <2476ccf6-1542-c7ad-993b-7d1f703d390e@redhat.com>
Date:   Fri, 13 Aug 2021 09:36:28 +0200
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Ben Gardon <bgardon@...gle.com>
Subject: Re: [PATCH v2] KVM: x86/mmu: Protect marking SPs unsync when using
 TDP MMU with spinlock

On 12/08/21 20:18, Sean Christopherson wrote:
> Add yet another spinlock for the TDP MMU and take it when marking indirect
> shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
> nested TDP, KVM may encounter shadow pages for the TDP entries managed by
> L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
> is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
> misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
> which runs with mmu_lock held for read, not write.
> 
> Lack of a critical section manifests most visibly as an underflow of
> unsync_children in clear_unsync_child_bit() due to unsync_children being
> corrupted when multiple CPUs write it without a critical section and
> without atomic operations.  But underflow is the best case scenario.  The
> worst case scenario is that unsync_children prematurely hits '0' and
> leads to guest memory corruption due to KVM neglecting to properly sync
> shadow pages.
> 
> Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
> would functionally be ok.  Usurping the lock could degrade performance when
> building upper level page tables on different vCPUs, especially since the
> unsync flow could hold the lock for a comparatively long time depending on
> the number of indirect shadow pages and the depth of the paging tree.
> 
> For simplicity, take the lock for all MMUs, even though KVM could fairly
> easily know that mmu_lock is held for write.  If mmu_lock is held for
> write, there cannot be contention for the inner spinlock, and marking
> shadow pages unsync across multiple vCPUs will be slow enough that
> bouncing the kvm_arch cacheline should be in the noise.
> 
> Note, even though L2 could theoretically be given access to its own EPT
> entries, a nested MMU must hold mmu_lock for write and thus cannot race
> against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
> be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
> that is running with the TDP MMU enabled.  Holding mmu_lock for read also
> prevents the indirect shadow page from being freed.  But as above, keep
> it simple and always take the lock.
> 
> Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
> effectively disable unsync behavior for nested TDP.  Write protecting leaf
> shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
> VMMs typically don't modify TDP entries, but the same may not hold true for
> non-standard use cases and/or VMMs that are migrating physical pages (from
> L1's perspective).
> 
> Alternative #2, the unsync logic could be made thread safe.  In theory,
> simply converting all relevant kvm_mmu_page fields to atomics and using
> atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
> would be required, (b) the code churn would be substantial, and (c) legacy
> shadow paging would incur additional atomic operations in performance
> sensitive paths for no benefit (to legacy shadow paging).
> 
> Fixes: a2855afc7ee8 ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
> Cc: stable@...r.kernel.org
> Cc: Ben Gardon <bgardon@...gle.com>
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>   Documentation/virt/kvm/locking.rst |  8 ++++----
>   arch/x86/include/asm/kvm_host.h    |  7 +++++++
>   arch/x86/kvm/mmu/mmu.c             | 28 ++++++++++++++++++++++++++++
>   3 files changed, 39 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
> index 8138201efb09..5d27da356836 100644
> --- a/Documentation/virt/kvm/locking.rst
> +++ b/Documentation/virt/kvm/locking.rst
> @@ -31,10 +31,10 @@ On x86:
>   
>   - vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
>   
> -- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock is
> -  taken inside kvm->arch.mmu_lock, and cannot be taken without already
> -  holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
> -  there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
> +- kvm->arch.mmu_lock is an rwlock.  kvm->arch.tdp_mmu_pages_lock and
> +  kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
> +  cannot be taken without already holding kvm->arch.mmu_lock (typically with
> +  ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
>   
>   Everything else is a leaf: no other lock is taken inside the critical
>   sections.
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 20daaf67a5bf..cf32b87b6bd3 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1036,6 +1036,13 @@ struct kvm_arch {
>   	struct list_head lpage_disallowed_mmu_pages;
>   	struct kvm_page_track_notifier_node mmu_sp_tracker;
>   	struct kvm_page_track_notifier_head track_notifier_head;
> +	/*
> +	 * Protects marking pages unsync during page faults, as TDP MMU page
> +	 * faults only take mmu_lock for read.  For simplicity, the unsync
> +	 * pages lock is always taken when marking pages unsync regardless of
> +	 * whether mmu_lock is held for read or write.
> +	 */
> +	spinlock_t mmu_unsync_pages_lock;
>   
>   	struct list_head assigned_dev_head;
>   	struct iommu_domain *iommu_domain;
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a272ccbddfa1..cef526dac730 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2596,6 +2596,7 @@ static void kvm_unsync_page(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>   int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>   {
>   	struct kvm_mmu_page *sp;
> +	bool locked = false;
>   
>   	/*
>   	 * Force write-protection if the page is being tracked.  Note, the page
> @@ -2618,9 +2619,34 @@ int mmu_try_to_unsync_pages(struct kvm_vcpu *vcpu, gfn_t gfn, bool can_unsync)
>   		if (sp->unsync)
>   			continue;
>   
> +		/*
> +		 * TDP MMU page faults require an additional spinlock as they
> +		 * run with mmu_lock held for read, not write, and the unsync
> +		 * logic is not thread safe.  Take the spinlock regardless of
> +		 * the MMU type to avoid extra conditionals/parameters, there's
> +		 * no meaningful penalty if mmu_lock is held for write.
> +		 */
> +		if (!locked) {
> +			locked = true;
> +			spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
> +
> +			/*
> +			 * Recheck after taking the spinlock, a different vCPU
> +			 * may have since marked the page unsync.  A false
> +			 * positive on the unprotected check above is not
> +			 * possible as clearing sp->unsync _must_ hold mmu_lock
> +			 * for write, i.e. unsync cannot transition from 1->0
> +			 * while this CPU holds mmu_lock for read (or write).
> +			 */
> +			if (READ_ONCE(sp->unsync))
> +				continue;
> +		}
> +
>   		WARN_ON(sp->role.level != PG_LEVEL_4K);
>   		kvm_unsync_page(vcpu, sp);
>   	}
> +	if (locked)
> +		spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
>   
>   	/*
>   	 * We need to ensure that the marking of unsync pages is visible
> @@ -5604,6 +5630,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>   {
>   	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
>   
> +	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
> +
>   	if (!kvm_mmu_init_tdp_mmu(kvm))
>   		/*
>   		 * No smp_load/store wrappers needed here as we are in
> 

Queued, thanks.

Paolo
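
For readers following the thread, the locking pattern in the quoted
mmu_try_to_unsync_pages() hunk (unlocked check, take the spinlock lazily on
the first candidate, recheck once under the lock) can be sketched in
userspace roughly as below.  This is only an illustration under stated
assumptions, not the kernel code: struct demo_page, unsync_lock, nr_marked,
mark_pages_unsync() and run_demo() are hypothetical names, a pthread mutex
stands in for the kernel spinlock, and C11 atomics stand in for
READ_ONCE().  As in the patch, the flag is never cleared while the walkers
run, which is what makes the unlocked check safe.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_PAGES   4
#define NR_THREADS 8

/* Hypothetical stand-in for struct kvm_mmu_page; only the flag is modeled. */
struct demo_page {
	atomic_bool unsync;
};

static struct demo_page pages[NR_PAGES];
static pthread_mutex_t unsync_lock = PTHREAD_MUTEX_INITIALIZER;
static int nr_marked;		/* written only with unsync_lock held */

/* Models one fault handler walking the candidate pages. */
static void *mark_pages_unsync(void *arg)
{
	bool locked = false;

	(void)arg;

	for (int i = 0; i < NR_PAGES; i++) {
		struct demo_page *p = &pages[i];

		/* Unlocked until the lock is first taken, locked afterwards. */
		if (atomic_load(&p->unsync))
			continue;

		if (!locked) {
			locked = true;
			pthread_mutex_lock(&unsync_lock);

			/* Another thread may have marked the page meanwhile. */
			if (atomic_load(&p->unsync))
				continue;
		}

		/* Only reached with the lock held and the flag still clear. */
		atomic_store(&p->unsync, true);
		nr_marked++;
	}

	if (locked)
		pthread_mutex_unlock(&unsync_lock);

	return NULL;
}

/* Runs NR_THREADS concurrent walkers and returns how many marks happened. */
int run_demo(void)
{
	pthread_t threads[NR_THREADS];

	for (int i = 0; i < NR_PAGES; i++)
		atomic_store(&pages[i].unsync, false);
	nr_marked = 0;

	for (int i = 0; i < NR_THREADS; i++) {
		int ret = pthread_create(&threads[i], NULL, mark_pages_unsync, NULL);

		assert(ret == 0);
		(void)ret;
	}
	for (int i = 0; i < NR_THREADS; i++)
		pthread_join(threads[i], NULL);

	return nr_marked;
}
```

Calling run_demo() from a small main() (built with something like
"cc -pthread") should report exactly NR_PAGES marks however the threads
interleave, since every 0->1 transition happens under the lock after a
check made with the lock held.  Dropping the recheck would let two threads
mark the same page, which is the kind of double accounting the patch is
guarding against.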