linux-kernel - Re: [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to yielding

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <cafdfc3d-5e4b-f791-82a3-d669ba2d6039@redhat.com>
Date:   Wed, 15 Dec 2021 18:13:54 +0100
From:   Paolo Bonzini <pbonzini@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org,
        Ignat Korchagin <ignat@...udflare.com>
Subject: Re: [PATCH] KVM: x86/mmu: Don't advance iterator after restart due to
 yielding

On 12/14/21 04:35, Sean Christopherson wrote:
> After dropping mmu_lock in the TDP MMU, restart the iterator during
> tdp_iter_next() and do not advance the iterator.  Advancing the iterator
> results in skipping the top-level SPTE and all its children, which is
> fatal if any of the skipped SPTEs were not visited before yielding.
> 
> When zapping all SPTEs, i.e. when min_level == root_level, restarting the
> iter and then invoking tdp_iter_next() is always fatal if the current gfn
> has as a valid SPTE, as advancing the iterator results in try_step_side()
> skipping the current gfn, which wasn't visited before yielding.
> 
> Sprinkle WARNs on iter->yielded being true in various helpers that are
> often used in conjunction with yielding, and tag the helper with
> __must_check to reduce the probabily of improper usage.
> 
> Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
> SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
> the shadow page will be leaked and KVM will WARN accordingly.
> 
>    WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
>    RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
>    Call Trace:
>     <TASK>
>     kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
>     kvm_destroy_vm+0x162/0x2a0 [kvm]
>     kvm_vcpu_release+0x34/0x60 [kvm]
>     __fput+0x82/0x240
>     task_work_run+0x5c/0x90
>     do_exit+0x364/0xa10
>     ? futex_unqueue+0x38/0x60
>     do_group_exit+0x33/0xa0
>     get_signal+0x155/0x850
>     arch_do_signal_or_restart+0xed/0x750
>     exit_to_user_mode_prepare+0xc5/0x120
>     syscall_exit_to_user_mode+0x1d/0x40
>     do_syscall_64+0x48/0xc0
>     entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
> kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
> marking a struct page as dirty/accessed after it has been put back on the
> free list.  This directly triggers a WARN due to encountering a page with
> page_count() == 0, but it can also lead to data corruption and additional
> errors in the kernel.
> 
>    WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
>    RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
>    Call Trace:
>     <TASK>
>     kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
>     __handle_changed_spte+0x92e/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     __handle_changed_spte+0x63c/0xca0 [kvm]
>     zap_gfn_range+0x549/0x620 [kvm]
>     kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
>     mmu_free_root_page+0x219/0x2c0 [kvm]
>     kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
>     kvm_mmu_unload+0x1c/0xa0 [kvm]
>     kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
>     kvm_put_kvm+0x3b1/0x8b0 [kvm]
>     kvm_vcpu_release+0x4e/0x70 [kvm]
>     __fput+0x1f7/0x8c0
>     task_work_run+0xf8/0x1a0
>     do_exit+0x97b/0x2230
>     do_group_exit+0xda/0x2a0
>     get_signal+0x3be/0x1e50
>     arch_do_signal_or_restart+0x244/0x17f0
>     exit_to_user_mode_prepare+0xcb/0x120
>     syscall_exit_to_user_mode+0x1d/0x40
>     do_syscall_64+0x4d/0x90
>     entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM:
> x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
> tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
> incorrectly advance past a top-level entry when yielding on a lower-level
> entry.  But with respect to leaking shadow pages, the bug was introduced
> by yielding before processing the current gfn.
> 
> Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
> callers could jump to their "retry" label.  The downside of that approach
> is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
> in the loop, and there's no easy way to enfornce that requirement.
> 
> Ideally, KVM would handling the cond_resched() fully within the iterator
> macro (the code is actually quite clean) and avoid this entire class of
> bugs, but that is extremely difficult do while also supporting yielding
> after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
> SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
> bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
> may block operations on the SPTE for a significant amount of time.
> 
> Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
> Fixes: 1af4a96025b3 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
> Reported-by: Ignat Korchagin <ignat@...udflare.com>
> Cc: stable@...r.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>   arch/x86/kvm/mmu/tdp_iter.c |  6 ++++++
>   arch/x86/kvm/mmu/tdp_iter.h |  6 ++++++
>   arch/x86/kvm/mmu/tdp_mmu.c  | 29 ++++++++++++++++-------------
>   3 files changed, 28 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_iter.c b/arch/x86/kvm/mmu/tdp_iter.c
> index b3ed302c1a35..caa96c270b95 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.c
> +++ b/arch/x86/kvm/mmu/tdp_iter.c
> @@ -26,6 +26,7 @@ static gfn_t round_gfn_for_level(gfn_t gfn, int level)
>    */
>   void tdp_iter_restart(struct tdp_iter *iter)
>   {
> +	iter->yielded = false;
>   	iter->yielded_gfn = iter->next_last_level_gfn;
>   	iter->level = iter->root_level;
>   
> @@ -160,6 +161,11 @@ static bool try_step_up(struct tdp_iter *iter)
>    */
>   void tdp_iter_next(struct tdp_iter *iter)
>   {
> +	if (iter->yielded) {
> +		tdp_iter_restart(iter);
> +		return;
> +	}
> +
>   	if (try_step_down(iter))
>   		return;
>   
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index b1748b988d3a..e19cabbcb65c 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -45,6 +45,12 @@ struct tdp_iter {
>   	 * iterator walks off the end of the paging structure.
>   	 */
>   	bool valid;
> +	/*
> +	 * True if KVM dropped mmu_lock and yielded in the middle of a walk, in
> +	 * which case tdp_iter_next() needs to restart the walk at the root
> +	 * level instead of advancing to the next entry.
> +	 */
> +	bool yielded;
>   };
>   
>   /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b69e47e68307..7b1bc816b7c3 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -502,6 +502,8 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>   					   struct tdp_iter *iter,
>   					   u64 new_spte)
>   {
> +	WARN_ON_ONCE(iter->yielded);
> +
>   	lockdep_assert_held_read(&kvm->mmu_lock);
>   
>   	/*
> @@ -575,6 +577,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>   				      u64 new_spte, bool record_acc_track,
>   				      bool record_dirty_log)
>   {
> +	WARN_ON_ONCE(iter->yielded);
> +
>   	lockdep_assert_held_write(&kvm->mmu_lock);
>   
>   	/*
> @@ -640,18 +644,19 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>    * If this function should yield and flush is set, it will perform a remote
>    * TLB flush before yielding.
>    *
> - * If this function yields, it will also reset the tdp_iter's walk over the
> - * paging structure and the calling function should skip to the next
> - * iteration to allow the iterator to continue its traversal from the
> - * paging structure root.
> + * If this function yields, iter->yielded is set and the caller must skip to
> + * the next iteration, where tdp_iter_next() will reset the tdp_iter's walk
> + * over the paging structures to allow the iterator to continue its traversal
> + * from the paging structure root.
>    *
> - * Return true if this function yielded and the iterator's traversal was reset.
> - * Return false if a yield was not needed.
> + * Returns true if this function yielded.
>    */
> -static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
> -					     struct tdp_iter *iter, bool flush,
> -					     bool shared)
> +static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
> +							  struct tdp_iter *iter,
> +							  bool flush, bool shared)
>   {
> +	WARN_ON(iter->yielded);
> +
>   	/* Ensure forward progress has been made before yielding. */
>   	if (iter->next_last_level_gfn == iter->yielded_gfn)
>   		return false;
> @@ -671,12 +676,10 @@ static inline bool tdp_mmu_iter_cond_resched(struct kvm *kvm,
>   
>   		WARN_ON(iter->gfn > iter->next_last_level_gfn);
>   
> -		tdp_iter_restart(iter);
> -
> -		return true;
> +		iter->yielded = true;
>   	}
>   
> -	return false;
> +	return iter->yielded;
>   }
>   
>   /*
> 

Queued, thnaks.

Paolo