Message-ID: <ZtDU3x9t66BnpPt_@google.com>
Date: Thu, 29 Aug 2024 13:06:55 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Vipin Sharma <vipinsh@...gle.com>
Cc: pbonzini@...hat.com, dmatlack@...gle.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 4/4] KVM: x86/mmu: Recover TDP MMU NX huge pages using
 MMU read lock

On Thu, Aug 29, 2024, Vipin Sharma wrote:
> @@ -1806,7 +1832,7 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  	struct kvm_mmu_page *sp;
>  	bool flush = false;
>  
> -	lockdep_assert_held_write(&kvm->mmu_lock);
> +	lockdep_assert_held_read(&kvm->mmu_lock);
>  	/*
>  	 * Zapping TDP MMU shadow pages, including the remote TLB flush, must
>  	 * be done under RCU protection, because the pages are freed via RCU
> @@ -1815,8 +1841,11 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  	rcu_read_lock();
>  
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages))
> +		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +		if (list_empty(&kvm->arch.tdp_mmu_possible_nx_huge_pages)) {
> +			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  			break;
> +		}
>  
>  		/*
>  		 * We use a separate list instead of just using active_mmu_pages
> @@ -1832,16 +1861,29 @@ void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
>  		WARN_ON_ONCE(!sp->role.direct);
>  
>  		/*
> -		 * Unaccount and do not attempt to recover any NX Huge Pages
> -		 * that are being dirty tracked, as they would just be faulted
> -		 * back in as 4KiB pages. The NX Huge Pages in this slot will be
> +		 * Unaccount the shadow page before zapping its SPTE so as to
> +		 * avoid bouncing tdp_mmu_pages_lock more than is necessary.
> +		 * Clearing nx_huge_page_disallowed before zapping is safe, as
> +		 * the flag doesn't protect against iTLB multi-hit, it's there
> +		 * purely to prevent bouncing the gfn between an NX huge page
> +		 * and an X small page. A vCPU could fault on the region
> +		 * before the SPTE is zapped, replace the shadow page with an
> +		 * NX huge page, and get stuck tearing down the child SPTEs,
> +		 * but that is a rare race, i.e. shouldn't impact performance.
> +		 */
> +		unaccount_nx_huge_page(kvm, sp);
> +		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +
> +		/*
> +		 * Don't bother zapping shadow pages if the memslot is being
> +		 * dirty logged, as the relevant pages would just be faulted back
> +		 * in as 4KiB pages. Potential NX Huge Pages in this slot will be
>  		 * recovered, along with all the other huge pages in the slot,
>  		 * when dirty logging is disabled.
>  		 */
> -		if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> -			unaccount_nx_huge_page(kvm, sp);
> -		else
> -			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
> +		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
> +			flush |= tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
>  		WARN_ON_ONCE(sp->nx_huge_page_disallowed);
>  
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {

Tsk, tsk.  There's a cond_resched_rwlock_write() lurking here.

Which is a decent argument/segue for my main comment: I would very, very strongly
prefer to have a single flow for the control logic.  Almost all of the code is
copy+pasted to the TDP MMU, and there are unnecessary and confusing differences
between the two flows.  E.g. the TDP MMU does unaccount_nx_huge_page() preemptively,
while the shadow MMU does not.

The TDP MMU has a few extra "locks" (rcu_read_lock() + tdp_mmu_pages_lock), but
I don't see any harm in taking those in the shadow MMU flow.  KVM holds a spinlock
and so RCU is practically meaningless, and tdp_mmu_pages_lock will never be
contended since it's only ever taken under mmu_lock.

If we _really_ wanted to bury tdp_mmu_pages_lock in the TDP MMU, it could be
passed in as an optional parameter.  I'm not super opposed to that, it just looks
ugly, and I'm not convinced it's worth the effort.  Maybe a middle ground would
be to provide a helper so that tdp_mmu_pages_lock can stay 64-bit only, but this
code doesn't need #ifdefs?
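
E.g. a minimal sketch of what I mean by such a helper, assuming tdp_mmu_pages_lock
stays 64-bit only (the helper names below are just placeholders):

static void kvm_possible_nx_huge_pages_lock(struct kvm *kvm)
{
	/*
	 * tdp_mmu_pages_lock only exists (and is only needed) on 64-bit,
	 * where the TDP MMU can modify the list while holding mmu_lock for
	 * read.  32-bit has only the shadow MMU, which holds mmu_lock for
	 * write, so there's nothing extra to take.
	 */
#ifdef CONFIG_X86_64
	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
#endif
}

static void kvm_possible_nx_huge_pages_unlock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
#endif
}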

Anyways, if the helper is __always_inline, there shouldn't be an indirect call
to recover_page().  Completely untested, but this is the basic gist.

---
typedef bool (*nx_recover_page_t)(struct kvm *kvm, struct kvm_mmu_page *sp,
				  struct list_head *invalid_pages);

static __always_inline void kvm_recover_nx_huge_pages(struct kvm *kvm,
						      bool shared,
						      struct list_head *pages,
						      unsigned long nr_pages,
						      nx_recover_page_t recover_page)
{
	unsigned int ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
	unsigned long to_zap = ratio ? DIV_ROUND_UP(nr_pages, ratio) : 0;
	struct kvm_mmu_page *sp;
	LIST_HEAD(invalid_list);
	int rcu_idx;
	bool flush = false;

	kvm_lockdep_assert_mmu_lock_held(kvm, shared);

	rcu_idx = srcu_read_lock(&kvm->srcu);
	rcu_read_lock();

	for ( ; to_zap; --to_zap) {
		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
		if (list_empty(pages)) {
			spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
			break;
		}

		sp = list_first_entry(pages, struct kvm_mmu_page,
				      possible_nx_huge_page_link);
		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
		WARN_ON_ONCE(!sp->role.direct);

		unaccount_nx_huge_page(kvm, sp);
		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);

		if (!kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
			flush |= recover_page(kvm, sp, &invalid_list);

		WARN_ON_ONCE(sp->nx_huge_page_disallowed);

		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
			kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

			rcu_read_unlock();

			if (shared)
				cond_resched_rwlock_read(&kvm->mmu_lock);
			else
				cond_resched_rwlock_write(&kvm->mmu_lock);

			/* The pending flush, if any, was performed above. */
			flush = false;
			rcu_read_lock();
		}
	}

	kvm_mmu_remote_flush_or_zap(kvm, &invalid_list, flush);

	rcu_read_unlock();
	srcu_read_unlock(&kvm->srcu, rcu_idx);
}
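
And then e.g. the TDP MMU side becomes a thin wrapper, something like (also
completely untested; the count passed for nr_pages is just a stand-in for
whatever this series ends up using to size the TDP MMU batch):

static bool tdp_mmu_recover_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
					 struct list_head *invalid_list)
{
	/* The TDP MMU zaps and flushes directly, it never uses invalid_list. */
	return tdp_mmu_zap_possible_nx_huge_page(kvm, sp);
}

void kvm_tdp_mmu_recover_nx_huge_pages(struct kvm *kvm)
{
	kvm_recover_nx_huge_pages(kvm, true,
				  &kvm->arch.tdp_mmu_possible_nx_huge_pages,
				  kvm->stat.nx_lpage_splits,
				  tdp_mmu_recover_nx_huge_page);
}

Since recover_page is a compile-time constant in each caller and the common
helper is __always_inline, the compiler should turn that into a direct call.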
