Message-ID: <CADrL8HW46uQQKYUngYwomzfKWB0Vf4nG1WRjZu84hiXxtHN14Q@mail.gmail.com>
Date: Mon, 28 Jul 2025 11:07:37 -0700
From: James Houghton <jthoughton@...gle.com>
To: Sean Christopherson <seanjc@...gle.com>
Cc: Paolo Bonzini <pbonzini@...hat.com>, Vipin Sharma <vipinsh@...gle.com>,
David Matlack <dmatlack@...gle.com>, kvm@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 3/7] KVM: x86/mmu: Recover TDP MMU NX huge pages using MMU read lock
On Wed, Jul 23, 2025 at 1:35 PM Sean Christopherson <seanjc@...gle.com> wrote:
>
> On Mon, Jul 07, 2025, James Houghton wrote:
> > From: Vipin Sharma <vipinsh@...gle.com>
> >
> > Use MMU read lock to recover TDP MMU NX huge pages. Iterate
>
> Wrap at ~75 chars.
Oh, I did indeed wrap the text pretty aggressively for this patch. Not
sure why that happened.
>
> > over the huge pages list under tdp_mmu_pages_lock protection and
> > unaccount the page before dropping the lock.
> >
> > We must not zap an SPTE if:
>
> No pronouns!
Right.
>
> > - The SPTE is a root page.
> > - The SPTE does not point at the SP's page table.
> >
> > If the SPTE does not point at the SP's page table, then something else
> > has change the SPTE, so we cannot safely zap it.
> >
> > Warn if zapping SPTE fails and current SPTE is still pointing to same
> > page table. This should never happen.
> >
> > There is always a race between dirty logging, vCPU faults, and NX huge
> > page recovery for backing a gfn by an NX huge page or an executable
> > small page. Unaccounting sooner during the list traversal is increasing
> > the window of that race. Functionally, it is okay, because accounting
> > doesn't protect against iTLB multi-hit bug, it is there purely to
> > prevent KVM from bouncing a gfn between two page sizes. The only
> > downside is that a vCPU will end up doing more work in tearing down all
> > the child SPTEs. This should be a very rare race.
> >
> > Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> > read lock. This optimization is done to solve a guest jitter issue on
> > Windows VM which was observing an increase in network latency.
>
> With slight tweaking:
>
> Use MMU read lock to recover TDP MMU NX huge pages. To prevent
> concurrent modification of the list of potential huge pages, iterate over
> the list under tdp_mmu_pages_lock protection and unaccount the page
> before dropping the lock.
>
> Zapping under MMU read lock unblocks vCPUs which are waiting for MMU
> read lock, which solves a guest jitter issue on Windows VMs which were
> observing an increase in network latency.
>
> Do not zap an SPTE if:
> - The SPTE is a root page.
> - The SPTE does not point at the SP's page table.
>
> If the SPTE does not point at the SP's page table, then something else
> has change the SPTE, so KVM cannot safely zap it.
"has changed" (my mistake)
>
> Warn if zapping SPTE fails and current SPTE is still pointing to same
> page table, as it should be impossible for the CMPXCHG to fail due to all
> other write scenarios being mutually exclusive.
>
> There is always a race between dirty logging, vCPU faults, and NX huge
> page recovery for backing a gfn by an NX huge page or an executable
> small page. Unaccounting sooner during the list traversal increases the
> window of that race, but functionally, it is okay. Accounting doesn't
> protect against iTLB multi-hit bug, it is there purely to prevent KVM
> from bouncing a gfn between two page sizes. The only downside is that a
> vCPU will end up doing more work in tearing down all the child SPTEs.
> This should be a very rare race.
Thanks, this is much better.
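For the archives, the zap-side checks described above end up looking roughly
like the below in the TDP MMU helper. This is a sketch rather than the exact
diff, so the helper name and details are approximate:

static bool tdp_mmu_zap_possible_nx_huge_page(struct kvm *kvm,
                                              struct kvm_mmu_page *sp)
{
        struct tdp_iter iter = {
                .old_spte = sp->ptep ? kvm_tdp_mmu_read_spte(sp->ptep) : 0,
                .sptep = sp->ptep,
                .level = sp->role.level + 1,
                .gfn = sp->gfn,
                .as_id = kvm_mmu_page_as_id(sp),
        };

        lockdep_assert_held_read(&kvm->mmu_lock);

        /* Root pages have no parent SPTE to zap. */
        if (WARN_ON_ONCE(!sp->ptep))
                return false;

        /*
         * If the SPTE no longer points at this SP's page table, something
         * else has changed it (zapped or replaced it), so it cannot safely
         * be zapped here.  Note, spte_to_child_pt() returns NULL for
         * non-present SPTEs, so this also covers an already-zapped entry.
         */
        if (spte_to_child_pt(iter.old_spte, iter.level) != (tdp_ptep_t)sp->spt)
                return false;

        if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE)) {
                /*
                 * The CMPXCHG can only fail if something else modified the
                 * SPTE, in which case it should no longer point at this SP's
                 * page table; all other writers hold mmu_lock for write.
                 * iter.old_spte holds the current value after a failed
                 * CMPXCHG.
                 */
                WARN_ON_ONCE(spte_to_child_pt(iter.old_spte, iter.level) ==
                             (tdp_ptep_t)sp->spt);
                return false;
        }

        return true;
}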
>
> > Suggested-by: Sean Christopherson <seanjc@...gle.com>
> > Signed-off-by: Vipin Sharma <vipinsh@...gle.com>
> > Co-developed-by: James Houghton <jthoughton@...gle.com>
> > Signed-off-by: James Houghton <jthoughton@...gle.com>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 107 ++++++++++++++++++++++++-------------
> > arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++---
> > 2 files changed, 105 insertions(+), 44 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b074f7bb5cc58..7df1b4ead705b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -7535,12 +7535,40 @@ static unsigned long nx_huge_pages_to_zap(struct kvm *kvm,
> > return ratio ? DIV_ROUND_UP(pages, ratio) : 0;
> > }
> >
> > +static bool kvm_mmu_sp_dirty_logging_enabled(struct kvm *kvm,
> > + struct kvm_mmu_page *sp)
> > +{
> > + struct kvm_memory_slot *slot = NULL;
> > +
> > + /*
> > + * Since gfn_to_memslot() is relatively expensive, it helps to skip it if
> > + * the test cannot possibly return true. On the other hand, if any
> > + * memslot has logging enabled, chances are good that all of them do, in
> > + * which case unaccount_nx_huge_page() is much cheaper than zapping the
> > + * page.
>
> And largely irrelevant, because KVM should unaccount the NX no matter what. I
> kinda get what you're saying, but honestly it adds a lot of confusion, especially
> since unaccount_nx_huge_page() is in the caller.
>
> > + *
> > + * If a memslot update is in progress, reading an incorrect value of
> > + * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> > + * zero, gfn_to_memslot() will be done unnecessarily; if it is becoming
> > + * nonzero, the page will be zapped unnecessarily. Either way, this only
> > + * affects efficiency in racy situations, and not correctness.
> > + */
> > + if (atomic_read(&kvm->nr_memslots_dirty_logging)) {
>
> Short-circuit the function to decrease indentation, and so that "slot" doesn't
> need to be NULL-initialized.
>
> > + struct kvm_memslots *slots;
> > +
> > + slots = kvm_memslots_for_spte_role(kvm, sp->role);
> > + slot = __gfn_to_memslot(slots, sp->gfn);
>
> Then this can be:
>
> slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
>
> without creating a stupid-long line.
>
> > + WARN_ON_ONCE(!slot);
>
> And then:
>
> if (WARN_ON_ONCE(!slot))
> return false;
>
> return kvm_slot_dirty_track_enabled(slot);
>
> With a comment cleanup:
>
> struct kvm_memory_slot *slot;
>
> /*
> * Skip the memslot lookup if dirty tracking can't possibly be enabled,
> * as memslot lookups are relatively expensive.
> *
> * If a memslot update is in progress, reading an incorrect value of
> * kvm->nr_memslots_dirty_logging is not a problem: if it is becoming
> * zero, KVM will do an unnecessary memslot lookup; if it is becoming
> * nonzero, the page will be zapped unnecessarily. Either way, this
> * only affects efficiency in racy situations, and not correctness.
> */
> if (!atomic_read(&kvm->nr_memslots_dirty_logging))
> return false;
>
> slot = __gfn_to_memslot(kvm_memslots_for_spte_role(kvm, sp->role), sp->gfn);
> if (WARN_ON_ONCE(!slot))
> return false;
>
> return kvm_slot_dirty_track_enabled(slot);
LGTM, thanks!
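In the recovery loop the check then reads pretty cleanly, something along
these lines (again a sketch, not the literal diff; the actual zap call
differs between the shadow MMU and TDP MMU paths):

                /*
                 * Skip SPs in slots that are being dirty tracked: zapping
                 * them would just get the region faulted back in as 4KiB
                 * pages.  They are recovered along with the rest of the
                 * slot when dirty logging is disabled.
                 */
                if (kvm_mmu_sp_dirty_logging_enabled(kvm, sp))
                        continue;

                /* ... zap sp via the appropriate MMU path ... */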
> > @@ -7559,8 +7590,17 @@ static void kvm_recover_nx_huge_pages(struct kvm *kvm,
> > rcu_read_lock();
> >
> > for ( ; to_zap; --to_zap) {
> > - if (list_empty(nx_huge_pages))
> > +#ifdef CONFIG_X86_64
>
> These #ifdefs still make me sad, but I also still think they're the least awful
> solution. And hopefully we will jettison 32-bit sooner than later :-)
Yeah I couldn't come up with anything better. :(
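To spell out the shape of the loop for anyone skimming the archive, the
#ifdefs only guard the tdp_mmu_pages_lock usage, since that lock (and the
TDP MMU itself) only exists on 64-bit builds. Roughly the following, assuming
a "tdp_mmu" flag (or equivalent) distinguishes the two paths:

        for ( ; to_zap; --to_zap) {
#ifdef CONFIG_X86_64
                /* tdp_mmu_pages_lock only exists when the TDP MMU is built. */
                if (tdp_mmu)
                        spin_lock(&kvm->arch.tdp_mmu_pages_lock);
#endif
                if (list_empty(nx_huge_pages)) {
#ifdef CONFIG_X86_64
                        if (tdp_mmu)
                                spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
#endif
                        break;
                }

                sp = list_first_entry(nx_huge_pages, struct kvm_mmu_page,
                                      possible_nx_huge_page_link);

                /*
                 * Unaccount before dropping the list lock so that no other
                 * task re-selects this SP.
                 */
                unaccount_nx_huge_page(kvm, sp);
#ifdef CONFIG_X86_64
                if (tdp_mmu)
                        spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
#endif

                /* ... check dirty logging, zap sp, cond_resched, etc. ... */
        }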