lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Thu, 22 Jul 2021 19:06:44 +0000
From:   Sean Christopherson <>
To:     Maxim Levitsky <>
        "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)" 
        <>, Jim Mattson <>,
        Joerg Roedel <>, Borislav Petkov <>,
        Vitaly Kuznetsov <>,
        Wanpeng Li <>,
        Paolo Bonzini <>,
        Thomas Gleixner <>,
        "H. Peter Anvin" <>, Ingo Molnar <>,
        "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <>,
        Ben Gardon <>
Subject: Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when
 AutoEOI feature is in use


On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> > 
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> > 
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> > also give us a good excuse to rename try_async_pf() :-)
> > 
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> > 
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess.
> Hi!
> I am testing your approach and it actually works very well! I can't seem to break it.
> Could you explain why do I need to do something with kvm_{inc,dec}_notifier_count()) ?

Glad you asked, there's one more change needed.  kvm_zap_gfn_range() currently
takes mmu_lock for read, but it needs to take mmu_lock for write for this case
(more way below).

The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy.  The
whole thing is a grey area because KVM is trying to ensure it honors the guest's
UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
same transition.  So in practice there's likely no observable bug, but it also
means that taking mmu_lock for read is likely pointless, because for things to
work the guest has to serialize all running vCPUs.

Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()?  It would
effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
operate under the mmu read lock"); see attached patch.  And we could even bump
the notifier count in that helper, e.g. on top of the attached:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b607e8763aa2..7174058e982b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)


+       kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
        if (kvm_memslots_have_rmaps(kvm)) {
                for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
                        slots = __kvm_memslots(kvm, i);
@@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
        if (flush)
                kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);

+       kvm_dec_notifier_count(kvm, gfn_start, gfn_end);

Back to Maxim's original question...

Elevating mmu_notifier_count and bumping mmu_notifier_seq will will handle the case
where APICv is being disabled while a different vCPU is concurrently faulting in a
new mapping for the APIC page.  E.g. it handles this race:

 vCPU0                                 vCPU1
                                       apic_access_memslot_enabled = true;
 			               #NPF on APIC
			               apic_access_memslot_enabled==true, proceed with #NPF
 apic_access_memslot_enabled = false 

 mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate

The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
to be called, or just finised) and/or a modified mmu_notifier_seq (after the
count was decremented).

This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
to run in parallel with the page fault handler, there's no guarantee that the
correct apic_access_memslot_enabled will be observed.

	if (is_tdp_mmu_fault)

	if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
		goto out_unlock;

	if (is_tdp_mmu_fault)
		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
				    pfn, prefault);
		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
				 prefault, is_tdp);

View attachment "0001-Revert-KVM-x86-mmu-Allow-zap-gfn-range-to-operate-un.patch" of type "text/x-diff" (4674 bytes)

Powered by blists - more mailing lists