linux-kernel - Re: [PATCH 0/2] KVM: x86: Fix and cleanup for recent AVIC changes


Message-ID: <YWmpKTk/7MOCzm15@google.com>
Date:   Fri, 15 Oct 2021 16:15:37 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Maxim Levitsky <mlevitsk@...hat.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/2] KVM: x86: Fix and cleanup for recent AVIC changes

On Tue, Oct 12, 2021, Maxim Levitsky wrote:
> On Mon, 2021-10-11 at 16:58 +0000, Sean Christopherson wrote:
> > Argh, I forgot the memslot is still there, so the access won't be treated as MMIO
> > and thus won't end up in the MMIO cache.
> > 
> > So I agree that the code is functionally ok, but I'd still prefer to switch to
> > kvm_vcpu_apicv_active() so that this code is coherent with respect to the APICv
> > status at the time the fault occurred.
> > 
> > My objection to using kvm_apicv_activated() is that the result is completely
> > non-deterministic with respect to the vCPU's APICv status at the time of the
> > fault.  It works because all of the other mechanisms that are in place, e.g.
> > elevating the MMU notifier count, but the fact that the result is non-deterministic
> > means that using the per-vCPU status is also functionally ok.
> 
> The problem is that it is just not correct to use local AVIC enable state
> to determine if we want to populate the SPTE or just jump to the emulation.
> 
> 
> For example, assuming that the AVIC is now enabled on all vCPUs,
> we can have this scenario:
> 
>     vCPU0                                   vCPU1
>     =====                                   =====
> 
> - disable AVIC
> - VMRUN
>                                         - #NPT on AVIC MMIO access
>                                         - *stuck on something prior to the page fault code*
> - enable AVIC
> - VMRUN
>                                         - *still stuck on something prior to the page fault code*
> 
> - disable AVIC:
> 
>   - raise KVM_REQ_APICV_UPDATE request
> 					
>   - set global avic state to disable
> 
>   - zap the SPTE (does nothing, doesn't race
>     with anything either)
> 
>   - handle KVM_REQ_APICV_UPDATE -
>     - disable vCPU0 AVIC
> 
> - VMRUN
>                                         - *still stuck on something prior to the page fault code*
> 
>                                                             ...
>                                                             ...
>                                                             ...
> 
>                                         - now vCPU1 finally starts running the page fault code.
> 
>                                         - vCPU1 AVIC is still enabled 
>                                           (because vCPU1 never handled KVM_REQ_APICV_UPDATE),
>                                           so the page fault code will populate the SPTE.

But vCPU1 won't install the SPTE if it loses the race to acquire mmu_lock, because
kvm_zap_gfn_range() bumps the notifier sequence and so vCPU1 will retry the fault.
If vCPU1 wins the race, i.e. sees the same sequence number, then the zap is
guaranteed to find the newly-installed SPTE.
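
To make the two orderings concrete, here is a minimal single-threaded sketch of
the sequence check.  It is a toy model, not KVM code: struct toy_mmu and the
toy_* helpers are hypothetical names, a bool stands in for mmu_lock, and a
plain counter stands in for the notifier sequence.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; these are not the real KVM structures. */
struct toy_mmu {
	unsigned long seq;	/* stands in for the MMU notifier sequence */
	bool lock_held;		/* stands in for mmu_lock */
	bool spte_present;	/* is the APIC access page currently mapped? */
};

enum toy_fault_result { TOY_PF_RETRY, TOY_PF_INSTALLED };

/* Fault path, step 1: snapshot the sequence before taking the lock. */
static unsigned long toy_fault_begin(const struct toy_mmu *mmu)
{
	return mmu->seq;
}

/* Fault path, step 2: under the lock, install only if no zap intervened. */
static enum toy_fault_result toy_fault_finish(struct toy_mmu *mmu,
					      unsigned long snap)
{
	enum toy_fault_result r = TOY_PF_RETRY;

	assert(!mmu->lock_held);
	mmu->lock_held = true;
	if (mmu->seq == snap) {
		mmu->spte_present = true;
		r = TOY_PF_INSTALLED;
	}
	mmu->lock_held = false;
	return r;
}

/* Zap path: under the lock, drop the mapping and bump the sequence. */
static void toy_zap(struct toy_mmu *mmu)
{
	assert(!mmu->lock_held);
	mmu->lock_held = true;
	mmu->spte_present = false;
	mmu->seq++;
	mmu->lock_held = false;
}
```

If toy_zap() runs between toy_fault_begin() and toy_fault_finish(), the
snapshot is stale and the fault returns TOY_PF_RETRY without installing
anything.  If the fault finishes first, the later toy_zap() removes the mapping
it installed.  Either way no stale mapping survives the zap.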

And IMO, retrying is the desired behavior.  Installing a SPTE based on the global
state works, but it's all kinds of weird to knowingly take an action that directly
contradicts the current vCPU state.

FWIW, I had gone so far as to type this up to handle the situation you described
before remembering the sequence interaction.

		/*
		 * If the APIC access page exists but is disabled, go directly
		 * to emulation without caching the MMIO access or creating a
		 * MMIO SPTE.  That way the cache doesn't need to be purged
		 * when the AVIC is re-enabled.
		 */
		if (slot && slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
			/*
			 * Retry the fault if an APICv update is pending, as
			 * the kvm_zap_gfn_range() when APICv becomes inhibited
			 * may have already occurred, in which case installing
			 * a SPTE would be incorrect.
			 */
			if (!kvm_vcpu_apicv_active(vcpu)) {
				*r = RET_PF_EMULATE;
				return true;
			} else if (kvm_test_request(KVM_REQ_APICV_UPDATE, vcpu)) {
				*r = RET_PF_RETRY;
				return true;
			}
		}
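
For reference, the three outcomes of that check can be written as a standalone
function.  This is only a sketch for reasoning about the cases: enum toy_pf and
toy_apic_fault() are hypothetical names, and the two bools stand in for
kvm_vcpu_apicv_active() and a pending KVM_REQ_APICV_UPDATE.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome names; TOY_CONTINUE means "fall through to the
 * normal fault path", which may go on to install the SPTE. */
enum toy_pf { TOY_EMULATE, TOY_RETRY, TOY_CONTINUE };

static enum toy_pf toy_apic_fault(bool vcpu_apicv_active, bool update_pending)
{
	/* APICv is off on this vCPU: emulate, create no mapping. */
	if (!vcpu_apicv_active)
		return TOY_EMULATE;

	/* APICv is on but an update is pending: the zap may already have
	 * happened, so retry rather than install. */
	if (update_pending)
		return TOY_RETRY;

	return TOY_CONTINUE;
}
```

Note that the inactive case wins even when an update is pending, matching the
order of the branches in the snippet above.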

>                                         - handle KVM_REQ_APICV_UPDATE
>                                            - finally disable vCPU1 AVIC
> 
>                                         - VMRUN (vCPU1 AVIC disabled, SPTE populated)
> 
> 					                 ***boom***