linux-kernel - Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <227adbe6e8d82ad4c5a803c117d4231808a0e451.camel@redhat.com>
Date:   Mon, 18 Apr 2022 15:49:25 +0300
From:   Maxim Levitsky <mlevitsk@...hat.com>
To:     Sean Christopherson <seanjc@...gle.com>,
        Paolo Bonzini <pbonzini@...hat.com>
Cc:     Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Gaoning Pan <pgn@....edu.cn>,
        Yongkang Jia <kangel@....edu.cn>
Subject: Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active
 until L1 is active

On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> Defer APICv updates that occur while L2 is active until nested VM-Exit,
> i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
> is active and (a) stomps all over vmcs02 and (b) neglects to ever updated
> vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
> APICv controls), L1 performs nested VM-Enter APICv inhibited, and APICv
> becomes unhibited while L2 is active, KVM will set various APICv controls
> in vmcs02 and trigger a failed VM-Entry.  The kicker is that, unless
> running with nested_early_check=1, KVM blames L1 and chaos ensues.
> 
> The obvious elephant in the room is whether or not KVM needs to disallow
> APICv in L2 if it is inhibited in L1.  Luckily, that's largely a moot
> point as the only dynamic inhibit conditions that affect VMX are Hyper-V
> and IRQ blocking.  IRQ blocking is firmly a debug-only feature, and L1
> probably deserves whatever mess it gets by enabling AutoEOI while running
> L2 with APICv enabled.

Let me explain:
 
a vCPU either runs L1 or it "sort of" runs both L1 and L2 in regard to
how it sends/receives interrupts: 

Sending interrupts while nested: 
   While in guest mode, a vCPU can in theory use both L2's and L1's APIC 
   (if L2 has access to L1's APIC mmio/msrs). 

Receiving interrupts while nested: 
 
   While in guest mode, a vCPU can receive interrupts that target either L1 or L2 code it "runs".
   Later request will hopefully cause it to jump to the the interrupt handler without VMexit,
   while the former will cause VMexit, and let L1 run again.
 
   On AVIC interrupt requests can come from either L1 or L2, while on APICv 
   they have to come from L1, because APICv doesn't yet support nested IPI virtualization.
 
 
Another thing to note is that pretty much, APICv/AVIC inhibition is for 
L1 use of APICv/AVIC only.

In fact I won't object to rename the inhibition functions to explicitly state this.
 
When L2 uses APICv/AVIC, we just safely passthrough its usage to the real hardware.

If we were to to need to inhibit it, we would have to emulate APICv/AVIC so that L1 would
still think that it can use it - thankfully there is no need for that.


So how it all works together:
 
Sending interrupts while nested:
 
   - only L2's avic/apicv is accelerated because its settings are in vmcb/vmcs02.
 
   - if L1 allows L2 to access its APIC, then it accessed through regular MMIO/msr 
     interception and should still work.
     (this reminds me to write a unit test for this)
 
 
Receiving interrupts while nested:
 
  - Interrupts that target L2 are received via APICv/AVIC which is active in vmcb/vmcs02
 
    This APICv/AVIC should never be inhibited because it merely works on behalf of L1's hypervisor,
    which should decide if it wants to inhibit it.
 
    In fact even if L1's APICv/AVIC is always inhibited (e.g when L1 uses x2apic on AMD),
    L2 should still be able to use APICv/AVIC just fine - and it does work *very fine* with my nested AVIC code.
 
    Another way to say it, is that the whole APICv/AVIC inhibition mechanism is 
    in fact L1'1 APICv/AVIC inhibition,
    and there is nothing wrong when L1's APICv/AVIC is inhibited while L2 is running:
 
    That just means that vCPUs which don't run L2, will now stop using AVIC/APICv.
    It will also update the private memslot, which also only L1 uses, while L2 will continue
    using its APICv/AVIC as if nothing happened (it doesn't use this private memslot,
    but should itself map the apic mmio in its own EPT/NPT tables).
 
 
 - Interrupts that target L1: here there is a big difference between APICv and AVIC:

   On APICv, despite the fact that vmcs02 uses L2' APICv, we can still receive them
   as normal interrupts, due to the trick that we use different posted interrupt vectors.

   So for receiving, L1's APICv still "sort of works" while L2 is running,
   and with help of the interrupt handler will cause L2 to vmexit to L1 when such
   interrupt is received.
 
   That is why on APICv there is no need to inhibit L1's APICv when entering nested guest,
   because from peer POV, this vCPU's APICv is running just fine.
 

   On AVIC however, there is only one doorbell, and this is why I have to inhibit the AVIC
   while entering the nested guest, although the update it does to vmcb01 is not really needed,

   but what is needed is to inhibit this vCPU peers from sending it interrupts via AVIC
   (this includes clearing is_running bit in physid table, in IOMMU, and also clearing
   per-vcpu avic enabled bit, so that KVM doesn't send IPIs to this vCPU as well).
 
 
 
Having said all that, delaying L1's APICv inhibition to the next nested VM exit should not
cause any issues, as vmcs01 is not active at all while running nested,
and APICv inhibition is all about changing vmcs01 settings.

(That should be revised when IPI virtualization is merged).
 
On AVIC I don't bother with that, as inhibition explicitly updates vmcb01 and 
L1's physid page and should not have any impact on L2.
 
 
So I am not against this patch, but please update the commit message with the fact that it is all right to 
have L1 APICv inhibited/uninhibited while running nested, even though it is not likely
to happen with APICv.
 
Best regards,
	Maxim Levitsky
 


> 
> Lack of dynamic toggling is also why this scenario is all but impossible
> to encounter in KVM's current form.  But a future patch will pend an
> APICv update request _during_ vCPU creation to plug a race where a vCPU
> that's being created doesn't get included in the "all vCPUs request"
> because it's not yet visible to other vCPUs.  If userspaces restores L2
> after VM creation (hello, KVM selftests), the first KVM_RUN will occur
> while L2 is active and thus service the APICv update request made during
> VM creation.
> 
> Cc: stable@...r.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@...gle.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 5 +++++
>  arch/x86/kvm/vmx/vmx.c    | 5 +++++
>  arch/x86/kvm/vmx/vmx.h    | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a6688663da4d..f5cb18e00e78 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4640,6 +4640,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  	}
>  
> +	if (vmx->nested.update_vmcs01_apicv_status) {
> +		vmx->nested.update_vmcs01_apicv_status = false;
> +		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> +	}
> +
>  	if ((vm_exit_reason != -1) &&
>  	    (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
>  		vmx->nested.need_vmcs12_to_shadow_sync = true;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cf8581978bce..4c407a34b11e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4174,6 +4174,11 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> +	if (is_guest_mode(vcpu)) {
> +		vmx->nested.update_vmcs01_apicv_status = true;
> +		return;
> +	}
> +
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>  	if (cpu_has_secondary_exec_ctrls()) {
>  		if (kvm_vcpu_apicv_active(vcpu))
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 9c6bfcd84008..b98c7e96697a 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -183,6 +183,7 @@ struct nested_vmx {
>  	bool change_vmcs01_virtual_apic_mode;
>  	bool reload_vmcs01_apic_access_page;
>  	bool update_vmcs01_cpu_dirty_logging;
> +	bool update_vmcs01_apicv_status;
>  
>  	/*
>  	 * Enlightened VMCS has been enabled. It does not mean that L1 has to