linux-kernel - Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZxAL6thxEH67CpW7@google.com>
Date: Wed, 16 Oct 2024 11:54:34 -0700
From: Sean Christopherson <seanjc@...gle.com>
To: Chao Gao <chao.gao@...el.com>
Cc: "Markku Ahvenjärvi" <mankku@...il.com>, bp@...en8.de, dave.hansen@...ux.intel.com, 
	hpa@...or.com, janne.karhunen@...il.com, kvm@...r.kernel.org, 
	linux-kernel@...r.kernel.org, mingo@...hat.com, pbonzini@...hat.com, 
	tglx@...utronix.de, x86@...nel.org
Subject: Re: [PATCH 1/1] KVM: nVMX: update VPPR on vmlaunch/vmresume

On Thu, Oct 10, 2024, Chao Gao wrote:
> The issue is that KVM does not properly update vmcs01's SVI. In this case, L1
> does not intercept EOI MSR writes from the deprivileged host (L2), so KVM

Oof.  It's not simply that L1 doesn't intercept EOI, it's also that L1 doesn't
have APICv enabled.

Note, the only reason there aren't a slew of other issues in this setup is that
KVM takes all APICv-related controls from vmcs12.  I.e. if L1 is sharing its APIC
with L2, then KVM will effectively disable APICv when running L2.  Which is
rather unfortunate for the pKVM-on-KVM case (lost performance), but functionally
it's ok.

E.g. if KVM enabled APICv in an APIC-passthrough scenario, then KVM would need
to load the correct EOI bitmaps when running L2, and keep them up-to-date for
both L1 and L2.

> emulates EOI writes by clearing the highest bit in vISR and updating vPPR.
> However, SVI in vmcs01 is not updated, causing it to retain the interrupt vector
> that was just EOI'd. On the next VM-entry to L1, the CPU performs PPR
> virtualization, setting vPPR to SVI & 0xf0, which results in an incorrect vPPR
> 
> Can you try this fix?
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 4a93ac1b9be9..3d24194a648d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -122,6 +122,8 @@
>  #define KVM_REQ_HV_TLB_FLUSH \
>  	KVM_ARCH_REQ_FLAGS(32, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  #define KVM_REQ_UPDATE_PROTECTED_GUEST_STATE	KVM_ARCH_REQ(34)
> +#define KVM_REQ_UPDATE_HWAPIC_ISR \
> +	KVM_ARCH_REQ_FLAGS(35, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
>  
>  #define CR0_RESERVED_BITS                                               \
>  	(~(unsigned long)(X86_CR0_PE | X86_CR0_MP | X86_CR0_EM | X86_CR0_TS \
> @@ -764,6 +766,7 @@ struct kvm_vcpu_arch {
>  	u64 apic_base;
>  	struct kvm_lapic *apic;    /* kernel irqchip context */
>  	bool load_eoi_exitmap_pending;
> +	bool update_hwapic_isr;

Obviously not your fault, but I'm really beginning to hate all of the flags we're
accumulating, e.g. in addition to load_eoi_exitmap_pending and potentially
update_hwapic_isr, VMX also has:

	bool change_vmcs01_virtual_apic_mode;
	bool reload_vmcs01_apic_access_page;
	bool update_vmcs01_cpu_dirty_logging;
	bool update_vmcs01_apicv_status;

Doesn't (and shouldn't) need to be handled as part of this, but I wonder if we
can clean things up a bit by impementing something along the lines of deferred
requests, e.g. nested_vmx_defer_request(NESTED_VMX_REQ_XXX, vcpu).  We'd still
need to add each request, but it would cut down on some of the boilerplate, e.g.

	if (vmx->nested.change_vmcs01_virtual_apic_mode) {
		vmx->nested.change_vmcs01_virtual_apic_mode = false;
		vmx_set_virtual_apic_mode(vcpu);
	}

would become

	if (nested_vmx_check_request(NESTED_VMX_REQ_VAPIC_MODE, vcpu))
		vmx_set_virtual_apic_mode(vcpu);

An alternative idea would be to use KVM_REQ_XXX directly, and shuffle them from
vmx->nested to vcpu, but that would pollute KVM_REQ_XXX for the cases where KVM
doesn't need a "normal" request.

Another idea would be to temporarily switch to vmcs01 as needed, which would
probably be ok for most cases since they are uncommon events, but for EOIs in
this, situation the overhead would be non-trivial and completely unnecessary.

>  	DECLARE_BITMAP(ioapic_handled_vectors, 256);
>  	unsigned long apic_attention;
>  	int32_t apic_arb_prio;
> diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
> index b1eb46e26b2e..a8dad16161e4 100644
> --- a/arch/x86/kvm/kvm_cache_regs.h
> +++ b/arch/x86/kvm/kvm_cache_regs.h
> @@ -220,6 +220,11 @@ static inline void leave_guest_mode(struct kvm_vcpu *vcpu)
>  		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
>  	}
>  
> +	if (vcpu->arch.update_hwapic_isr) {
> +		vcpu->arch.update_hwapic_isr = false;
> +		kvm_make_request(KVM_REQ_UPDATE_HWAPIC_ISR, vcpu);

I don't think we need a new request for this, KVM can refresh SVI directly in
nested_vmx_vmexit(), e.g. along with change_vmcs01_virtual_apic_mode and friends.

> +	}
> +
>  	vcpu->stat.guest_mode = 0;
>  }
>  
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5bb481aefcbc..d6a03c30f085 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -800,6 +800,9 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
>  	if (!__apic_test_and_clear_vector(vec, apic->regs + APIC_ISR))
>  		return;
>  
> +	if (is_guest_mode(apic->vcpu))

As above, I think this needs to be

	if (is_guest_mode(apic->vcpu) && !nested_cpu_has_vid(get_vmcs12(vcpu)))

because if virtual interrupt delivery is enabled, then EOIs are virtualized.
Which means that this needs to be handled in vmx_hwapic_isr_update().

Hmm, actually, there's more to it, I think.  If virtual interrupt deliver is
disabled for L2, then writing vmcs02 is pointless because GUEST_INTR_STATUS is
unused by the CPU.

Argh, but hwapic_isr_update() doesn't take @vcpu.  That's easy enough to fix,
just annoying.

Chao, can you provide your SoB for your code?  If you've no objections, I'll
massage it to avoid using a request and write a changelog, and then post it as
part of a small series.

Thanks again!

> +		apic->vcpu->arch.update_hwapic_isr = true;
> +
>  	/*
>  	 * We do get here for APIC virtualization enabled if the guest
>  	 * uses the Hyper-V APIC enlightenment.  In this case we may need