linux-kernel - Re: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <YUIunxwjea/wq3gd@google.com>
Date:   Wed, 15 Sep 2021 17:34:23 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Vitaly Kuznetsov <vkuznets@...hat.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org, Reiji Watanabe <reijiw@...gle.com>
Subject: Re: [PATCH 2/3] KVM: VMX: Move RESET emulation to vmx_vcpu_reset()

On Wed, Sep 15, 2021, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@...gle.com> writes:
> > +static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
> > +{
> > +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > +
> > +	init_vmcs(vmx);
> > +
> > +	if (nested)
> > +		memcpy(&vmx->nested.msrs, &vmcs_config.nested, sizeof(vmx->nested.msrs));
> > +
> > +	vcpu_setup_sgx_lepubkeyhash(vcpu);
> > +
> > +	vmx->nested.posted_intr_nv = -1;
> > +	vmx->nested.current_vmptr = -1ull;
> > +	vmx->nested.hv_evmcs_vmptr = EVMPTR_INVALID;
> 
> What would happen in (hypothetical) case when enlightened VMCS is
> currently in use? If we zap 'hv_evmcs_vmptr' here, the consequent
> nested_release_evmcs() (called from
> nested_vmx_handle_enlightened_vmptrld(), for example) will not do 
> kvm_vcpu_unmap() while it should.

The short answer is that there's a lot of stuff that needs to be addressed before
KVM can expose a RESET ioctl().  My goal with these patches is to carve out the
stubs and move the few bits of RESET emulation into the "stubs".  This is the same
answer for the MSR question/comment at the end.

> This, however, got me thinking: should we free all-things-nested with
> free_nested()/nested_vmx_free_vcpu() upon vcpu reset? I can't seem to
> find us doing that... (I do remember that INIT is blocked in VMX-root
> mode and nobody else besides kvm_arch_vcpu_create()/
> kvm_apic_accept_events() seems to call kvm_vcpu_reset()) but maybe we
> should at least add a WARN_ON() guardian here?

I think that makes sense.  Maybe use CR0 as a sentinel since it has a non-zero
RESET value?  E.g. WARN if CR0 is non-zero at RESET.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..3ac074376821 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10813,6 +10813,11 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        unsigned long new_cr0;
        u32 eax, dummy;

+       /*
+        * <comment about KVM not supporting arbitrary RESET>
+        */
+       WARN_ON_ONCE(!init_event && old_cr0);
+
        kvm_lapic_reset(vcpu, init_event);

        vcpu->arch.hflags = 0;

Huh, typing that out made me realize commit 0aa1837533e5 ("KVM: x86: Properly
reset MMU context at vCPU RESET/INIT") technically introduced a bug.  kvm_vcpu_reset()
does kvm_read_cr0() and thus reads vmcs.GUEST_CR0 because vcpu->arch.regs_avail is
(correctly) not stuffed to ALL_ONES until later in kvm_vcpu_reset().  init_vmcs()
doesn't explicitly zero vmcs.GUEST_CR0 (along with many other guest fields), and
so VMREAD(GUEST_CR0) is technically consuming garbage.  In practice, it's consuming
'0' because no known CPU or VMM inverts values in the VMCS, i.e. zero allocating
the VMCS is functionally equivalent to writing '0' to all fields via VMWRITE.

And staring more at kvm_vcpu_reset(), this code is terrifying for INIT

	memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
	vcpu->arch.regs_avail = ~0;
	vcpu->arch.regs_dirty = ~0;

because it means cr0 and cr4 are marked available+dirty without immediately writing
vcpu->arch.cr0/cr4.  And VMX subtly relies on that, as vmx_set_cr0() grabs CR0.PG
via kvm_read_cr0_bits(), i.e. zeroing vcpu->arch.cr0 would "break" the INIT flow.
Ignoring for the moment that CR0.PG is never guest-owned and thus never stale in
vcpu->arch.cr0, KVM is also technically relying on the earlier kvm_read_cr0() in
kvm_vcpu_reset() to ensure vcpu->arch.cr0 is fresh.

Stuffing regs_avail technically means vmx_set_rflags() -> vmx_get_rflags() is
consuming stale data.  It doesn't matter in practice because the old value is
only used to re-evaluate vmx->emulation_required, which is guaranteed to be up
refreshed by vmx_set_cr0() and friends.  PDPTRs and EXIT_INFO are in a similar
boat; KVM shouldn't be reading those fields (CR0.PG=0, not an exit path), but
marking them dirty without actually updating the cached values is wrong.

There's also one concrete bug: KVM doesn't set vcpu->arch.cr3=0 on RESET/INIT.
That bug has gone unnoticed because no real word BIOS/kernel is going to rely on
INIT to set CR3=0.  

I'm strongly leaning towards stuffing regs_avail/dirty in kvm_arch_vcpu_create(),
and relying on explicit kvm_register_mark_dirty() calls for the 3 or so cases where
x86 code writes a vcpu->arch register directly.  That would fix the CR0 read bug
and also prevent subtle bugs from sneaking in.  Adding new EXREGS would be slightly
more costly, but IMO that's a good thing as would force us to actually think about
how to handle each register.

E.g. implement this over 2-3 patches:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 114847253e0a..743146ac8307 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4385,6 +4385,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        kvm_set_cr8(vcpu, 0);

        vmx_segment_cache_clear(vmx);
+       kvm_register_mark_available(vcpu, VCPU_EXREG_SEGMENTS);

        seg_setup(VCPU_SREG_CS);
        vmcs_write16(GUEST_CS_SELECTOR, 0xf000);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 86539c1686fa..ab907a0b9eeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10656,6 +10656,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
        int r;

        vcpu->arch.last_vmentry_cpu = -1;
+       vcpu->arch.regs_avail = ~0;
+       vcpu->arch.regs_dirty = ~0;

        if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
@@ -10874,9 +10876,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
                vcpu->arch.xcr0 = XFEATURE_MASK_FP;
        }

+       /* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
        memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
-       vcpu->arch.regs_avail = ~0;
-       vcpu->arch.regs_dirty = ~0;
+       kvm_register_mark_dirty(vcpu, VCPU_REGS_RSP);

        /*
         * Fall back to KVM's default Family/Model/Stepping of 0x600 (P6/Athlon)
@@ -10897,6 +10899,9 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
        kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
        kvm_rip_write(vcpu, 0xfff0);

+       vcpu->arch.cr3 = 0;
+       kvm_register_mark_dirty(vcpu, VCPU_EXREG_CR3);
+
        /*
         * CR0.CD/NW are set on RESET, preserved on INIT.  Note, some versions
         * of Intel's SDM list CD/NW as being set on INIT, but they contradict

> Side topic: we don't seem to reset Hyper-V specific MSRs either,
> probably relying on userspace VMM doing the right thing but this is also
> not ideal I believe.