linux-kernel - Re: [PATCH] KVM: X86: set EXITING_GUEST

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y5D1JWutV7+nARxS@google.com>
Date:   Wed, 7 Dec 2022 20:18:45 +0000
From:   Sean Christopherson <seanjc@...gle.com>
To:     Jon Kohler <jon@...anix.com>
Cc:     Paolo Bonzini <pbonzini@...hat.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
        Dave Hansen <dave.hansen@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "H. Peter Anvin" <hpa@...or.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] KVM: X86: set EXITING_GUEST_MODE as soon as vCPU exits

On Mon, Dec 05, 2022, Jon Kohler wrote:
> 
> > On Dec 1, 2022, at 2:17 PM, Sean Christopherson <seanjc@...gle.com> wrote:
> >> Changing vcpu->mode away from IN_GUEST_MODE as early as possible
> > 
> > Except this isn't as early as possible.  If we're going to bother doing something
> > like this, my vote is to move it into assembly.
> 
> In vmenter.S, tacking on to call vmx_spec_ctrl_restore_host seemed like the
> most logical place after handling all of the state saves and RSB work. Are you
> saying put it even closer to the exit, meaning before the FILL_RETURN_BUFFER?

Yes, assuming that's safe, in which case it's truly the "as early as possible"
location.

> >> gives IPI senders as much runway as possible to avoid ringing
> >> doorbell or sending posted interrupt IPI in AMD and Intel,
> >> respectively. Since this is done without an explicit memory
> >> barrier, the worst case is that the IPI sender sees IN_GUEST_MODE
> >> still and sends a spurious event, which is the behavior prior
> >> to this patch.
> > 
> > No, worst case scenario is that kvm_vcpu_exiting_guest_mode() sees EXITING_GUEST_MODE
> > and doesn't kick the vCPU.  For "kicks" that set a request, kvm_vcpu_exit_request()
> > will punt the vCPU out of the tight run loop, though there might be ordering issues.
> > 
> > But whether or not there are ordering issues is a moot point since there are uses
> > of kvm_vcpu_kick() that aren't accompanied by a request, e.g. to purge the PML
> > buffers.  In other words, kvm_vcpu_kick() absolutely cannot have false negatives.
> > We could modify KVM to require a request when using kvm_vcpu_kick(), but that's
> > a bit of a hack, and all of the possible ordering problems is still a pile of
> > complexity I'd rather avoid.
> > 
> > No small part of me thinks we'd be better off adding a dedicated flag to very
> > precisely track whether or not the vCPU is truly "in the guest" for the purposes
> > of sending IPIs.  Things like kicks have different requirements around IN_GUEST_MODE
> > than sending interrupts, e.g. KVM manually processes the IRR on every VM-Enter and
> > so lack of an IPI is a non-issue, whereas missing an IPI for a kick is problematic.
> > In other words, EXITING_GUEST_MODE really needs to mean "existing the run loop".
> 
> Do you mean:
> “one small part” (as in give this a shot, maybe), 
> or 
> “no small part” (as in good-god-don’t-do-this!)

Neither.  "No small part" as in "Most of my brain", i.e. "I haven't completely
thought things through, but I think we'd be better off adding a dedicated flag".

> I’m assuming you meant one small part :) sure, how about something like:
> 
> To my earlier comment about when to do this within a few instructions, I don’t want
> to clobber other stuff happening as part of the enter/exit, what if we repurposed/renamed
> vmx_update_host_rsp and vmx_spec_ctrl_restore_host to make them “do stuff before
> entry” and “do stuff right after entry returns” functions. That way we wouldn’t have to
> add another other function calls or change the existing control flow all that much.

I'd prefer not to wrap vmx_update_host_rsp(), that thing is a very special
snowflake.

I don't see why we'd have to add function calls or change the existing control
flow anyways.  The asm flows for VMX and SVM both take the vCPU in the form of
@vmx and @svm, so accessing the proposed excution mode field is just a matter of
adding an entry in arch/x86/kvm/kvm-asm-offsets.c.

And now that kvm-asm-offsets.c exists, I think it makes sense to drop the @regs
parameter for __vmx_vcpu_run(), e.g. to roughly match __svm_vcpu_run().

With that done as prep, accessing the vCPU immediately before/after VM-Enter and
VM-Exit is easy.

As a rough, incomplete sketch for VMX:

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 766c6b3ef5ed..f80553e34f26 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -102,7 +102,7 @@ SYM_FUNC_START(__vmx_vcpu_run)
         * an LFENCE to stop speculation from skipping the wrmsr.
         */
 
-       /* Load @regs to RAX. */
+       /* Load @vmx to RAX. */
        mov (%_ASM_SP), %_ASM_AX
 
        /* Check if vmlaunch or vmresume is needed */
@@ -125,7 +125,11 @@ SYM_FUNC_START(__vmx_vcpu_run)
        mov VCPU_R14(%_ASM_AX), %r14
        mov VCPU_R15(%_ASM_AX), %r15
 #endif
-       /* Load guest RAX.  This kills the @regs pointer! */
+
+       /* Mark the vCPU as executing in the guest! */
+       movb $IN_GUEST_MODE, VCPU_EXECUTION_MODE(_%ASM_AX)
+
+       /* Load guest RAX.  This kills the @vmx pointer! */
        mov VCPU_RAX(%_ASM_AX), %_ASM_AX
 
        /* Check EFLAGS.ZF from 'testb' above */
@@ -163,9 +167,11 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
        /* Temporarily save guest's RAX. */
        push %_ASM_AX
 
-       /* Reload @regs to RAX. */
+       /* Reload @vmx to RAX. */
        mov WORD_SIZE(%_ASM_SP), %_ASM_AX
 
+       movb $IN_HOST_MODE, VCPU_EXECUTION_MODE(_%ASM_AX)
+
        /* Save all guest registers, including RAX from the stack */
        pop           VCPU_RAX(%_ASM_AX)
        mov %_ASM_CX, VCPU_RCX(%_ASM_AX)
@@ -189,9 +195,12 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
        xor %ebx, %ebx
 
 .Lclear_regs:
-       /* Discard @regs.  The register is irrelevant, it just can't be RBX. */
+       /* Pop @vmx.  The register can be anything except RBX. */
        pop %_ASM_AX
 
+       /* Set the execution mode (again) in case of VM-Fail. */
+       movb $IN_HOST_MODE, VCPU_EXECUTION_MODE(_%ASM_AX)
+
        /*
         * Clear all general purpose registers except RSP and RBX to prevent
         * speculative use of the guest's values, even those that are reloaded


> In “do before” we could set something like vcpu->non_root_mode = true
> In “do after” we could set vcpu->non_root_mode = false
> 
> Or perhaps something like (respectively)
> vcpu->operational_state = NON_ROOT_MODE
> vcpu->operational_state = ROOT_MODE
>
> Using the root/non-root moniker would precisely track when the whether the 
> vCPU is in guest, and aligns with the language used to describe such a state
> from the SDM.
> 
> Thoughts?

No, we should use KVM-defined names, not VMX-defined names, because we'll want
the same logic for SVM.  If we first rename GUEST_MODE => RUN_LOOP, then we can
use IN_GUEST_MODE and IN_HOST_MODE or whatever.