Message-ID: <ZZRqptOaukCb7rO_@google.com>
Date: Tue, 2 Jan 2024 11:57:26 -0800
From: Sean Christopherson <seanjc@...gle.com>
To: Michal Wilczynski <michal.wilczynski@...el.com>
Cc: pbonzini@...hat.com, tglx@...utronix.de, mingo@...hat.com, bp@...en8.de,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
kvm@...r.kernel.org, linux-kernel@...r.kernel.org, zhi.a.wang@...el.com,
artem.bityutskiy@...ux.intel.com, yuan.yao@...el.com,
Zheyu Ma <zheyuma97@...il.com>, Maxim Levitsky <mlevitsk@...hat.com>
Subject: Re: [PATCH v1] KVM: nVMX: Fix handling triple fault on RSM instruction
+Maxim
On Fri, Dec 22, 2023, Michal Wilczynski wrote:
> Syzkaller found a warning triggered in nested_vmx_vmexit().
> vmx->nested.nested_run_pending is non-zero, even though we're in
> nested_vmx_vmexit(). Generally, trying to cancel a pending entry is
> considered a bug. However, in this particular scenario, the kernel
> behavior seems correct.
>
> Syzkaller scenario:
> 1) Set up vCPUs
> 2) Run some code with KVM_RUN in L2 as a nested guest
> 3) Return from KVM_RUN
> 4) Inject KVM_SMI into the VCPU
> 5) Change the EFER register with KVM_SET_SREGS to value 0x2501
> 6) Run some code on the VCPU using KVM_RUN
> 7) Observe the following behavior:
>
> kvm_smm_transition: vcpu 0: entering SMM, smbase 0x30000
> kvm_entry: vcpu 0, rip 0x8000
> kvm_entry: vcpu 0, rip 0x8000
> kvm_entry: vcpu 0, rip 0x8002
> kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x30000
> kvm_nested_vmenter: rip: 0x0000000000008002 vmcs: 0x0000000000007000
>                     nested_rip: 0x0000000000000000 int_ctl: 0x00000000
>                     event_inj: 0x00000000 nested_ept=n guest
>                     cr3: 0x0000000000002000
> kvm_nested_vmexit_inject: reason: TRIPLE_FAULT ext_inf1: 0x0000000000000000
>                           ext_inf2: 0x0000000000000000 ext_int: 0x00000000
>                           ext_int_err: 0x00000000
>
> What happened here is that an SMI was injected immediately and the handler
> was called at address 0x8000; all is good. Later, an RSM instruction is
> executed by the emulator to return to the nested VM. em_rsm() is called,
> which leads to emulator_leave_smm(). Part of that function calls the
> VMX/SVM callback, in this case vmx_leave_smm(). It attempts to set up a
> pending re-entry to the guest VM by calling nested_vmx_enter_non_root_mode()
> and sets vmx->nested.nested_run_pending to one. Unfortunately, later in
> emulator_leave_smm(), rsm_load_state_64() fails when writing the invalid
> EFER value to the MSR. This results in em_rsm() calling the triple_fault
> callback. At this point it's clear that KVM should perform a VM-exit, but
> vmx->nested.nested_run_pending is left set to 1. To fix this, clear the
> vmx->nested.nested_run_pending flag in the triple_fault handler.
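For those following along at home, the RSM path being described looks roughly
like this (paraphrasing vmx_leave_smm() from memory, the exact code differs
across versions):

	static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		int ret;

		if (vmx->nested.smm.vmxon) {
			vmx->nested.vmxon = true;
			vmx->nested.smm.vmxon = false;
		}

		if (vmx->nested.smm.guest_mode) {
			ret = nested_vmx_enter_non_root_mode(vcpu, false);
			if (ret)
				return ret;

			/*
			 * Set _after_ the VM-Enter call; see below for why
			 * that ordering is itself problematic.
			 */
			vmx->nested.nested_run_pending = 1;
			vmx->nested.smm.guest_mode = false;
		}
		return 0;
	}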
>
> TL;DR (courtesy of Yuan Yao)
> Clear nested_run_pending in case of RSM failure on return from L2 SMM.
KVM doesn't emulate SMM for L2. On an injected SMI, KVM forces a synthetic nested
VM-Exit to get from L2 to L1, and then emulates SMM in the context of L1.
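I.e. the SMI side does roughly this (again paraphrasing from memory, this may
not match your tree exactly):

	static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);

		/*
		 * Remember whether the vCPU was in L2, then force a synthetic
		 * VM-Exit to L1...
		 */
		vmx->nested.smm.guest_mode = is_guest_mode(vcpu);
		if (vmx->nested.smm.guest_mode)
			nested_vmx_vmexit(vcpu, -1, 0, 0);

		/* ...and emulate SMM entirely in the context of L1. */
		vmx->nested.smm.vmxon = vmx->nested.vmxon;
		vmx->nested.vmxon = false;

		vmx_clear_hlt(vcpu);
		return 0;
	}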
> The pending VM entry to L2 should be cancelled, since such a failure
> leads to a triple fault, which should be injected into L1.
>
> Possible alternative approach:
> While the proposed approach works, it would arguably be simpler and more
> readable to cancel nested_run_pending in em_rsm() instead. That would,
> however, require introducing a new callback, e.g. post_leave_smm(), to
> cancel nested_run_pending in case of a failure to resume from SMM.
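FWIW, if KVM went that route, I'd expect something like the below (completely
untested, and post_leave_smm() is the hypothetical callback proposed above, it
doesn't exist today):

	static int em_rsm(struct x86_emulate_ctxt *ctxt)
	{
		if (!ctxt->ops->is_smm(ctxt))
			return emulate_ud(ctxt);

		if (ctxt->ops->leave_smm(ctxt)) {
			/*
			 * Hypothetical: let vendor code undo any partially
			 * set up VM-Enter, e.g. clear nested_run_pending,
			 * before the triple fault is injected.
			 */
			ctxt->ops->post_leave_smm(ctxt);
			ctxt->ops->triple_fault(ctxt);
		}

		return emulator_recalc_and_set_mode(ctxt);
	}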
>
> Additionally, while the proposed code fixes a VMX-specific issue, SVM
> might suffer from a similar problem, as it also uses its own
> nested_run_pending variable.
>
> Reported-by: Zheyu Ma <zheyuma97@...il.com>
> Closes: https://lore.kernel.org/all/CAMhUBjmXMYsEoVYw_M8hSZjBMHh24i88QYm-RY6HDta5YZ7Wgw@mail.gmail.com
Fixes: 759cbd59674a ("KVM: x86: nSVM/nVMX: set nested_run_pending on VM entry which is a result of RSM")
> Signed-off-by: Michal Wilczynski <michal.wilczynski@...el.com>
> ---
> arch/x86/kvm/vmx/nested.c | 9 +++++++++
> 1 file changed, 9 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..44432e19eea6 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4904,7 +4904,16 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>
>  static void nested_vmx_triple_fault(struct kvm_vcpu *vcpu)
>  {
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +
>  	kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +
> +	/* In case of a triple fault, cancel the nested reentry. This may occur
	/*
	 * Multi-line comments should look like this.  Blah blah blab blah blah
	 * blah blah blah blah.
	 */
> +	 * when the RSM instruction fails while attempting to restore the state
> +	 * from SMRAM.
> +	 */
> +	vmx->nested.nested_run_pending = 0;
Argh. KVM's handling of SMIs while L2 is active is complete garbage. As explained
by the comment in vmx_enter_smm(), the L2<->SMM transitions should have a completely
custom flow and not piggyback/usurp nested VM-Exit/VM-Entry.
	/*
	 * TODO: Implement custom flows for forcing the vCPU out/in of L2 on
	 * SMI and RSM.  Using the common VM-Exit + VM-Enter routines is wrong;
	 * SMI and RSM only modify state that is saved and restored via SMRAM.
	 * E.g. most MSRs are left untouched, but many are modified by VM-Exit
	 * and VM-Enter, and thus L2's values may be corrupted on SMI+RSM.
	 */
The Fixes: commit above added a hack on top of the hack. But it's not entirely
clear from the changelog exactly what was being fixed.
    While RSM induced VM entries are not full VM entries,
    they still need to be followed by actual VM entry to complete it,
    unlike setting the nested state.

    This patch fixes boot of hyperv and SMM enabled
    windows VM running nested on KVM, which fail due
    to this issue combined with lack of dirty bit setting.
My first guess would be event injection, but that shouldn't be relevant to RSM.
Architecturally, events (SMIs, NMIs, IRQs, etc.) are recognized at instruction
boundaries, but except for SMIs (see below), KVM naturally defers injection until
an instruction boundary by virtue of delivering events via the VMCS/VMCB, i.e. by
waiting to deliver events until successfully re-entering the guest.
Nested VM-Enter is a special snowflake because KVM needs to finish injecting events
from vmcs12 before injecting any synthetic events, i.e. nested_run_pending ensures
that KVM wouldn't clobber/override an L2 event coming from L1. In other words,
nested_run_pending is much more specific than just needing to wait for an instruction
boundary.
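E.g. heavily trimmed from vmx_check_nested_events() (paraphrased, from
memory):

	static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
	{
		struct vcpu_vmx *vmx = to_vmx(vcpu);
		/*
		 * New events must not be injected while a nested VM-Enter is
		 * pending, else KVM could clobber an event that L1 wants to
		 * inject into L2.
		 */
		bool block_nested_events = vmx->nested.nested_run_pending ||
					   kvm_event_needs_reinjection(vcpu);

		...

		if (vcpu->arch.nmi_pending && !vmx_nmi_blocked(vcpu)) {
			if (block_nested_events)
				return -EBUSY;
			...
		}
		...
	}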
So while the "wait until the CPU is at an instruction boundary" rule applies to RSM,
it's not immediately obvious to me why setting nested_run_pending is necessary.
And setting nested_run_pending *after* calling nested_vmx_enter_non_root_mode()
is nasty. nested_vmx_enter_non_root_mode() and its helpers use nested_run_pending
to determine whether or not to enforce various consistency checks and other
behaviors. And a more minor issue is that stat.nested_run will be incorrectly
incremented.
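To be concrete about the "various consistency checks and other behaviors",
prepare_vmcs02() has logic along these lines (paraphrased):

	if (vmx->nested.nested_run_pending &&
	    (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS)) {
		kvm_set_dr(vcpu, 7, vmcs12->guest_dr7);
		vmcs_write64(GUEST_IA32_DEBUGCTL, vmcs12->guest_ia32_debugctl);
	} else {
		kvm_set_dr(vcpu, 7, vcpu->arch.dr7);
		vmcs_write64(GUEST_IA32_DEBUGCTL, vmx->nested.pre_vmenter_debugctl);
	}

Because vmx_leave_smm() sets nested_run_pending only after the VM-Enter call,
all such checks run as if no VM-Enter were pending.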
As a stopgap, something like this patch is not awful, though I would strongly
prefer to be more precise and not clear it on all triple faults. We've had KVM
bugs where KVM prematurely synthesizes triple fault on an actual nested VM-Enter,
and those would be covered up by this fix.
But due to nested_run_pending being (unnecessarily) buried in vendor structs, it
might actually be easier to do a cleaner fix. E.g. add yet another flag to track
that a hardware VM-Enter needs to be completed in order to complete instruction
emulation.
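Purely illustrative, and none of these names exist today (the plumbing is
glossed over), but the idea would be something like:

	/* New field in common code, e.g. in struct kvm_vcpu_arch. */
	bool emul_vmenter_pending;

	static void kvm_cancel_emul_vmenter(struct kvm_vcpu *vcpu)
	{
		if (!vcpu->arch.emul_vmenter_pending)
			return;

		vcpu->arch.emul_vmenter_pending = false;

		/*
		 * Hypothetical vendor hook to clear nested_run_pending (or
		 * SVM's equivalent), keeping vendor-specific details out of
		 * the common triple fault path.
		 */
		static_call(kvm_x86_cancel_emul_vmenter)(vcpu);
	}

with the flag set by the RSM path and the helper invoked from the common
triple fault code, which would also cover the SVM variant of this bug.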
And as alluded to above, there's another bug lurking. Events that are *emulated*
by KVM must not be emulated until KVM knows the vCPU is at an instruction boundary.
Specifically, enter_smm() shouldn't be invoked while KVM is in the middle of
instruction emulation (even if "emulation" is just setting registers and skipping
the instruction). Theoretically, that could be fixed by honoring the existing
at_instruction_boundary flag for SMIs, but that'd be a rather large change and
at_instruction_boundary is nowhere near accurate enough to use right now.
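For the record, the theoretical fix would be something like this in the SMI
handling of kvm_check_and_inject_events() (illustrative only, this is NOT
safe with at_instruction_boundary as it exists today):

	if (vcpu->arch.smi_pending) {
		/*
		 * Don't emulate SMM entry in the middle of instruction
		 * emulation; wait until the vCPU is back at an instruction
		 * boundary and retry.
		 */
		if (!vcpu->arch.at_instruction_boundary) {
			kvm_make_request(KVM_REQ_EVENT, vcpu);
			return 0;
		}
		...
	}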
Anyways, before we do anything, I'd like to get Maxim's input on what exactly was
addressed by 759cbd59674a.