linux-kernel - Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04ac8bcd-54f7-acd1-0764-11f6925c2c94@maciej.szmigiero.name>
Date:   Thu, 7 Apr 2022 17:32:01 +0200
From:   "Maciej S. Szmigiero" <mail@...iej.szmigiero.name>
To:     Sean Christopherson <seanjc@...gle.com>
Cc:     Maxim Levitsky <mlevitsk@...hat.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        Wanpeng Li <wanpengli@...cent.com>,
        Jim Mattson <jmattson@...gle.com>,
        Joerg Roedel <joro@...tes.org>, kvm@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH 5/8] KVM: SVM: Re-inject INT3/INTO instead of retrying the
 instruction

On 7.04.2022 01:03, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Maciej S. Szmigiero wrote:
>> On 6.04.2022 22:52, Sean Christopherson wrote:
>>> On Wed, Apr 06, 2022, Maciej S. Szmigiero wrote:
>>>> Another option for saving and restoring a VM would be to add it to
>>>> KVM_{GET,SET}_NESTED_STATE somewhere (maybe as a part of the saved VMCB12
>>>> control area?).
>>>
>>> Ooh.  What if we keep nested_run_pending=true until the injection completes?  Then
>>> we don't even need an extra flag because nested_run_pending effectively says that
>>> any and all injected events are for L1=>L2.  In KVM_GET_NESTED_STATE, shove the
>>> to-be-injected event into the normal vmc*12 injection field, and ignore all
>>> to-be-injected events in KVM_GET_VCPU_EVENTS if nested_run_pending=true.
>>>
>>> That should work even for migrating to an older KVM, as keeping nested_run_pending
>>> will cause the target to reprocess the event injection as if it were from nested
>>> VM-Enter, which it technically is.
>>
>> I guess here by "ignore all to-be-injected events in KVM_GET_VCPU_EVENTS" you
>> mean *moving* back the L1 -> L2 event to be injected from KVM internal data
>> structures like arch.nmi_injected (and so on) to the KVM_GET_NESTED_STATE-returned
>> VMCB12 EVENTINJ field (or its VMX equivalent).
>>
>> But then the VMM will need to first call KVM_GET_NESTED_STATE (which will do
>> the moving), only then KVM_GET_VCPU_EVENTS (which will then no longer show
>> these events as pending).
>> And their setters in the opposite order when restoring the VM.
> 
> I wasn't thinking of actually moving things in the source VM, only ignoring events
> in KVM_GET_VCPU_EVENTS.  Getting state shouldn't be destructive, e.g. the source VM
> should still be able to continue running.

Right, getters should not change state.

> Ahahahaha, and actually looking at the code, there's this gem in KVM_GET_VCPU_EVENTS
> 
> 	/*
> 	 * The API doesn't provide the instruction length for software
> 	 * exceptions, so don't report them. As long as the guest RIP
> 	 * isn't advanced, we should expect to encounter the exception
> 	 * again.
> 	 */
> 	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> 		events->exception.injected = 0;
> 		events->exception.pending = 0;
> 	}
> 
> and again for soft interrupts
> 
> 	events->interrupt.injected =
> 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> 
> so through KVM's own incompetency, it's already doing half the work.

So KVM_GET_VCPU_EVENTS was basically already explicitly broken for this
case (where RIP does not point to a INT3/INTO/INT x instruction).

> This is roughly what I had in mind.  It will "require" moving nested_run_pending
> to kvm_vcpu_arch, but I've been itching for an excuse to do that anyways.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index eb71727acecb..62c48f6a0815 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4846,6 +4846,8 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>   static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>                                                 struct kvm_vcpu_events *events)
>   {
> +       bool drop_injected_events = vcpu->arch.nested_run_pending;
> +
>          process_nmi(vcpu);
> 
>          if (kvm_check_request(KVM_REQ_SMI, vcpu))
> @@ -4872,7 +4874,8 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>           * isn't advanced, we should expect to encounter the exception
>           * again.
>           */
> -       if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> +       if (drop_injected_events ||
> +           kvm_exception_is_soft(vcpu->arch.exception.nr)) {
>                  events->exception.injected = 0;
>                  events->exception.pending = 0;
>          } else {
> @@ -4893,13 +4896,14 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>          events->exception_has_payload = vcpu->arch.exception.has_payload;
>          events->exception_payload = vcpu->arch.exception.payload;
> 
> -       events->interrupt.injected =
> -               vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> +       events->interrupt.injected = vcpu->arch.interrupt.injected &&
> +                                    !vcpu->arch.interrupt.soft &&
> +                                    !drop_injected_events;
>          events->interrupt.nr = vcpu->arch.interrupt.nr;
>          events->interrupt.soft = 0;
>          events->interrupt.shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> 
> -       events->nmi.injected = vcpu->arch.nmi_injected;
> +       events->nmi.injected = vcpu->arch.nmi_injected && !drop_injected_events;
>          events->nmi.pending = vcpu->arch.nmi_pending != 0;
>          events->nmi.masked = static_call(kvm_x86_get_nmi_mask)(vcpu);
>          events->nmi.pad = 0;
> 

So the VMM will get VMCB12 with EVENTINJ field filled with the event
to re-inject from KVM_GET_NESTED_STATE and events.{exception,interrupt,nmi}.injected
unset from KVM_GET_VCPU_EVENTS.

Let's say now that the VMM uses this data to restore a VM: it restores nested
state by using KVM_SET_NESTED_STATE and then events by calling KVM_SET_VCPU_EVENTS.

So it ends with VMCB12 EVENTINJ field filled, but KVM injection structures
(arch.{exception,interrupt,nmi}.injected) zeroed by that later KVM_SET_VCPU_EVENTS
call.

Assuming that L1 -> L2 event injection is always based on KVM injection structures
(like we discussed earlier), rather than on a direct copy of EVENTINJ field like
it is now, before doing the first VM entry KVM will need to re-parse VMCB12 EVENTINJ
field into KVM arch.{exception,interrupt,nmi}.injected to make it work properly.

Thanks,
Maciej