linux-kernel - Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping

Message-ID: <7d39a4c9-553d-29fc-b12e-ebbe505f823a@siemens.com>
Date:   Thu, 18 Mar 2021 17:02:45 +0100
From:   Jan Kiszka <jan.kiszka@...mens.com>
To:     Maxim Levitsky <mlevitsk@...hat.com>,
        Sean Christopherson <seanjc@...gle.com>
Cc:     kvm list <kvm@...r.kernel.org>,
        Vitaly Kuznetsov <vkuznets@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Thomas Gleixner <tglx@...utronix.de>,
        Wanpeng Li <wanpengli@...cent.com>,
        Kieran Bingham <kbingham@...nel.org>,
        Jessica Yu <jeyu@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)" <x86@...nel.org>,
        Joerg Roedel <joro@...tes.org>,
        Jim Mattson <jmattson@...gle.com>,
        Borislav Petkov <bp@...en8.de>,
        Stefano Garzarella <sgarzare@...hat.com>,
        "H. Peter Anvin" <hpa@...or.com>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Ingo Molnar <mingo@...hat.com>
Subject: Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while
 single stepping

[only saw this now, or delivery to me was delayed - anyway]

On 16.03.21 19:02, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 18:01 +0100, Jan Kiszka wrote:
>> On 16.03.21 17:50, Sean Christopherson wrote:
>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
>>>>> Back then, when I was hacking on the gdb-stub and KVM support, the
>>>>> monitor trap flag was not yet broadly available, but the idea to once
>>>>> use it was already there. Now it can be considered broadly available,
>>>>> but it would still require some changes to get it in.
>>>>>
>>>>> Unfortunately, we don't have such a thing with SVM, even in recent versions,
>>>>> right? So, a proper way of avoiding diverting event injections while we
>>>>> are having the guest in an "incorrect" state should definitely be the goal.
>>>> Yes, I am not aware of anything like monitor trap on SVM.
>>>>
>>>>> Given that KVM knows whether TF originates solely from guest debugging
>>>>> or was (also) injected by the guest, we should be able to identify the
>>>>> cases where your approach is best to apply. And that without any extra
>>>>> control knob that everyone will only forget to set.
>>>> Well I think that the downside of this patch is that the user might actually
>>>> want to single step into an interrupt handler, and this patch makes it a bit
>>>> more complicated, and changes the default behavior.
>>>
>>> Yes.  And, as is, this also blocks NMIs and SMIs.  I suspect it also doesn't
>>> prevent weirdness if the guest is running in L2, since IRQs for L1 will cause
>>> exits from L2 during nested_ops->check_events().
>>>
>>>> I have no objections though to use this patch as is, or at least make this
>>>> the new default with a new flag to override this.
>>>
>>> That's less bad, but IMO still violates the principle of least surprise, e.g.
>>> someone that is single-stepping a guest and is expecting an IRQ to fire will be
>>> all kinds of confused if they see all the proper IRR, ISR, EFLAGS.IF, etc...
>>> settings, but no interrupt.
>>
>> From my practical experience with debugging guests via single step,
>> seeing an interrupt in that case is everything but handy and generally
>> also not expected (though logical, I agree). IOW: When there is a knob
>> for it, it will remain off 99% of the time.
>>
>> But I see the point of having some control, in an ideal world also an
>> indication that there are pending events, permitting the user to decide
>> what to do. But I suspect the gdb frontend and protocol does not easily
>> permit that.
> 
> Qemu gdbstub actually does have control over suppression of the interrupts
> over a single step and it is even enabled by default:
> 
> https://qemu.readthedocs.io/en/latest/system/gdb.html
> (advanced debug options)
> 

Ah, cool! Absolutely in line with what we need.

> However it is currently only implemented in TCG (software emulator) mode 
> and not in KVM mode (I can argue that this is a qemu bug).

Maybe the behavior of old KVM was not exposing the issue, thus no one
cared. As I wrote in the other mail today, even some recent kernels do
not seem to break single-stepping, for as yet unknown reasons.

> 
> So my plan was to add a new kvm guest debug flag KVM_GUESTDBG_BLOCKEVENTS,
> and let qemu enable it when its 'NOIRQ' mode is enabled (it is by default).
> 
> However due to the discussion in this thread about the leakage of the RFLAGS.TF,
> I wonder if kvm should by default suppress events and have something like
> KVM_GUESTDBG_SSTEP_ALLOW_EVENTS to override this and wire 
> that to qemu's NOIRQ=false case.
> 
> This will allow older qemu to work correctly and new qemu will be able to choose
> the old less ideal behavior.

Sounds very reasonable to me.
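
A minimal sketch of how userspace might pick the debug control word under
the second proposal above (events suppressed by default while single
stepping, with an opt-in override). The values of KVM_GUESTDBG_ENABLE and
KVM_GUESTDBG_SINGLESTEP match the KVM UAPI headers but are defined locally
so the sketch stands alone. KVM_GUESTDBG_SSTEP_ALLOW_EVENTS is only the
name floated in this thread; its bit value and the helper
sstep_debug_control() are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Existing KVM guest debug control bits, defined locally for the sketch. */
#define KVM_GUESTDBG_ENABLE      0x00000001u
#define KVM_GUESTDBG_SINGLESTEP  0x00000002u

/* Hypothetical: flag name proposed in this thread, bit value is a placeholder. */
#define KVM_GUESTDBG_SSTEP_ALLOW_EVENTS 0x00100000u

/*
 * Hypothetical helper: build the control word for a single step.
 * noirq mirrors qemu's gdbstub 'NOIRQ' setting (on by default).
 * With noirq set, nothing extra is requested and the kernel would
 * suppress events on its own; with noirq cleared, userspace asks
 * for the old behavior explicitly.
 */
uint32_t sstep_debug_control(bool noirq)
{
    uint32_t control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_SINGLESTEP;

    if (!noirq)
        control |= KVM_GUESTDBG_SSTEP_ALLOW_EVENTS;

    return control;
}
```

The resulting value would go into the control field of struct
kvm_guest_debug passed to KVM_SET_GUEST_DEBUG. An older qemu that knows
nothing about the new bit keeps passing only the first two flags and so
gets the suppressing default.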

> 
>>
>>>> Sean Christopherson, what do you think?
>>>
>>> Rather than block all events in KVM, what about having QEMU "pause" the timer?
>>> E.g. save MSR_TSC_DEADLINE and APIC_TMICT (or inspect the guest to find out
>>> which flavor it's using), clear them to zero, then restore both when
>>> single-stepping is disabled.  I think that will work?
>>>
>>
>> No one can stop the clock, and timers are only one source of interrupts.
>> Plus they do not all come from QEMU, some also from KVM or in-kernel
>> sources directly. Would quickly become a mess.
> 
> This, plus as we see, even changing RFLAGS.TF leaks it.

As I wrote: When we take events, the leakage must be stopped for that
case. But that might be a bit more tricky because we need to stop on the
first instruction in the interrupt handler, thus need some TF, but we
must also remove it again from the flags saved for the interrupt context
on the guest's interrupt/exception handler stack.
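
To illustrate the second half of that, here is a rough sketch of the
masking step only, not of how KVM would locate or rewrite the stack
frame. The helper name and the guest_owns_tf parameter are hypothetical;
the idea is that TF is stripped from the saved flags image only when it
was set purely for host-side debugging.

```c
#include <stdint.h>

#define X86_EFLAGS_TF 0x00000100u  /* trap flag, bit 8 of (R)FLAGS */

/*
 * Hypothetical helper: given the flags image the CPU pushed for the
 * interrupt/exception handler, return the value the guest should see.
 * If the guest itself set TF, the image is left untouched so that
 * guest-visible behavior does not change.
 */
uint64_t saved_flags_without_debug_tf(uint64_t saved_rflags, int guest_owns_tf)
{
    if (guest_owns_tf)
        return saved_rflags;

    return saved_rflags & ~(uint64_t)X86_EFLAGS_TF;
}
```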

> Changing things like MSR_TSC_DEADLINE will also make it visible to the guest,
> sooner or later and is a mess that I'd rather not get into.
> 
> It is _possible_ to disable timer interrupts 'out of band', but that is messy too
> if done from userspace. For example, what if the timer interrupt is already pending
> in local apic, when qemu decides to single step?
> 
> Also with gdbstub the user doesn't have to stop all vcpus (there is a non-stop mode),
> in which only some vcpus are stopped which is actually a very cool feature,
> and of course running vcpus can raise events.
> 
> Also interrupts can indeed come from things like vhost.
> 

Exactly.

Jan

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux