linux-kernel - RE: [RESEND PATCH 5/6] KVM: x86/VMX: add kvm_vmx_reinject_nmi

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BN6PR1101MB21611347D37CF40403B974EDA8099@BN6PR1101MB2161.namprd11.prod.outlook.com>
Date:   Fri, 18 Nov 2022 00:05:53 +0000
From:   "Li, Xin3" <xin3.li@...el.com>
To:     "Christopherson,, Sean" <seanjc@...gle.com>
CC:     Peter Zijlstra <peterz@...radead.org>,
        Paolo Bonzini <pbonzini@...hat.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "x86@...nel.org" <x86@...nel.org>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "Tian, Kevin" <kevin.tian@...el.com>
Subject: RE: [RESEND PATCH 5/6] KVM: x86/VMX: add kvm_vmx_reinject_nmi_irq()
 for NMI/IRQ reinjection

> > > > > > > But what about NMIs, afaict this is all horribly broken for NMIs.
> > > > > > > So the whole VMX thing latches the NMI (which stops NMI
> > > > > > > recursion), right?
> > > > > > >
> > > > > > > But then you drop out of noinstr code, which means any
> > > > > > > random exception can happen (kprobes #BP, hw_breakpoint #DB,
> > > > > > > or even #PF due to random nonsense like *SAN). This
> > > > > > > exception will do IRET and clear the NMI latch, all before
> > > > > > > you get to run any of the NMI code.
> > > > > >
> > > > > > What you said here implies that we have this problem in the existing
> code.
> > > > > > Because a fake iret stack is created to call the NMI handler
> > > > > > in the IDT NMI descriptor, which lastly executes the IRET instruction.
> > > > >
> > > > > I can't follow; of course the IDT handler terminates with IRET, it has to
> no?
> > > > >
> > > > > And yes, the current code appears to suffer the same defect.
> 
> That defect isn't going to be fixed simply by changing how KVM forwards NMIs
> though.  IIUC, _everything_ between VM-Exit and the invocation of the NMI
> handler needs to be noinstr.  On VM-Exit due to NMI, NMIs are blocked.  If a
> #BP/#DB/#PF occurs before KVM gets to kvm_x86_handle_exit_irqoff(), the
> subsequent IRET will unblock NMIs before the original NMI is serviced, i.e. a
> second NMI could come in at anytime regardless of how KVM forwards the NMI
> to the kernel.
> 
> Is there any way to solve this without tagging everything noinstr?  There is a
> metric shit ton of code between VM-Exit and the handling of NMIs, and much
> of that code is common helpers.  It might be possible to hoist NMI handler
> much earlier, though we'd need to do a super thorough audit to ensure all
> necessary host state is restored.

As NMI is the only vector with this potential issue, it sounds a good idea
to only promote its handling.


> > > > With FRED, ERETS/ERETU replace IRET, and use bit 28 of the popped
> > > > CS field to control whether to unblock NMI. If bit 28 of the field
> > > > (above the selector) is 1, ERETS/ERETU unblocks NMIs.
> 
> Side topic, there's a bug in the ISE docs.  Section "9.4.2 NMI Blocking" states
> that bit 16 holds the "unblock NMI" magic, which I'm guessing is a holdover
> from an earlier revision of FRED.

Good catch, the latest 3.0 draft spec changed it to bit 28, but section 9.4.2
didn't get a proper update.

> 
>   As specified in Section 6.1.3 and Section 6.2.3, ERETS and ERETU each
> unblocks NMIs
>   if bit 16 of the popped CS field is 1. The following items detail how this
> behavior may be
>   changed in VMX non-root operation, depending on the settings of certain VM-
> execution
>   controls:
> 
> > > Yes, I know that. It is one of the many improvements FRED brings.
> > > Ideally the IBT WAIT-FOR-ENDBR state also gets squirreled away in
> > > the hardware exception frame, but that's still up in the air I
> > > believe :/
> > >
> > > Anyway.. given there is interrupt priority and NMI is pretty much on
> > > top of everything else the reinject crap *should* run NMI first.
> > > That way NMI runs with the latch disabled and whatever other pending
> interrupts will run later.
> > >
> > > But that all is still broken because afaict the current code also
> > > leaves noinstr -- and once you leave noinstr (or use a static_key,
> > > static_call or anything else that
> > > *can* change at runtime) you can't guarantee nothing.
> >
> > For NMI, HPA asked me to use "int $2", as it switches to the NMI IST
> > stack to execute the NMI handler, essentially like how HW deals with a
> > NMI in host. and I tested it with NMI watchdog, it looks working fine.
> 
> Heh, well yeah, because that's how KVM used to handle NMIs back before I
> reworked NMI handling to use the direct call method.  Ironically, that original
> change was done in part to try and make it _easier_ to deal with FRED (back
> before FRED was publicly disclosed).
> 
> If KVM reverts to INTn, the fix to route KVM=>NMI through the non-IST entry
> can be reverted too.
> 
>   a217a6593cec ("KVM/VMX: Invoke NMI non-IST entry instead of IST entry")
>   1a5488ef0dcf ("KVM: VMX: Invoke NMI handler via indirect call instead of
> INTn")

Sure, I'm just thinking to put asm("int $2") in a new function exc_raise_nmi()
defined in arch/x86/kernel/traps.c. Thus we will need to change it only
whenever we have any better facility, and no KVM VMX change required.