linux-kernel - Re: [EXTERNAL] [PATCH RFC 12/39] KVM: x86/xen: store virq when assigning evtchn

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <776740ea7c05a6c17fa05a809b4cbeb824b5afa2.camel@infradead.org>
Date:   Thu, 10 Feb 2022 15:23:10 +0000
From:   David Woodhouse <dwmw2@...radead.org>
To:     Joao Martins <joao.m.martins@...cle.com>
Cc:     "bp@...en8.de" <bp@...en8.de>, "x86@...nel.org" <x86@...nel.org>,
        "boris.ostrovsky@...cle.com" <boris.ostrovsky@...cle.com>,
        "hpa@...or.com" <hpa@...or.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "tglx@...utronix.de" <tglx@...utronix.de>,
        "ankur.a.arora@...cle.com" <ankur.a.arora@...cle.com>,
        "rkrcmar@...hat.com" <rkrcmar@...hat.com>,
        "pbonzini@...hat.com" <pbonzini@...hat.com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [EXTERNAL] [PATCH RFC 12/39] KVM: x86/xen: store virq when
 assigning evtchn

On Thu, 2022-02-10 at 12:17 +0000, Joao Martins wrote:
> On 2/8/22 16:17, Woodhouse, David wrote:
> > And then we have the *outbound* events, which the guest can invoke with
> > the EVTCHNOP_send hypercall. Those are either:
> >  • IPI, raising the same port# on the guest
> >  • Interdomain looped back to a different port# on the guest
> >  • Interdomain triggering an eventfd.
> > 
> 
> /me nods
> 
> I am forgetting why you one do this on Xen:
> 
> * Interdomain looped back to a different port# on the guest

It's one of the few things we had to fix up when we started running PV
guests in the 'shim' under KVM. I don't know that it actually sends
loopback events via the true Xen (or KVM) host, but it does at least
register them so that the port# is 'reserved' and the host won't
allocate that port for anything else. It does it at least for the
console port.

For the inbound vs. outbound thing.... I did ponder a really simple API
design in which outbound ports are *only* ever associated with an
eventfd, and for IPIs the VMM would be expected to bind those as IRQFD
to an inbound event on the same port#.

You pointed out that it was quite inefficient, but... we already have
complex hacks to bypass the eventfd for posted interrupts when the
source and destination "match", and perhaps we could do something
similar to allow EVTCHNOP_send to deliver directly to a local port#
without having to go through all the eventfd code?

But the implementation of that would end up being awful, *and* the
userspace API isn't even that nice despite being "simple", because it
would force userspace to allocate a whole bunch of eventfds and use
space in the IRQ routing table for them. So it didn't seem worth it.
Let's just let userspace tell us explicitly the vcpu/port/prio instead
of having to jump through hoops to magically work it out from matching
eventfds.

> > In the last case, that eventfd can be set up with IRQFD for direct
> > event channel delivery to a different KVM/Xen guest.
> > 
> > I've used your implemention, with an idr for the outbound port# space
> > intercepting EVTCHNOP_send for known ports and only letting userspace
> > see the hypercall if it's for a port# the kernel doesn't know. Looks a
> > bit like
> > https://git.infradead.org/users/dwmw2/linux.git/commitdiff/b4fbc49218a
> > 
> > 
> > 
> > But I *don't* want to do the VIRQ part shown above, "spotting" the VIRQ
> > in that outbound port# space and squirreling the information away into
> > the kvm_vcpu for when we need to deliver a timer event.
> > 
> > The VIRQ isn't part of the *outbound* port# space; it isn't a port to
> > which a Xen guest can use EVTCHNOP_send to send an event.
> 
> But it is still an event channel which port is unique regardless of port
> type/space hence (...)
> 
> > If anything,
> > it would be part of the *inbound* port# space, in the KVM IRQ routing
> > table. So perhaps we could have a similar snippet in
> > kvm_xen_setup_evtchn() which spots a VIRQ and says "aha, now I know
> > where to deliver timer events for this vCPU".
> > 
> 
> (...) The thinking at the time was mainly simplicity so our way of saying
> 'offload the evtchn to KVM' was through the machinery that offloads the outbound
> part (using your terminology). I don't think even using XEN_EVENTFD as proposed
> here that that one could send an VIRQ via EVTCHNOP_send (I could be wrong as
> it has been a long time).

I confess I didn't test it but it *looked* like you could, while true
Xen wouldn't permit that.

> Regardless, I think you have a good point to split the semantics and (...)

> > 
> > So I think I'm going to make the timer VIRQ (port#, priority) into an
> > explicit KVM_XEN_VCPU_ATTR_TYPE.
> 
> (...) thus this makes sense. Do you particularly care about
> VIRQ_DEBUG?


Not really. Especially not as something to accelerate in KVM.

Our environment doesn't have any way to deliver that to guests,
although we *do* have an API call to deliver "diagnostic interrupt"
which maps to an NMI, and we *have* occasionally hacked the VMM to
deliver VIRQ_DEBUG to Xen guests instead of that NMI. Mostly back when
I was busy being confused about ->vcpu_id vs. ->vcpu_idx vs. the
Xen/ACPI CPU# and where the hell my interrupts were going.

> > Along with the *actual* timer expiry,
> > which we need to extract/restore for LU/LM too, don't we?
> > /me nods
> 
> I haven't thought that one well for Live Update / Live Migration, but
> I wonder if wouldn't be better to be instead a general 'xen state'
> attr type should you need more than just pending timers expiry. Albeit
> considering that the VMM has everything it needs (?), perhaps for Xen PV
> timer look to be the oddball missing, and we donºt need to go that extent.

Yeah, the VMM knows most of this stuff already, as it *told* the kernel
in the first place. Userspace is still responsible for all the setup
and admin, and the kernel just handles the bare minimum of the fast
path.

So on live update/migrate we only really need to read out the runstate
data from the kernel... and now the current timer expiry. 

On the *resume* side it's still lots of syscalls, and perhaps in the
end we might decide we want to do a KVM_XEN_HVM_SET_ATTR_MULTI which
takes an array of them? But I think that's a premature optimisation for
now.

Download attachment "smime.p7s" of type "application/pkcs7-signature" (5965 bytes)