Message-ID: <d69953378bd1fdcdda54a2fbe285f6c0b1484e8a.camel@infradead.org>
Date: Sun, 08 Nov 2020 19:36:34 +0000
From: David Woodhouse <dwmw2@...radead.org>
To: Thomas Gleixner <tglx@...utronix.de>,
Jason Gunthorpe <jgg@...dia.com>,
Dan Williams <dan.j.williams@...el.com>
Cc: "Raj, Ashok" <ashok.raj@...el.com>,
"Tian, Kevin" <kevin.tian@...el.com>,
"Jiang, Dave" <dave.jiang@...el.com>,
Bjorn Helgaas <helgaas@...nel.org>,
"vkoul@...nel.org" <vkoul@...nel.org>,
"Dey, Megha" <megha.dey@...el.com>,
"maz@...nel.org" <maz@...nel.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"Pan, Jacob jun" <jacob.jun.pan@...el.com>,
"Liu, Yi L" <yi.l.liu@...el.com>, "Lu, Baolu" <baolu.lu@...el.com>,
"Kumar, Sanjay K" <sanjay.k.kumar@...el.com>,
"Luck, Tony" <tony.luck@...el.com>,
"jing.lin@...el.com" <jing.lin@...el.com>,
"kwankhede@...dia.com" <kwankhede@...dia.com>,
"eric.auger@...hat.com" <eric.auger@...hat.com>,
"parav@...lanox.com" <parav@...lanox.com>,
"rafael@...nel.org" <rafael@...nel.org>,
"netanelg@...lanox.com" <netanelg@...lanox.com>,
"shahafs@...lanox.com" <shahafs@...lanox.com>,
"yan.y.zhao@...ux.intel.com" <yan.y.zhao@...ux.intel.com>,
"pbonzini@...hat.com" <pbonzini@...hat.com>,
"Ortiz, Samuel" <samuel.ortiz@...el.com>,
"Hossain, Mona" <mona.hossain@...el.com>,
"dmaengine@...r.kernel.org" <dmaengine@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection

On Sun, 2020-11-08 at 19:47 +0100, Thomas Gleixner wrote:
> This only works when the guest OS actually knows that it runs in a
> VM. If the guest can't figure that out, i.e. via CPUID, this cannot be
> solved because from the guest OS view that's the same as running on bare
> metal. Obviously on bare metal the Vector domain can and must handle
> this.
>
> So this needs some thought.

The problem here is that Intel implemented interrupt remapping in a way
which is anathema to structured, ordered IRQ domains.

When a guest writes an MSI message (addr/data) to the MSI table of a
PCI device which has been assigned to that guest, it *doesn't* properly
inherit the MSI composition from a parent irqdomain which knows about
the (host-side) IOMMU.
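
For the avoidance of doubt, here's roughly what that inheritance looks
like on the host side. This is a sketch loosely modelled on
irq_chip_compose_msi_msg() in kernel/irq/chip.c, not the verbatim code:

#include <linux/errno.h>
#include <linux/irq.h>
#include <linux/msi.h>

/* Walk up the irqdomain hierarchy until some level knows how to
 * compose the message. On a host with remapping enabled, that level
 * is the IOMMU's remapping domain (intel_ir_chip), which emits a
 * Remappable Format message pointing at an IRTE. */
static int compose_msi_msg_sketch(struct irq_data *data, struct msi_msg *msg)
{
	struct irq_data *pos;

	for (pos = data; pos; pos = pos->parent_data) {
		if (pos->chip && pos->chip->irq_compose_msi_msg) {
			pos->chip->irq_compose_msi_msg(pos, msg);
			return 0;
		}
	}
	return -ENOSYS;
}

A guest writing directly to the device's MSI table gets none of this;
there is no parent domain in the picture at all.
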
What actually happens is that the hypervisor *traps* the writes to the
device's MSI table, and translates them *then*, in *precisely* the
fashion which we're trying to avoid for IMS.
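
That is, approximately the following, with every function name below
made up for illustration; the real logic is split between the VMM and
the host's VFIO/interrupt-remapping code:

#include <linux/msi.h>

struct guest;		/* opaque, hypothetical */
struct assigned_dev;	/* opaque, hypothetical */

/* Hypothetical sketch of the trap path. The guest's write never
 * reaches the hardware; the hypervisor translates it first. */
static void msix_entry_write_trapped(struct guest *g,
				     struct assigned_dev *dev,
				     unsigned int entry,
				     u32 addr_lo, u32 addr_hi, u32 data)
{
	struct msi_msg host_msg;

	/* Remember what the guest thinks it programmed: a message
	 * aimed at one of its own (virtual) APICs. */
	stash_guest_msi(g, dev, entry, addr_lo, addr_hi, data);

	/* Allocate or update an IRTE for this (device, entry) pair,
	 * targeting the right physical CPU/vector, and compose a
	 * Remappable Format message which references that IRTE. */
	compose_remapped_msi(g, dev, entry, &host_msg);

	/* Only the translated message reaches the real device. */
	write_hw_msix_entry(dev, entry, &host_msg);
}
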
Now, you can imagine a world where it wasn't like this, where
Remappable Format MSI messages don't exist, and where we let guests
write native MSI messages to the device without trapping — and where the
IOMMU then sees the incoming interrupt and has to map the APIC ID to a
*virtual* CPU for that guest, based on the PCI source-id of the device.

In that world, IMS would work naturally. But that isn't how Intel
designed interrupt remapping. They *designed* it such that the message
has to be trapped and translated as it is written to the device.
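
To make the distinction concrete, this is what the IOMMU sees in the
two cases when a DWORD lands in the 0xFEExxxxx window (bit names as in
the old arch/x86/include/asm/msidef.h):

/* Compatibility Format (addr bit 4 == 0): the message *is* the
 * routing: addr[19:12] is the destination APIC ID and the data
 * register carries the vector and delivery mode.
 *
 * Remappable Format (addr bit 4 == 1): the message carries no
 * routing at all, only an index (the "handle") into the IRTE
 * table; the actual destination lives in the IRTE. */
#define MSI_ADDR_IR_EXT_INT	(1 << 4)		/* 1 = Remappable */
#define MSI_ADDR_IR_SHV		(1 << 3)		/* data[15:0] = subhandle */
#define MSI_ADDR_IR_INDEX1(idx)	(((idx) & 0x7fff) << 5)	/* handle[14:0] */
#define MSI_ADDR_IR_INDEX2(idx)	(((idx) >> 15) << 2)	/* handle[15] */
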
So it does look like we're going to need a hypercall interface to
compose an MSI message on behalf of the guest, for IMS to use. In fact,
PCI devices assigned to a guest could use that too, and then we'd only
need to trap-and-remap any attempt to write a Compatibility Format MSI
to the device's MSI table, while letting Remappable Format messages get
written directly.
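
From the guest's side that could look something like this; to be clear,
the interface below is entirely made up, nothing of the sort exists
today:

#include <linux/types.h>

/* Hypothetical hypercall: the guest identifies the interrupt source
 * and its desired (guest-visible) target, and gets back an addr/data
 * pair which is safe to write unmediated to the device. */
struct msi_compose_req {
	u64 dev_id;	/* PCI requester ID, or an IMS cookie */
	u32 index;	/* which vector/slot on that device */
	u32 vcpu;	/* guest CPU to target */
	u32 vector;	/* guest vector */
};

struct msi_compose_resp {
	u32 address_hi;
	u32 address_lo;	/* pre-remapped by the hypervisor */
	u32 data;
};

long hv_compose_msi_msg(const struct msi_compose_req *req,
			struct msi_compose_resp *resp);
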
We'd also need a way for an OS running on bare metal to *know* that
it's on bare metal and can just compose MSI messages for itself. Since
we do expect bare metal to have an IOMMU, perhaps that is just a
feature flag on the IOMMU?
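
Detection could then be as simple as the sketch below. Note that
boot_cpu_has(X86_FEATURE_HYPERVISOR) is real, while the IOMMU flag is
exactly the part that would need defining (and, per Thomas's point
above, the CPUID bit alone isn't reliable, since a hypervisor can hide
it):

/* Sketch: decide whether the OS may compose MSI messages itself. */
static bool can_compose_msi_locally(void)
{
	/* No hypervisor CPUID bit: assume bare metal. */
	if (!boot_cpu_has(X86_FEATURE_HYPERVISOR))
		return true;

	/* Hypothetical flag: a (virtual) IOMMU explicitly promising
	 * that locally composed messages will be honoured. */
	return iommu_advertises_native_msi();
}
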
That, or Intel needs to fix the IOMMU to do proper virtualisation and
actually translate "Compatibility Format" MSIs for a guest too.