Message-ID: <87pn4mi23u.fsf@nanos.tec.linutronix.de>
Date: Mon, 09 Nov 2020 23:42:29 +0100
From: Thomas Gleixner <tglx@...utronix.de>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: "Raj\, Ashok" <ashok.raj@...el.com>,
Dan Williams <dan.j.williams@...el.com>,
"Tian\, Kevin" <kevin.tian@...el.com>,
"Jiang\, Dave" <dave.jiang@...el.com>,
Bjorn Helgaas <helgaas@...nel.org>,
"vkoul\@kernel.org" <vkoul@...nel.org>,
"Dey\, Megha" <megha.dey@...el.com>,
"maz\@kernel.org" <maz@...nel.org>,
"bhelgaas\@google.com" <bhelgaas@...gle.com>,
"alex.williamson\@redhat.com" <alex.williamson@...hat.com>,
"Pan\, Jacob jun" <jacob.jun.pan@...el.com>,
"Liu\, Yi L" <yi.l.liu@...el.com>,
"Lu\, Baolu" <baolu.lu@...el.com>,
"Kumar\, Sanjay K" <sanjay.k.kumar@...el.com>,
"Luck\, Tony" <tony.luck@...el.com>,
"kwankhede\@nvidia.com" <kwankhede@...dia.com>,
"eric.auger\@redhat.com" <eric.auger@...hat.com>,
"parav\@mellanox.com" <parav@...lanox.com>,
"rafael\@kernel.org" <rafael@...nel.org>,
"netanelg\@mellanox.com" <netanelg@...lanox.com>,
"shahafs\@mellanox.com" <shahafs@...lanox.com>,
"yan.y.zhao\@linux.intel.com" <yan.y.zhao@...ux.intel.com>,
"pbonzini\@redhat.com" <pbonzini@...hat.com>,
"Ortiz\, Samuel" <samuel.ortiz@...el.com>,
"Hossain\, Mona" <mona.hossain@...el.com>,
"dmaengine\@vger.kernel.org" <dmaengine@...r.kernel.org>,
"linux-kernel\@vger.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-pci\@vger.kernel.org" <linux-pci@...r.kernel.org>,
"kvm\@vger.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [PATCH v4 06/17] PCI: add SIOV and IMS capability detection
On Mon, Nov 09 2020 at 13:30, Jason Gunthorpe wrote:
> On Mon, Nov 09, 2020 at 12:21:22PM +0100, Thomas Gleixner wrote:
>> >> Is the IOMMU/interrupt remapping unit able to catch such messages
>> >> which go outside the space the guest is allowed to signal to? If
>> >> yes, problem solved. If no, then IMS storage in guest memory can't
>> >> ever work.
> The SIOV case is to take a single RID and split it between multiple
> VMs and the hypervisor. All of these concurrently use the same RID,
> and the IOMMU can't tell them apart.
>
> The hypervisor security domain owns TLPs with no PASID. Each PASID is
> assigned to a VM.
>
> For interrupts, today, they are all generated, with no PASID, to the
> same RID. There is no way for remapping to protect against a guest
> without also checking the PASID.
>
> The relevance of PASID is this:
>
>> Again, trap/emulate does not work for IMS when the IMS store is
>> software-managed guest memory and not part of the device. And that's
>> the whole reason why we are discussing this.
>
> With PASID-tagged interrupts and an IOMMU interrupt remapping
> capability that can trigger on PASID, the platform can provide the
> same level of security as SRIOV - the above is no problem.
>
> The device ensures that all DMAs and all interrupts programmed by the
> guest are PASID-tagged, and the platform provides security by checking
> the PASID when delivering the interrupt.
Correct.
> Intel IOMMU doesn't work this way today, but it makes a lot of design
> sense.
Right.
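
For illustration, the check the remap unit would have to perform looks
roughly like this. This is a hypothetical sketch, not the actual VT-d
IRTE layout or any existing remapping code:

#include <stdbool.h>
#include <stdint.h>

struct remap_entry {
	bool		valid;
	bool		pasid_required;	/* Only usable with PASID-tagged TLPs */
	uint16_t	rid;		/* Originating requester ID */
	uint32_t	pasid;		/* PASID assigned to the guest */
	/* Translated destination (APIC ID, vector, ...) elided */
};

static bool remap_msg_allowed(const struct remap_entry *e, uint16_t src_rid,
			      bool has_pasid, uint32_t pasid)
{
	if (!e->valid || e->rid != src_rid)
		return false;
	/*
	 * The crucial addition: a message from a shared RID is only
	 * accepted when it carries the PASID assigned to the guest
	 * which owns this entry.
	 */
	if (e->pasid_required)
		return has_pasid && e->pasid == pasid;
	return !has_pasid;
}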
> Otherwise the interrupt is effectively delivered to the hypervisor. A
> secure device can *never* allow a guest to specify an addr/data pair
> for a non-PASID tagged TLP, so the device cannot offer IMS to the
> guest.
Ok. Let me summarize the current state of supported scenarios:
1) SRIOV works with any form of IMS storage because it does not require
PASID and the VF devices have unique requester ids, which allows the
remap unit to sanity check the message.
2) SIOV with IMS when the hypervisor can manage the IMS store
exclusively.
So #2 prevents a device which handles IMS storage in queue memory from
utilizing IMS for SIOV in a guest, because the hypervisor cannot manage
the IMS message store and the guest can write arbitrary crap to it,
which violates the isolation principle.
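
To illustrate the failure mode: if the IMS slot lives in guest-managed
queue memory, the device simply replays whatever the guest stored
there. All names below are made up; this is a sketch of the problem,
not any driver's actual code:

#include <stdint.h>

struct ims_slot {
	uint64_t	addr;	/* Guest writable! */
	uint32_t	data;	/* Guest writable! */
};

/* Stand-in for the device's posted-write engine */
static void device_dma_write32(uint64_t addr, uint32_t data) { }

static void device_send_irq(const struct ims_slot *slot)
{
	/*
	 * Untranslated DWORD write without PASID prefix, issued from
	 * the shared RID. The remap unit cannot tell whether the
	 * hypervisor or the guest provided this addr/data pair.
	 */
	device_dma_write32(slot->addr, slot->data);
}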
And here is the relevant part of the SIOV spec:
"IMS is managed by host driver software and is not accessible directly
from guest or user-mode drivers.
Within the device, IMS storage is not accessible from the ADIs. ADIs
can request interrupt generation only through the device’s ‘Interrupt
Message Generation Logic’, which allows an ADI to only generate
interrupt messages that are associated with that specific ADI. These
restrictions ensure that the host driver has complete control over
which interrupt messages can be generated by each ADI.
On Intel 64 architecture platforms, message signaled interrupts are
issued as DWORD size untranslated memory writes without a PASID TLP
Prefix, to address range 0xFEExxxxx. Since all memory requests
generated by ADIs include a PASID TLP Prefix, it is not possible for
an ADI to generate a DMA write that would be interpreted by the
platform as an interrupt message."
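
In other words, the platform-side classification which the spec relies
on boils down to something like this (hypothetical sketch, not real
IOMMU code):

#include <stdbool.h>
#include <stdint.h>

/*
 * On Intel 64, only a DWORD untranslated write without PASID TLP
 * prefix to the 0xFEExxxxx range is treated as an interrupt message.
 */
static bool is_interrupt_message(uint64_t addr, bool has_pasid_prefix)
{
	if (has_pasid_prefix)
		return false;		/* Always a plain DMA write */
	return (addr >> 20) == 0xFEE;
}

Since an ADI always emits the PASID prefix, its writes can never hit
the interrupt path.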
That's the reductio ad absurdum for this sentence in the first paragraph
of the preceding chapter describing the concept of IMS:
"IMS enables devices to store the interrupt messages for ADIs in a
device-specific optimized manner without the scalability restrictions
of the PCI Express defined MSI-X capability."
"Device-specific optimized manner" is either wishful thinking or
marketing induced verbal diarrhoea.
The current specification puts massive restrictions on IMS storage
which do _not_ allow optimizing it in a device-specific manner, as
demonstrated in this discussion.
It also precludes obvious use cases like passing a full device to a
guest and letting the guest manage SIOV subdevices for containers or
nested guests.
TBH, to me this is just another hastily cobbled together,
half-thought-out misfeature cast in silicon. The proposed software
support follows exactly the same principle.
So before we go anywhere with this, I want to see a proper way forward
to support _all_ sensible use cases and to fulfil the promise of a
"device-specific optimized manner" at the conceptual, specification,
and code level.
I'm not at all interested in rushing in support for a half-baked
Intel-centric solution which other people have to clean up after the
fact (again).
IOW, it's time to go back to the drawing board.
Thanks,
tglx