[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20210331173148.GN1463678@nvidia.com>
Date: Wed, 31 Mar 2021 14:31:48 -0300
From: Jason Gunthorpe <jgg@...dia.com>
To: Jacob Pan <jacob.jun.pan@...ux.intel.com>
Cc: Jean-Philippe Brucker <jean-philippe@...aro.org>,
LKML <linux-kernel@...r.kernel.org>,
Joerg Roedel <joro@...tes.org>,
Lu Baolu <baolu.lu@...ux.intel.com>,
David Woodhouse <dwmw2@...radead.org>,
iommu@...ts.linux-foundation.org, cgroups@...r.kernel.org,
Tejun Heo <tj@...nel.org>, Li Zefan <lizefan@...wei.com>,
Johannes Weiner <hannes@...xchg.org>,
Jean-Philippe Brucker <jean-philippe@...aro.com>,
Alex Williamson <alex.williamson@...hat.com>,
Eric Auger <eric.auger@...hat.com>,
Jonathan Corbet <corbet@....net>,
Raj Ashok <ashok.raj@...el.com>,
"Tian, Kevin" <kevin.tian@...el.com>, Yi Liu <yi.l.liu@...el.com>,
Wu Hao <hao.wu@...el.com>, Dave Jiang <dave.jiang@...el.com>
Subject: Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and
allocation APIs
On Wed, Mar 31, 2021 at 09:34:57AM -0700, Jacob Pan wrote:
> "3.3 PASID translation
> To support PASID isolation for Shared Work Queues used by VMs, the CPU must
> provide a way for the PASID to be communicated to the device in the DMWr
> transaction. On Intel CPUs, the CPU provides a PASID translation table in
> the vCPUs virtual machine control structures. During ENQCMD/ENQCMDS
> instruction execution in a VM, the PASID translation table is used by the
> CPU to replace the guest PASID in the work descriptor with a host PASID
> before the descriptor is sent to the device.3.3 PASID translation"
Yikes, a special ENQCMD table in the hypervisor!
Still, pass the /dev/ioasid into a KVM IOCTL and tell it to populate
this table. KVM only adds to the table when userspace presents a
/dev/ioasid FD.
> > Doesn't work submission go either to the mdev driver or through the
> > secure PASID of #1?
>
> No, once a PASID is bound with IOMMU, KVM, and the mdev, work
> submission is all done in HW. But I don't think this will change
> for either uAPI design.
The big note here is "only for things that use ENQCMD" and that is
basically nothing these days.
> > Everything should revolve around the /dev/ioasid FD. qemu should pass
> > it to all places that need to know about PASID's in the VM.
>
> I guess we need to extend KVM interface to support PASIDs. Our original
> intention was to avoid introducing new interfaces.
New features need new interfaces, especially if there is a security
sensitivity! KVM should *not* automatically opt into security
sensitive stuff without being explicitly told what to do.
Here you'd need to authorized *two* things for IDXD:
- The mdev needs to be told it is allowed to use PASID, this tells
the IOMMU driver to connect the pci device under the mdev
- KVM needs to be told to populate a vPASID to the 'ENQCMD'
security table translated to a physical PASID.
If qemu doesn't explicitly enable the ENQCMD security table it should
be *left disabled* by KVM - even if someone else is using PASID in the
same process. And the API should be narrow like this just to the
EQNCMD table as who knows what will come down the road, or how it will
work.
Having a PASID wrongly leak out into the VM would be a security
disaster. Be explicit.
> > We should try to avoid hidden behind the scenes kernel
> > interconnections between subsystems.
> >
> Can we? in case of exception. Since all these IOCTLs are coming from the
> unreliable user space, we must deal all exceptions.
>
> For example, when user closes /dev/ioasid FD before (or w/o) unbind IOCTL
> for VFIO, KVM, kernel must do cleanup and coordinate among subsystems.
> In this patchset, we have a per mm(ioasid_set) notifier to inform mdev, KVM
> to clean up and drop its refcount. Do you have any suggestion on this?
The ioasid should be a reference counted object.
When the FD is closed, or the ioasid is "destroyed" it just blocks DMA
and parks the PASID until *all* places release it. Upon a zero
refcount the PASID is recycled for future use.
The duration between unmapping the ioasid and releasing all HW access
will have HW see PCIE TLP errors due to the blocked access. If
userspace messes up the order it is fine to cause this. We already had
this dicussion when talking about how to deal with process exit in the
simple SVA case.
> Let me try to paraphrase, you are suggesting common helper code and data
> format but still driver specific storage of the mapping, correct?
The driver just needs to hold the datastructure in its memory.
Like an xarray, the driver can have an xarray inside its struct
device, but the xarray library provides all the implementation.
Jason
Powered by blists - more mailing lists