linux-kernel - RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

Message-ID: <MWHPR11MB188698FBEE62AF1313E0F7AC8C569@MWHPR11MB1886.namprd11.prod.outlook.com>
Date:   Sat, 8 May 2021 09:56:59 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     "Raj, Ashok" <ashok.raj@...el.com>,
        Jason Gunthorpe <jgg@...dia.com>
CC:     Jean-Philippe Brucker <jean-philippe@...aro.org>,
        Jacob Pan <jacob.jun.pan@...ux.intel.com>,
        Alex Williamson <alex.williamson@...hat.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>,
        Auger Eric <eric.auger@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Joerg Roedel <joro@...tes.org>,
        Lu Baolu <baolu.lu@...ux.intel.com>,
        David Woodhouse <dwmw2@...radead.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Li Zefan <lizefan@...wei.com>,
        Johannes Weiner <hannes@...xchg.org>,
        "Jean-Philippe Brucker" <jean-philippe@...aro.com>,
        Jonathan Corbet <corbet@....net>, "Wu, Hao" <hao.wu@...el.com>,
        "Jiang, Dave" <dave.jiang@...el.com>
Subject: RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation
 APIs

> From: Raj, Ashok <ashok.raj@...el.com>
> Sent: Friday, May 7, 2021 12:33 AM
> 
> > Basically it means when the guest's top level IOASID is created for
> > nesting that IOASID claims all PASID's on the RID and excludes any
> > PASID IOASIDs from existing on the RID now or in future.
> 
> The way to look at it this is as follows:
> 
> For platforms that do not have a need to support shared work queue model
> support for ENQCMD or similar, PASID space is naturally per RID. There is no
> complication with this. Every RID has the full range of PASID's and no need
> for host to track which PASIDs are allocated now or in future in the guest.
> 
> For platforms that support ENQCMD, it is required to mandate PASIDs are
> global across the entire system. Maybe its better to call them gPASID for
> guest and hPASID for host. Short reason being gPASID->hPASID is a guest
> wide mapping for ENQCMD and not a per-RID based mapping. (We covered that
> in earlier responses)
> 
> In our current implementation we actually don't separate this space, and
> gPASID == hPASID. The iommu driver enforces that by using the custom
> allocator and the architected interface that allows all guest vIOMMU
> allocations to be proxied to host. Nothing but a glorified hypercall like
> interface. In fact some OS's do use hypercall to get a hPASID vs using
> the vCMD style interface.
> 

After more thought about the new interface, I feel gPASID==hPASID
actually causes some confusion in the uAPI design. Conceptually an ioasid
is not active until it's attached to a device, because without a device
it's just an ID. So an ioasid should supposedly reject all user commands
before attach. However, a guest will likely ask for a new gPASID before
attaching it to devices and the vIOMMU. If gPASID==hPASID, then Qemu
must request /dev/ioasid to allocate a hw_id for an ioasid which hasn't
been attached to any device, relying on the kernel knowing that this
hw_id comes from a global allocator with no dependency on any
device. This doesn't sound like a clean design, not to mention that it
also conflicts with live migration.

I'd like to hear your and Jason's opinions on an alternative option that
removes this restriction, thus allowing gPASID!=hPASID.

gPASID!=hPASID has a problem when assigning a physical device which
supports both shared work queues (ENQCMD with PASID in MSR)
and dedicated work queues (PASID in device register) to a guest
process which is associated with a gPASID. Say the host kernel has set up
the hPASID entry with nested translation through /dev/ioasid. For a
shared work queue the CPU is configured to translate the gPASID in the MSR
into an **hPASID** before the payload goes out on the wire. However,
for a dedicated work queue the device MMIO register is directly mapped
to and programmed by the guest, and thus contains a **gPASID** value,
which means DMA requests through this interface will hit IOMMU faults
due to an invalid gPASID entry. Having gPASID==hPASID is a simple
workaround here. mdev doesn't have this problem because the
PASID register is in the emulated control path and thus can be translated
to hPASID manually by the mdev driver.
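
To make the mismatch concrete, here is a minimal user-space sketch in C.
It is only a toy model, not kernel or hardware code: the structures
(pasid_table, pasid_xlate) and helpers (swq_wire_pasid, dwq_wire_pasid,
iommu_lookup) are hypothetical names invented for illustration, and the
table size is arbitrary. It models the CPU translating the gPASID for a
shared work queue while a dedicated work queue sends the raw gPASID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PASID_MAX     64          /* arbitrary size for this toy model */
#define PASID_INVALID UINT32_MAX

/* Hypothetical per-RID PASID table: valid[p] means an entry is set up. */
struct pasid_table {
	bool valid[PASID_MAX];
};

/* Hypothetical guest-wide gPASID->hPASID table used for ENQCMD. */
struct pasid_xlate {
	uint32_t g2h[PASID_MAX];
};

static void xlate_init(struct pasid_xlate *x)
{
	assert(x != NULL);
	for (size_t i = 0; i < PASID_MAX; i++)
		x->g2h[i] = PASID_INVALID;
}

/* Shared work queue: the gPASID in the MSR is translated to an hPASID. */
static uint32_t swq_wire_pasid(const struct pasid_xlate *x, uint32_t gpasid)
{
	return gpasid < PASID_MAX ? x->g2h[gpasid] : PASID_INVALID;
}

/* Dedicated work queue: the guest programs the register, no translation. */
static uint32_t dwq_wire_pasid(uint32_t gpasid)
{
	return gpasid;
}

/* Returns false where real hardware would raise an IOMMU fault. */
static bool iommu_lookup(const struct pasid_table *t, uint32_t pasid)
{
	return pasid < PASID_MAX && t->valid[pasid];
}
```

With only the hPASID entry set up, a lookup on the shared-queue path
succeeds while the same gPASID on the dedicated-queue path misses, which
is the fault case described above.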

Along these lines, one possible option is to have both the gPASID and
hPASID entries point to the same paging structure, in effect making gPASID
an aliasing hw_id for hPASID. Then we also need to make sure the gPASID
range doesn't collide with the hPASID range for this RID. Something like
below:

In the beginning Qemu specifies a minimal ID (say 1024) below which
hPASIDs must not be allocated (in effect delegating [0, 1023] of this RID
to userspace):
	ioctl(ioasid_fd, SET_IOASID_MIN_HWID, 1024);

The guest still uses the vIOMMU interface or a hypercall to allocate
gPASIDs. Upon such a request, Qemu returns a gPASID from [0, 1023] to the
guest and also allocates a new ioasid from /dev/ioasid. There is no hw_id
allocated at this step:
	ioasid = ioctl(ioasid_fd, ALLOC_IOASID);

hw_id (hPASID) is allocated when attaching ioasid to the said device:
	ioctl(device_fd, VFIO_ATTACH_IOASID, ioasid);

Then gPASID is provided as an aliasing hw_id for this ioasid:
	ioctl(device_fd, VFIO_ALIASING_IOASID, ioasid, gPASID);
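
The four steps above can be modelled in plain C. This is a sketch of
the bookkeeping only: it does not call any real ioctl, and every name in
it (ioasid_ctx, ctx_set_min_hwid, ctx_alloc_ioasid, ctx_attach,
ctx_alias) is hypothetical. Requiring attach before alias simply follows
the order of the steps and is an assumption of this model, as is the
simple counter standing in for the global hPASID allocator.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IOASID 16             /* arbitrary limit for this model */
#define NO_HWID    UINT32_MAX

struct ioasid_ent {
	bool     in_use;
	uint32_t hpasid;          /* assigned at attach time */
	uint32_t gpasid;          /* optional alias from [0, min_hwid) */
};

struct ioasid_ctx {
	uint32_t min_hwid;        /* hPASIDs are never below this value */
	uint32_t next_hpasid;     /* stand-in for the global allocator */
	struct ioasid_ent ent[MAX_IOASID];
};

static bool ctx_valid(const struct ioasid_ctx *c, int id)
{
	return id >= 0 && id < MAX_IOASID && c->ent[id].in_use;
}

/* Step 1: models SET_IOASID_MIN_HWID. */
static void ctx_set_min_hwid(struct ioasid_ctx *c, uint32_t min)
{
	assert(c != NULL);
	c->min_hwid = min;
	if (c->next_hpasid < min)
		c->next_hpasid = min;
}

/* Step 2: models ALLOC_IOASID; no hw_id is assigned yet. */
static int ctx_alloc_ioasid(struct ioasid_ctx *c)
{
	for (int i = 0; i < MAX_IOASID; i++) {
		if (!c->ent[i].in_use) {
			c->ent[i].in_use = true;
			c->ent[i].hpasid = NO_HWID;
			c->ent[i].gpasid = NO_HWID;
			return i;
		}
	}
	return -1;
}

/* Step 3: models VFIO_ATTACH_IOASID; the hPASID is allocated here. */
static int ctx_attach(struct ioasid_ctx *c, int id)
{
	if (!ctx_valid(c, id) || c->ent[id].hpasid != NO_HWID)
		return -1;
	c->ent[id].hpasid = c->next_hpasid++;
	return 0;
}

/* Step 4: models VFIO_ALIASING_IOASID; gPASID must be in the delegated range. */
static int ctx_alias(struct ioasid_ctx *c, int id, uint32_t gpasid)
{
	if (!ctx_valid(c, id) || c->ent[id].hpasid == NO_HWID)
		return -1;
	if (gpasid >= c->min_hwid)
		return -1;
	for (int i = 0; i < MAX_IOASID; i++)
		if (c->ent[i].in_use && c->ent[i].gpasid == gpasid)
			return -1;
	c->ent[id].gpasid = gpasid;
	return 0;
}
```

Because the alias is rejected unless it is below min_hwid and the hPASID
is never below it, the two ranges cannot collide within this model.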

Starting from this point the kernel should make sure that any ioasid
operation is applied to both gPASID and hPASID for this RID
(entry setup, teardown, etc.) and that both PASID entries point to the
same paging structures. When a page fault happens, the IOMMU driver
should also link a fault from either PASID back to the associated ioasid.
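
As a rough illustration of that mirroring requirement, the following C
sketch keeps a toy per-RID PASID table in which each entry records a
page-table pointer and its owning ioasid. All names (rid_pasid_table,
ioasid_hw, ioasid_setup, ioasid_teardown, fault_to_ioasid) are
hypothetical and the layout has nothing to do with the real VT-d PASID
entry format; it only shows the "apply to both, resolve from either"
idea.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PASID_MAX 2048            /* arbitrary size for this toy model */
#define NO_PASID  UINT32_MAX

/* Toy PASID entry: pgtable == NULL means "not present". */
struct pasid_entry {
	const void *pgtable;
	int         ioasid;
};

struct rid_pasid_table {
	struct pasid_entry e[PASID_MAX];
};

/* The hw_ids bound to one ioasid on this RID. */
struct ioasid_hw {
	int      ioasid;
	uint32_t hpasid;
	uint32_t gpasid;          /* NO_PASID if no alias was requested */
};

static void set_one(struct rid_pasid_table *t, uint32_t pasid,
		    const void *pgt, int ioasid)
{
	if (pasid == NO_PASID || pasid >= PASID_MAX)
		return;
	t->e[pasid].pgtable = pgt;
	t->e[pasid].ioasid = pgt ? ioasid : -1;
}

/* Entry setup is applied to both the hPASID and the gPASID alias. */
static void ioasid_setup(struct rid_pasid_table *t,
			 const struct ioasid_hw *hw, const void *pgt)
{
	assert(pgt != NULL);
	set_one(t, hw->hpasid, pgt, hw->ioasid);
	set_one(t, hw->gpasid, pgt, hw->ioasid);
}

/* Teardown likewise clears both entries. */
static void ioasid_teardown(struct rid_pasid_table *t,
			    const struct ioasid_hw *hw)
{
	set_one(t, hw->hpasid, NULL, -1);
	set_one(t, hw->gpasid, NULL, -1);
}

/* A fault on either PASID resolves to the same ioasid, or -1 if none. */
static int fault_to_ioasid(const struct rid_pasid_table *t, uint32_t pasid)
{
	if (pasid >= PASID_MAX || t->e[pasid].pgtable == NULL)
		return -1;
	return t->e[pasid].ioasid;
}
```

Storing the owning ioasid in both entries is one way to make fault
reporting independent of which PASID the device happened to use; a
reverse lookup table would serve equally well.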

As explained earlier, this aliasing requirement only applies to physical
devices on Intel platforms. We may either have mdev ignore such
aliasing requests, or have the vfio device report whether aliasing is
allowed.

This is sort of a hybrid model. The gPASID range is reserved locally
in the per-RID PASID space and delegated to userspace, while the hPASIDs
are still managed globally and not exposed to userspace.

Does this sound like a cleaner approach (still with some complexity)
compared to the restrictions of having gPASID==hPASID?

Thanks
Kevin