Message-ID: <MWHPR11MB18861FE6982D73AFBF173E048C439@MWHPR11MB1886.namprd11.prod.outlook.com>
Date:   Sun, 25 Apr 2021 09:24:46 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     Jason Gunthorpe <jgg@...dia.com>
CC:     Alex Williamson <alex.williamson@...hat.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>,
        Jacob Pan <jacob.jun.pan@...ux.intel.com>,
        Auger Eric <eric.auger@...hat.com>,
        Jean-Philippe Brucker <jean-philippe@...aro.org>,
        LKML <linux-kernel@...r.kernel.org>,
        Joerg Roedel <joro@...tes.org>,
        Lu Baolu <baolu.lu@...ux.intel.com>,
        David Woodhouse <dwmw2@...radead.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        Tejun Heo <tj@...nel.org>, Li Zefan <lizefan@...wei.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Jean-Philippe Brucker <jean-philippe@...aro.com>,
        Jonathan Corbet <corbet@....net>,
        "Raj, Ashok" <ashok.raj@...el.com>, "Wu, Hao" <hao.wu@...el.com>,
        "Jiang, Dave" <dave.jiang@...el.com>
Subject: RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation
 APIs

> From: Jason Gunthorpe <jgg@...dia.com>
> Sent: Friday, April 23, 2021 7:50 PM
> 
> On Fri, Apr 23, 2021 at 09:06:44AM +0000, Tian, Kevin wrote:
> 
> > Or could we still have just one /dev/ioasid but allow userspace to create
> > multiple gpa_ioasid_id's each associated with a different iommu domain?
> > Then the compatibility check will be done at ATTACH_IOASID instead of
> > JOIN_IOASID_FD.
> 
> To my mind what makes sense is that /dev/ioasid presents a single
> IOMMU behavior that is basically the same. This may ultimately not be
> what we call a domain today.
> 
> We may end up with a middle object which is a group of domains that
> all have the same capabilities, and we define capabilities in a way
> that most platforms have a single group of domains.
> 
> The key capability of a group of domains is they can all share the HW
> page table representation, so if an IOASID instantiates a page table
> it can be assigned to any device on any domain in the group of domains.

Sorry, I didn't quite get it. If a group of domains can share the same
page table, then why not just attach all devices under those domains
to a single domain? IMO the iommu domain was introduced to describe
the HW page table. Ideally a new iommu domain should be created only
when it's impossible to share an existing page table. Otherwise you'll
get bad iotlb efficiency: each domain has its own unique domain id
(tagged in the iotlb), so duplicated iotlb entries may exist even when
a single page table is shared by those domains.

Does it then imply that you are actually suggesting that /dev/ioasid be
associated with a single 2nd-level page table (which can be nested by
multiple 1st-level page tables represented by other ioasids), and thus
a single iommu domain for all devices linked to compatible IOMMUs?

Or, can you elaborate on the targeted usage of having a group of
domains which all share the same page table?

> 
> If you try to say that /dev/ioasid has many domains and they can't
> have their HW page tables shared then I think the implementation
> complexity will explode.

I'd like to hear your opinion on one open question here. There is no
doubt that an ioasid represents a HW page table when the table is
constructed by userspace and then linked to the IOMMU through the
bind/unbind API. But I'm not so sure whether an ioasid should represent
the exact pgtable or just the mapping metadata when the underlying
pgtable is indirectly constructed through the map/unmap API. VFIO takes
the latter approach, which is why it allows multiple incompatible
domains in a single container, all sharing the same mapping metadata.

> 
> > This does impose one burden on userspace though, to understand the
> > IOMMU compatibilities and figure out which incompatible features may
> > affect the page table management (while such knowledge is IOMMU
> > vendor specific) and then explicitly manage multiple /dev/ioasid's or
> > multiple gpa_ioasid_id's.
> 
> Right, this seems very hard in the general case..
> 
> > Alternatively, is it a good design to have the kernel return an error
> > at attach/join time to indicate that an incompatibility is detected,
> > so that userspace opens a new /dev/ioasid or creates a new
> > gpa_ioasid_id for the failing device upon such failure, w/o
> > constructing its own compatibility knowledge?
> 
> Yes, this feels workable too
> 
> > > This means qemu might have multiple /dev/ioasid's if the system has
> > > multiple incompatible IOMMUs (is this actually a thing?) The platform
> >
> > One example is Intel platform with igd. Typically there is one IOMMU
> > dedicated for igd and the other IOMMU serving all the remaining devices.
> > The igd IOMMU may not support IOMMU_CACHE while the other one
> > does.
> 
> If we can do as above the two domains may be in the same group of
> domains and the IOMMU_CACHE is not exposed at the /dev/ioasid level.
> 
> For instance the API could specify IOMMU_CACHE during attach, not
> during IOASID creation.
> 
> Getting all the data model right in the API is going to be the
> trickiest part of this.
> 
> > yes, e.g. in vSVA both devices (behind divergent IOMMUs) are bound
> > to a single guest process which has a unique PASID and 1st-level page
> > table. The earlier incompatibility example is only for the 2nd-level.
> 
> Because when we get to here, things become inscrutable as an API if
> you are trying to say two different IOMMU presentations can actually
> be nested.
> 
> > > Sure.. The tricky bit will be to define both of the common nested
> > > operating modes.
> > >
> > >   nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id);
> > >   ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
> > >
> > >    // IOMMU will match on the device RID, no PASID:
> > >   ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
> > >
> > >    // IOMMU will match on the device RID and PASID:
> > >   ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
> >
> > I'm a bit confused here why we have both pasid and ioasid notations
> > together. Why not use nested_ioasid as pasid directly (i.e. every
> > pasid in nested mode is created by CREATE_NESTED_IOASID)?
> 
> The IOASID is not a PASID, it is just a page table.
> 
> A generic IOMMU matches on either RID or (RID,PASID), so you should
> specify the PASID when establishing the match.
> 
> IOASID only specifies the page table.
> 
> So you read the above as configuring the path
> 
>   PCI_DEVICE -> (RID,PASID) -> nested_ioasid -> gpa_ioasid_id -> physical
> 
> Where (RID,PASID) indicate values taken from the PCI packet.
> 
> In principle the IOMMU could also be commanded to reuse the same
> ioasid page table with a different PASID:
> 
>   PCI_DEVICE_B -> (RID_B,PASID_B) -> nested_ioasid -> gpa_ioasid_id -> physical
> 
> This is impossible if the ioasid == PASID in the API.

OK, now I see where the disconnect comes from. In my context an ioasid
is the identifier that is actually used on the wire, but it seems you
treat it as a sw-defined namespace purely for representing page tables.
We should clear up this concept first before further discussing other
details. 😊

Below is the description from when the kernel ioasid allocator was introduced:

--
commit fa83433c92e340822a056a610a4fa2063a3db304
Author: Jean-Philippe Brucker <jean-philippe.brucker@....com>
Date:   Wed Oct 2 12:42:41 2019 -0700

    iommu: Add I/O ASID allocator

    Some devices might support multiple DMA address spaces, in particular
    those that have the PCI PASID feature. PASID (Process Address Space ID)
    allows to share process address spaces with devices (SVA), partition a
    device into VM-assignable entities (VFIO mdev) or simply provide
    multiple DMA address space to kernel drivers. Add a global PASID
    allocator usable by different drivers at the same time. Name it I/O ASID
    to avoid confusion with ASIDs allocated by arch code, which are usually
    a separate ID space.

    The IOASID space is global. Each device can have its own PASID space,
    but by convention the IOMMU ended up having a global PASID space, so
    that with SVA, each mm_struct is associated to a single PASID.

    The allocator is primarily used by IOMMU subsystem but in rare occasions
    drivers would like to allocate PASIDs for devices that aren't managed by
    an IOMMU, using the same ID space as IOMMU.
--

ioasid and pasid are used interchangeably within the kernel, and the
ioasid value returned by the ioasid allocator is used directly as the
PASID when the driver programs the IOMMU and the device. My context is
based on this understanding, which is why I thought nested_ioasid could
be used directly as a PASID in my earlier reply. Do you see a problem
with this approach?
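
For reference, a minimal sketch of the in-kernel usage implied above,
assuming the ioasid.h API from that commit (the allocation range below
is made up for illustration):

    #include <linux/ioasid.h>

    /* The value handed back by ioasid_alloc() is programmed into the
     * IOMMU and the device as the PASID itself -- the same value that
     * later appears on the wire. */
    static ioasid_t example_alloc_pasid(struct ioasid_set *set, void *priv)
    {
            /* allocate from [1, 0xfffff]; INVALID_IOASID on failure */
            ioasid_t pasid = ioasid_alloc(set, 1, 0xfffff, priv);

            if (pasid == INVALID_IOASID)
                    return INVALID_IOASID;

            /* ... program 'pasid' into the IOMMU PASID table and into
             * the device's PCI PASID capability ... */
            return pasid;
    }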

Then, following your proposal, does it mean that we need another
interface for allocating PASIDs? And since ioasid would then mean
different things in the uAPI and in the in-kernel API, is a new name
possibly required to avoid confusion?
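
To make the question concrete, such a separate allocation interface
might look something like the sketch below. This is entirely
hypothetical, only meant to illustrate splitting wire-visible PASID
allocation from the page-table ioasid object:

    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Hypothetical only -- illustrates a PASID allocation interface
     * separate from the page-table ioasid. */
    struct ioasid_alloc_pasid {
            uint32_t min;    /* inclusive lower bound of requested range */
            uint32_t max;    /* inclusive upper bound */
            uint32_t pasid;  /* out: wire-visible PASID */
    };
    #define IOASID_ALLOC_PASID _IOWR('i', 3, struct ioasid_alloc_pasid)

    /* The PASID is then what the IOMMU matches on, while the ioasid
     * still only names a page table:
     *
     *   ioctl(vfio_device, ATTACH_IOASID_PASID, args.pasid, nested_ioasid);
     */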

> 
> > Below I list different scenarios for ATTACH_IOASID in my view. Here
> > vfio_device could be a real PCI function (RID), or a subfunction device
> > (RID+def_ioasid).
> 
> What is RID+def_ioasid? The IOMMU does not match on IOASID's.
> 
> A subfunction device always needs to use a PASID, or an internal IOMMU;
> confused what you are trying to explain?

Here, per the explanation above, the def_ioasid is the PASID that is
associated with the subfunction in my context.

> 
> > If the whole PASID table is delegated to the guest in the ARM case, the guest
> > can select its own PASIDs w/o telling the hypervisor.
> 
> The hypervisor has to route the PASID's to the guest at some point - a
> guest can't just claim a PASID unilaterally, that would not be secure.
> 
> If it is not done with per-PASID hypercalls then the hypervisor has to
> route all PASID's for a RID to the guest and /dev/ioasid needs to have
> a nested IOASID object that represents this connection - ie it points
> to the PASID table of the guest vIOMMU or something.

yes, this might be the model that will work for the ARM case. In their
architecture the PASID table is located in GPA space and is thus
naturally managed by the guest (though Jean once mentioned a tricky
method to allow the host to manage it by stealing a GPA window).

> 
> Remember this all has to be compatible with mdevs too, and without
> hypercalls to create PASIDs that will be hard: mdev sharing a RID and
> slicing the physical PASIDs can't support a 'send all PASIDs to the
> guest' model, or even a 'the guest gets to pick the PASID' option.
> 

yes, with mdev the above is inevitable. I guess ARM may need some
extension in their SMMU similar to what VT-d provides for mdev usage,
e.g. at least not mandating that the PASID table be in GPA space. But
before that, they may still expect a way to delegate the whole per-RID
PASID space to the guest.

There really are lots of subtle differences to generalize...

Thanks
Kevin
