linux-kernel - RE: Plan for /dev/ioasid RFC v2

Message-ID: <BN9PR11MB543382665D34E58155A9593C8C039@BN9PR11MB5433.namprd11.prod.outlook.com>
Date:   Mon, 28 Jun 2021 06:45:23 +0000
From:   "Tian, Kevin" <kevin.tian@...el.com>
To:     Jason Gunthorpe <jgg@...dia.com>
CC:     "Alex Williamson (alex.williamson@...hat.com)" 
        <alex.williamson@...hat.com>, Joerg Roedel <joro@...tes.org>,
        "Jean-Philippe Brucker" <jean-philippe@...aro.org>,
        David Gibson <david@...son.dropbear.id.au>,
        Jason Wang <jasowang@...hat.com>,
        "parav@...lanox.com" <parav@...lanox.com>,
        "Enrico Weigelt, metux IT consult" <lkml@...ux.net>,
        Paolo Bonzini <pbonzini@...hat.com>,
        Shenming Lu <lushenming@...wei.com>,
        Eric Auger <eric.auger@...hat.com>,
        Jonathan Corbet <corbet@....net>,
        "Raj, Ashok" <ashok.raj@...el.com>,
        "Liu, Yi L" <yi.l.liu@...el.com>, "Wu, Hao" <hao.wu@...el.com>,
        "Jiang, Dave" <dave.jiang@...el.com>,
        Jacob Pan <jacob.jun.pan@...ux.intel.com>,
        "Kirti Wankhede" <kwankhede@...dia.com>,
        Robin Murphy <robin.murphy@....com>,
        "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
        "iommu@...ts.linux-foundation.org" <iommu@...ts.linux-foundation.org>,
        "David Woodhouse" <dwmw2@...radead.org>,
        LKML <linux-kernel@...r.kernel.org>,
        "Lu Baolu" <baolu.lu@...ux.intel.com>
Subject: RE: Plan for /dev/ioasid RFC v2

> From: Jason Gunthorpe <jgg@...dia.com>
> Sent: Friday, June 25, 2021 10:36 PM
> 
> On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
> 
> > -   When receiving the binding call for the 1st device in a group, iommu_fd
> >     calls iommu_group_set_block_dma(group, dev->driver) which does
> >     several things:
> 
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
> 
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary. We can always get the group back from the device at any
> point in the sequence to do a group-wide operation.
> 
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group centric API and then we don't have to deal with them at all, or
> at least not right away.
> 
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.
> 
> Then maybe groups where all devices use the same IOASID.
> 
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subsystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
> 
> But at least we can talk about each step with well thought out patches
> 
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
> 
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.
> 

Thinking more along your direction, here is an updated sketch:

[Stage-1]

Multi-device groups (1:N) are handled by the existing vfio group fd and 
vfio_iommu_type1 driver.

Singleton groups (1:1) are handled via a new device-centric protocol:

1)   /dev/vfio/device nodes are created for devices in a singleton group
       or devices w/o a group (mdev)

2)   user gets iommu_fd by open("/dev/iommu"). A default block_dma
       domain is created per iommu_fd (or globally) with an empty I/O 
       page table. 

3)   iommu_fd reports that only 1:1 groups are supported

4)   user gets device_fd by open("/dev/vfio/device"). At this point
       mmap() should be blocked since a security context hasn't been 
       established for this fd. This could be done by returning an error 
       (EACCES or EAGAIN?), or succeeding w/o actually setting up the 
       mapping.

5)   user requests to bind device_fd to iommu_fd which verifies the 
       group is not 1:N (for mdev the check is on the parent device). 
       Successful binding automatically attaches the device to the 
       block_dma domain via iommu_attach_group(). From now on the user is 
       permitted to access the device. If mmap() in 4) is allowed, vfio 
       actually sets up the MMIO mapping at this point.

6)   before the device is unbound from iommu_fd, it is always in a 
       security context. Attaching/detaching just switches the security
       context between the block_dma domain and an ioasid domain.

7)   Unbinding detaches the device from the block_dma domain and
       re-attaches it to the default domain. From now on the user should 
       be denied access to the device. vfio should tear down the
       MMIO mapping at this point.
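
To make the stage-1 lifecycle concrete, here is a minimal userspace toy
model of steps 4) to 7). It is only a sketch, not proposed kernel code:
the class and method names are made up for illustration, EACCES is one of
the two candidates mentioned in 4), and the EINVAL returns are my own
assumption rather than anything agreed in this thread.

```python
import errno
from enum import Enum, auto


class Domain(Enum):
    DEFAULT = auto()    # kernel default domain, not safe for user access
    BLOCK_DMA = auto()  # empty I/O page table, all DMA blocked
    IOASID = auto()     # user-managed I/O address space


class Stage1Device:
    """Toy model of one device_fd (hypothetical names throughout)."""

    def __init__(self, group_size=1):
        self.group_size = group_size
        self.domain = Domain.DEFAULT
        self.bound = False
        self.mmio_mapped = False

    def mmap(self):
        # Step 4: no security context yet, so the mapping is refused.
        if not self.bound:
            return -errno.EACCES
        self.mmio_mapped = True
        return 0

    def bind(self):
        # Step 5: stage-1 accepts singleton groups only.
        # EINVAL is an assumed error code.
        if self.group_size != 1:
            return -errno.EINVAL
        self.bound = True
        self.domain = Domain.BLOCK_DMA
        return 0

    def attach_ioasid(self):
        # Step 6: switch between security contexts while bound.
        if not self.bound:
            return -errno.EINVAL  # assumed error code
        self.domain = Domain.IOASID
        return 0

    def detach_ioasid(self):
        # Step 6: detaching falls back to block_dma, never to default.
        if self.bound:
            self.domain = Domain.BLOCK_DMA

    def unbind(self):
        # Step 7: tear down MMIO and return to the default domain.
        self.mmio_mapped = False
        self.bound = False
        self.domain = Domain.DEFAULT
```

The point of the model is that a bound device is only ever in BLOCK_DMA or
IOASID; DEFAULT is reachable only through unbind().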

[Stage-2]

Both 1:1 and 1:N groups are handled via the new device-centric protocol. 
Old vfio uAPI is kept for legacy applications. All devices in the same group 
must share the same I/O address space.

A key difference from stage-1 is the additional check on group viability:

1)   vfio creates /dev/vfio/device nodes for all devices

2)   Same as stage-1 for getting iommu_fd

3)   iommu_fd reports that both 1:1 and 1:N groups are supported

4)   Same as stage-1 for getting device_fd 

5)   when receiving the binding call for the 1st device in a group, 
       iommu_fd does several things:

        a) Identify the group of this device and check group viability. A group 
            is viable only when all devices in the group are in one of the 
            states below:

                * driver-less
                * bound to the same driver as the one which makes the
                   binding call (vfio in this case)
                * bound to an otherwise allowed driver (which indicates that it
                   is safe for iommu_fd usage around probe())

        b) Attach all devices in the group to the block_dma domain, via existing
             iommu_attach_group().

        c) Register a notifier callback to verify group viability on the
            IOMMU_GROUP_NOTIFY_BOUND_DRIVER event. The BUG_ON() on a 
            violation might be eliminated if we can find a way to deny 
            probe of non-iommu-safe drivers.
                 
        From now on the user is permitted to access the device. Similar to 
        stage-1, vfio may set up the MMIO mapping at this point.

6)   Binding other devices in the same group just succeeds

7)   Before the last device in the group is unbound from iommu_fd, all
       devices in the group (even those not bound to iommu_fd) switch 
       together between the block_dma domain and an ioasid domain, 
       initiated by attaching to or detaching from an ioasid.

        a) iommu_fd verifies that all bound devices in the same group are
            attached to a single IOASID.

        b) the 1st device attach in the group moves the entire group to use 
             the new IOASID domain.

        c) the last device detach moves the entire group back to the block_dma
            domain.

8)   A device is allowed to be unbound from iommu_fd when other devices
       in the group are still bound. In this case all devices in this group are still
       attached to a security context (block_dma or ioasid). vfio may still zap 
       the mmio mapping (though still in a security context) since it has no 
       knowledge of the group in this new flow. The unbound device should 
       not be bound to another driver which could break the group viability.

9)   When the user requests to unbind the last device in the group, iommu_fd
       detaches the whole group from the block_dma domain. All mmio mappings
       must be zapped immediately. Devices in the group are re-attached to 
       the default domain from now on (not safe for user to access).
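
The group-wide rules in 5) to 9) can be sketched the same way. Again this
is only a userspace toy model under my own assumptions: the Device/Group
classes, the string-valued domain and the contents of ALLOWED_DRIVERS are
hypothetical, and unbind() implicitly detaching the device is a
simplification that the flow above does not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical allow-list of drivers considered safe next to iommu_fd use.
ALLOWED_DRIVERS = {"pcieport"}


@dataclass
class Device:
    name: str
    driver: Optional[str] = None  # None means driver-less
    bound: bool = False           # bound to iommu_fd
    ioasid: Optional[int] = None  # IOASID this device is attached to


@dataclass
class Group:
    devices: List[Device]
    domain: str = "default"  # "default", "block_dma" or "ioasid:<id>"

    def viable(self, binder: str = "vfio") -> bool:
        # 5a: every device is driver-less, bound to the binding driver,
        # or bound to an otherwise allowed driver.
        return all(d.driver is None or d.driver == binder
                   or d.driver in ALLOWED_DRIVERS for d in self.devices)

    def bind(self, dev: Device, binder: str = "vfio") -> bool:
        if not any(d.bound for d in self.devices):
            # 5a/5b: first bind checks viability, then blocks DMA group-wide.
            if not self.viable(binder):
                return False
            self.domain = "block_dma"
        dev.bound = True  # 6: later binds in the same group just succeed
        return True

    def attach(self, dev: Device, ioasid: int) -> bool:
        in_use = {d.ioasid for d in self.devices if d.ioasid is not None}
        # 7a: all attached devices in a group must share one IOASID.
        if not dev.bound or (in_use and in_use != {ioasid}):
            return False
        dev.ioasid = ioasid
        self.domain = f"ioasid:{ioasid}"  # 7b: whole group moves together
        return True

    def detach(self, dev: Device) -> None:
        dev.ioasid = None
        if all(d.ioasid is None for d in self.devices):
            self.domain = "block_dma"  # 7c: last detach re-blocks DMA

    def unbind(self, dev: Device) -> None:
        self.detach(dev)  # assumption: unbind implies detach
        dev.bound = False
        if not any(d.bound for d in self.devices):
            self.domain = "default"  # 9: last unbind leaves security context
```

Note that the group only returns to "default" when the last bound device
goes away, which is the invariant 8) relies on.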

[Stage-3]

It's still an open question whether we want to further allow devices within 
a group to be attached to different IOASIDs in case the source devices are 
reliably identifiable. This is a usage not supported by existing vfio and 
might not be worthwhile due to improved isolation over time.

When it's required, the iommu layer has to create sub-group objects and 
expose the sub-group topology to userspace. In the meantime, the iommu 
API will be extended to allow sub-group attach/detach operations. 

In this case, there is not much difference from the stage-2 flow. iommu_fd 
just needs to understand the sub-group topology when allowing a group of
devices to be attached to different IOASIDs. The key is still to enforce that 
the entire group is in iommu_fd managed security contexts (block_dma or 
ioasid) as long as one or more devices in the group are still bound to it.

Thanks
Kevin