Date:	Wed, 30 Nov 2011 23:20:37 -0700
From:	Alex Williamson <alex.williamson@...hat.com>
To:	Benjamin Herrenschmidt <benh@...nel.crashing.org>
Cc:	David Gibson <dwg@....ibm.com>, joerg.roedel@....com,
	dwmw2@...radead.org, iommu@...ts.linux-foundation.org,
	linux-kernel@...r.kernel.org, chrisw@...hat.com, agraf@...e.de,
	scottwood@...escale.com, B08248@...escale.com
Subject: Re: [PATCH 1/4] iommu: Add iommu_device_group callback and
 iommu_group sysfs entry

On Wed, 2011-11-30 at 20:23 +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2011-11-29 at 22:25 -0700, Alex Williamson wrote:
> 
> > Note that iommu drivers are registered per bus_type, so the unique pair
> > is {bus_type, groupid}, which seems sufficient for vfio.
> > 
> > > Don't forget that to keep sanity, we really want to expose the groups
> > > via sysfs (per-group dir with symlinks to the devices).
> > > 
> > > I'm working with Alexey on providing an in-kernel powerpc specific API
> > > to expose the PE stuff to whatever's going to interface to VFIO to
> > > create the groups, though we can eventually collapse that. The idea is
> > > that on non-PE capable bridges (old style), I would make a single group
> > > per host bridge.
> > 
> > If your non-PE capable bridges aren't actually providing isolation, they
> > should return -ENODEV for the group_device() callback, then vfio will
> > ignore them.
> 
> Why ignore them ? It's perfectly fine to consider everything below the
> host bridge as one group. There is isolation ... at the host bridge
> level.

If there is isolation, yes, report a group.  A host bridge by itself
does not imply DMA isolation on most platforms.  We could say devices
without isolation are all in the same group, but then imagine a system
with two PCI host bridges.  Host bridge A has an iommu providing B:D.F
isolation; host bridge B has no iommu and no isolation.  Clearly devices
behind A have groupids.  However if we assign a single groupid for all
the devices behind B, we subvert the isolation of devices behind A.  To
avoid that, devices that cannot be isolated should not have a groupid.
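
To illustrate (the names and the exact callback signature below are
hypothetical, not the code from this series), a driver whose hardware
provides no isolation for a device would simply decline to report a
groupid:

#include <linux/device.h>
#include <linux/pci.h>

/* Placeholder helpers -- stand-ins for whatever platform check applies. */
static bool example_bridge_provides_isolation(struct pci_dev *pdev);
static unsigned int example_compute_groupid(struct pci_dev *pdev);

/*
 * Illustrative device_group-style callback: report a groupid only when
 * the hardware actually isolates the device, otherwise return -ENODEV
 * so vfio ignores the device entirely.
 */
static int example_iommu_device_group(struct device *dev, unsigned int *groupid)
{
        struct pci_dev *pdev = to_pci_dev(dev);

        if (!example_bridge_provides_isolation(pdev))
                return -ENODEV;

        *groupid = example_compute_groupid(pdev);
        return 0;
}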

> Really groups should be a structure, not a magic number. We want to
> iterate them and their content, represent them via an API, etc... and so
> magic numbers mean that anything under the hood will have to constantly
> convert between that and some kind of internal data structure.

What would a group structure in core driver code contain?  Who manages
it?  Don't forget, most users don't have hardware capable of making use
of this and most users with the right hardware still aren't going to use
it.  Those group structures live in vfio and get allocated when vfio is
loaded.  VT-d and AMD-Vi can support this new groupid interface with
zero additional data structures floating around.
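
Purely for illustration (these structures are invented for the sketch,
not the actual vfio code), the vfio-side bookkeeping can be as small as
a lazily allocated record per {bus_type, groupid} pair:

#include <linux/device.h>
#include <linux/list.h>
#include <linux/mutex.h>

/*
 * Sketch of group bookkeeping kept entirely on the vfio side, keyed by
 * the {bus_type, groupid} pair.  Allocated when a device with that
 * groupid is first bound to the vfio bus driver.
 */
struct example_vfio_group {
        struct bus_type         *bus;
        unsigned int            groupid;
        struct list_head        device_list;    /* devices bound to the vfio bus driver */
        struct list_head        next;           /* entry on the global group list */
        bool                    in_use;         /* user holds iommu/device file descriptors */
};

static LIST_HEAD(example_group_list);
static DEFINE_MUTEX(example_group_lock);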

As Chris noted, the choice of an unsigned int to efficiently represent
groupids in PCI isn't just coincidence.  If you can't make use of the
same trick, surely it's not too difficult to enumerate your PEs and use
that as the groupid.
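
Concretely (the encoding below is an assumption for illustration, not
lifted from any driver), segment, bus and devfn already pack into 32
bits, so the groupid can simply be the B:D.F of whatever device or
bridge forms the isolation boundary:

#include <linux/pci.h>

/*
 * Illustration of the unsigned int trick: segment (16 bits), bus
 * (8 bits) and devfn (8 bits) pack into a single 32-bit groupid.
 */
static unsigned int example_pci_groupid(struct pci_dev *pdev)
{
        return (pci_domain_nr(pdev->bus) << 16) |
               (pdev->bus->number << 8) | pdev->devfn;
}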

> I also somewhat dislike the bus_type as the anchor to the grouping
> system, but that's not necessarily as bad an issue for us to deal with.
> 
> Eventually what will happen on my side is that I will have a powerpc
> "generic" (ie. accross platforms) that allow to enumerate groups and
> retrieve the dma windows associated with them etc...
> 
> That API will use underlying function pointers provided by the PCI host
> bridge (for which we do have a data structure, struct pci_controller,
> like many other archs except I think x86 :-)
> 
> Any host platform that doesn't provide those pointers (ie. all of them
> initially) will get a default behaviour which is to group everything
> below a host bridge (since host bridges still have independent iommu
> windows, at least for us they all do). 
> 
> On top of that we can implement a "backend" that provides those pointers
> for the p7ioc bridge used on the powernv platform, which will expose
> more fine grained groups based on our "partitionable endpoint"
> mechanism.
> 
> The grouping will have been decided early at boot time based on a mix of
> HW resources and bus topology, plus things like whether there is a PCI-X
> bridge etc... and will be initially immutable.
> 
> Ideally, we need to expose a subset of this API as a "generic" interface
> to allow generic code to iterate the groups and their content, and to
> construct the appropriate sysfs representation.
> 
> > > In addition, Alex, I noticed that you still have the domain stuff there,
> > > which is fine I suppose, we could make it a requirement on power that
> > > you only put a single group in a domain... but the API is still to put
> > > individual devices in a domain, not groups, and that somewhat sucks.
> > > 
> > > You could "fix" that by having some kind of ->domain_enable() or
> > > whatever that's used to "activate" the domain and verifies that it
> > > contains entire groups but that looks like a pointless way to complicate
> > > both the API and the implementation.
> > 
> > Right, groups are currently just a way to identify dependent sets, not a
> > unit of work.  We can also have group membership change dynamically
> > (hotplug slot behind a PCIe-to-PCI bridge), so there are cases where we
> > might need to formally attach/detach a group element to a domain at some
> > later point.  This really hasn't felt like a stumbling point for vfio,
> > at least on x86.  Thanks,
> 
> It doesn't matter much as long as we have a way to know that a group is
> "complete", ie that all devices of a group have been taken over by vfio
> and put into a domain, and block them from being lost. Only then can we
> actually "use" the group and start reconfiguring the iommu etc... for
> use by the guest.

This is done.  I call a group "viable" when all of the devices are bound
to their vfio bus driver and ready for use.  The API will not allow a
user access to the iommu or device file descriptors unless the group is
viable.  Unbinding a device from the vfio bus driver will block as long
as the group is in use (need to decide if we enable a system policy to
let vfio kill the process to release a device).  I'm also working on
code that proactively attaches the vfio bus driver to a hot-added device
if it's being added to an in-use group.
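
A minimal sketch of that viability test, reusing the hypothetical
bookkeeping from the earlier sketch (again, not the real vfio
internals):

#include <linux/device.h>
#include <linux/list.h>

/* Per-device record on a group's device_list; hypothetical. */
struct example_vfio_device {
        struct device           *dev;
        bool                    bound_to_vfio;  /* set by the vfio bus driver's probe */
        struct list_head        next;
};

/* A group is viable only once every device in it is bound to vfio. */
static bool example_group_viable(struct example_vfio_group *group)
{
        struct example_vfio_device *vdev;

        list_for_each_entry(vdev, &group->device_list, next)
                if (!vdev->bound_to_vfio)
                        return false;

        return true;
}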

> Note that groups -will- contain bridges eventually. We need to take that
> into account since bridges -usually- don't have an ordinary driver
> attached to them so there may be issues there with tracking whether they
> are taken over by vfio...

I look forward to patches for this ;)  Thanks,

Alex

