Date:	Thu, 8 Dec 2011 13:43:57 +1100
From:	David Gibson <dwg@....ibm.com>
To:	Alex Williamson <alex.williamson@...hat.com>
Cc:	joerg.roedel@....com, dwmw2@...radead.org,
	iommu@...ts.linux-foundation.org, aik@....ibm.com,
	linux-kernel@...r.kernel.org, chrisw@...hat.com, agraf@...e.de,
	scottwood@...escale.com, B08248@...escale.com,
	benh@...nel.crashing.org
Subject: Re: RFC: Device isolation infrastructure

On Wed, Dec 07, 2011 at 12:45:20PM -0700, Alex Williamson wrote:
> On Thu, 2011-12-08 at 01:22 +1100, David Gibson wrote:
> > On Tue, Dec 06, 2011 at 11:46:42PM -0700, Alex Williamson wrote:
> > > On Wed, 2011-12-07 at 14:58 +1100, David Gibson wrote:
> > > > 
> > > > Alex Williamson recently posted some patches adding a new hook to
> > > > iommu_ops to identify groups of devices which cannot reliably be
> > > > distinguished by the iommu, and therefore can only safely be assigned
> > > > as a unit to a guest or userspace driver.
> > > > 
> > > > I'm not very fond of that design, because:
> > > >     - It requires something else to scan the group ids to present the
> > > > groups in a userspace discoverable manner, rather than handling that
> > > > in the core.
> > > 
> > > The thought was that the vast majority of users won't make use of
> > > groups, so why bloat the core when the only user is vfio.  
> > 
> > Well, "won't make use of" actually means "are on a system with
> > all-singleton groups".  And the answer is because it actually doesn't
> > bloat, but simplifies the core.  We need the group concept _at
> > minimum_ to handle bridges and buggy multifunction devices.  So it
> > makes life easier to assume they're always there, even if they're
> > often singletons.
> 
> Ok, let's back up from the patch because I think it's clouding the driver
> model with a detached device state that might be detracting from the
> useful bits.

Ok.  I'll start out by saying that I think there's some confusion in
the below because you're tending to identify an isolation group with
an iommu domain, which is not my intention.  An isolation group is the
minimum granularity of an iommu domain, of course, but it's absolutely
my intention that you could build an iommu domain with multiple groups.
A group is a unit that *can* be isolated, not a unit which *must* be
isolated.

> An isolation group is a generic association of devices into a set based
> on, well... isolation.  Presumably something like an iommu driver does a
> bus_register_notifier() and bus_for_each_dev() and when a new device is
> added we do isolation_group_new() and isolation_group_dev_add() as
> necessary.

Uh, roughly speaking, yes.
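
Purely as a sketch of what I have in mind (the isolation_group_*()
names are the ones from this discussion, iommu_find_or_alloc_group()
is a made-up helper, and the PCI details are hand-waved):

#include <linux/device.h>
#include <linux/init.h>
#include <linux/notifier.h>
#include <linux/pci.h>

struct isolation_group;         /* the object this thread is about */

/* Hypothetical helpers, named for illustration only. */
struct isolation_group *iommu_find_or_alloc_group(struct device *dev);
int isolation_group_dev_add(struct isolation_group *group,
                            struct device *dev);

static int isolation_bus_notify(struct notifier_block *nb,
                                unsigned long action, void *data)
{
        struct device *dev = data;
        struct isolation_group *group;

        if (action != BUS_NOTIFY_ADD_DEVICE)
                return NOTIFY_DONE;

        /*
         * Ask the iommu layer which group, if any, this device falls
         * into.  A device the iommu can't reliably isolate gets no
         * group at all.
         */
        group = iommu_find_or_alloc_group(dev);
        if (group)
                isolation_group_dev_add(group, dev);

        return NOTIFY_OK;
}

static struct notifier_block isolation_nb = {
        .notifier_call = isolation_bus_notify,
};

static int __init isolation_pci_init(void)
{
        /*
         * Catches hotplugged devices; devices already present would be
         * walked with bus_for_each_dev() as you describe.
         */
        return bus_register_notifier(&pci_bus_type, &isolation_nb);
}
subsys_initcall(isolation_pci_init);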

>  When the iommu driver needs to do a dma_ops call for a
> device, it can traverse up to the isolation group and find something
> that points it to the shared iommu context for that group.  While
> dma_ops can share the iommu context among the drivers attached to the
> various devices, iommu_ops wants a single meta-driver for the group
> because it's predominantly designed for use when devices are exposed
> directly to the user.  We trust native kernel drivers to share context
> between each other, we do not trust user level drivers to share context
> with anyone but themselves.  Ok so far?

Not so much.  As Ben says, getting the existing kernel dma_ops and so
forth to use the isolation structure is very much a "maybe one day"
thing.

> The first question we run into is what does the iommu driver do if it
> does not provide isolation?  If it's a translation-only iommu, maybe it
> has a single context and can manage to maintain it internally and should
> opt-out of isolation groups.  But we also have things like MSI isolation
> where we have multiple contexts and the isolation strength is sufficient
> for dma_ops, but not iommu_ops.  I don't know if we want to make that
> determination via a strength bitmask at the group level or simply a bool
> that the iommu driver sets when it registers a group.

If the device can't be safely isolated, then it should have no
isolation group.  The definition of "safely" might vary depending on
platform and/or kernel parameters.
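
Just to illustrate how "safely" could be a policy knob rather than a
fixed property (everything here, including the parameter name and the
iommu_device_isolated() query, is invented for the example):

#include <linux/device.h>
#include <linux/moduleparam.h>

static bool allow_unsafe_isolation;     /* e.g. isolation.allow_unsafe=1 */
module_param_named(allow_unsafe, allow_unsafe_isolation, bool, 0444);

bool iommu_device_isolated(struct device *dev); /* invented query */

/*
 * Only give a device an isolation group if the platform iommu claims
 * full (DMA and MSI) isolation for it, unless the admin has explicitly
 * relaxed the policy on the command line.
 */
static bool device_is_isolatable(struct device *dev)
{
        return iommu_device_isolated(dev) || allow_unsafe_isolation;
}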

> Once we store iommu context at the group, it makes sense to start using
> it as the primary object the iommu drivers operate on.  There's no need
> to break everyone for dma_ops, so isolation aware iommu drivers can
> simply follow the link from the device to the group with no change to
> callers.  iommu_ops though can switch to being entirely group based [not
> looking forward to retrofitting groups into kvm device assignment].

Again, maybe one day.  We don't need to attack that in order to handle
the vfio problem.

> So the next problem is that while the group is the minimum granularity
> for the iommu, it's not necessarily the desired granularity.  iommus
> like VT-d have per PCI BDF context entries that can point to shared page
> tables.  On such systems we also typically have singleton isolation
> groups, so when multiple devices are used by a single user, we have a
> lot of duplication in time and space.  VFIO handles this by allowing
> groups to be "merged".  When this happens, the merged groups point to
> the same iommu context.  I'm not sure what the plan is with isolation
> groups, but we need some way to reduce that overhead.

Right.  So, again, I intend that multiple groups can go into one
domain.  Not entirely sure of the interface yet.  One I had in mind
was to borrow the vfio1 interface, so you open a /dev/vfio (each open
gives a new instance).  Then you do an "addgroup" ioctl which adds a
group to the domain.  You can do that multiple times, then start using
the domain.
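
From userspace, something like the sketch below is the shape I mean
(the ioctl name and numbers are invented here, not from any existing
vfio code; the point is just one fd per domain instance and N
"addgroup" calls before you start using it):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Invented for illustration; real values would come from a uapi header. */
#define VFIO_DOMAIN_ADD_GROUP   _IOW('I', 100, char *)

static int build_domain(const char *group_a, const char *group_b)
{
        int domain = open("/dev/vfio", O_RDWR); /* each open: new domain */

        if (domain < 0)
                return -1;

        /*
         * Pull two isolation groups into the same iommu domain, so on
         * hardware like VT-d they can share one set of page tables.
         */
        if (ioctl(domain, VFIO_DOMAIN_ADD_GROUP, group_a) < 0 ||
            ioctl(domain, VFIO_DOMAIN_ADD_GROUP, group_b) < 0) {
                close(domain);
                return -1;
        }

        /* ... map memory, bind the owner, start using the domain ... */
        return domain;
}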

> Finally, we get to how do we manage the devices of the group when we
> switch to iommu_ops mode vs dma_ops, which is what this patch seems to
> have started with.  For VFIO we have bus specific drivers which attach
> to individual devices.  When all the devices in the group are attached
> to the bus driver (by the user), iommu_ops, and therefore access to
> devices from userspace becomes available.
> 
> This is the level that the vast majority of users are not going to make
> use of, so while the above does feel cleaner and justifies its
> addition, we need to be careful here.
> 
> My concern with the "detached state" approach is that we're removing
> devices from the existing driver model.  Who does things like pm_runtime
> callbacks when we start using devices?

That's a point.  Ben suggests an approach that I'll comment on there.

>  How does a driver enumerate and
> attach to devices within the group?

Well, detaching prevents driver binding.  All the struct devices are
still there and can be examined as usual.  How the group's owner
manages access to individual devices within the group is up to it.
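
As a very rough sketch of what detach amounts to (the structure
layout, the locking and the 'detached' flag that later bind attempts
would check are all illustrative, not lifted from the patch):

#include <linux/device.h>
#include <linux/list.h>
#include <linux/mutex.h>

struct isolation_dev {
        struct device *dev;
        struct list_head group_list;
};

struct isolation_group {
        struct mutex lock;
        struct list_head devices;       /* of struct isolation_dev */
        bool detached;
};

static void isolation_group_detach_devices(struct isolation_group *group)
{
        struct isolation_dev *idev;

        mutex_lock(&group->lock);
        group->detached = true;         /* consulted before any new bind */

        /*
         * Unbind host drivers.  The struct devices themselves remain
         * registered and visible in sysfs, as noted above.
         */
        list_for_each_entry(idev, &group->devices, group_list)
                device_release_driver(idev->dev);

        mutex_unlock(&group->lock);
}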

>  How do we even enumerate and attach
> to isolation groups for the group level metadriver?  This is where I'm
> nervous we're just making a requirement for a group-driver model that
> lives alongside the existing device driver model.

Well, sort of.  But the "group driver model" is vastly simpler than
the full driver model, because the assumption is that binding a group
owner to a group is *always* the result of an explicit user request
specifying both owner and group.  There's no matching, no drivers of
different bus types or the other things that make the normal driver
model complicated.
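
In other words the whole "group driver model" could be as small as
something like this (the names and the ops structure are illustrative
only):

/*
 * A group owner (vfio, in the near term) supplies a couple of
 * callbacks; binding is always an explicit call naming both the owner
 * and the group, never the result of bus/driver matching.
 */
struct isolation_ops {
        int  (*attach)(struct isolation_group *group, void *priv);
        void (*detach)(struct isolation_group *group, void *priv);
};

int isolation_group_bind(struct isolation_group *group,
                         const struct isolation_ops *owner, void *priv);
int isolation_group_unbind(struct isolation_group *group);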

> I actually like the approach that individual devices are bound by the
> user to vfio bus drivers (group/iommu_ops drivers).  This is tedious,
> but easily automate-able in userspace and puts the responsibility of
> unbinding each device on the user.  Imagine the chaos a user could cause
> if vfio creates a chardev for accessing a group, we give a user access
> to it, and each time it's opened all the group devices automatically get
> unbound from the host drivers and re-bound on close.  'while (true); do
> cat /dev/vfio/$GROUP; done' could cause a denial of service on the
> system.

Well, no, because regardless of the interface, detaching the devices
would need equivalent permission.

> We just need to figure out how binding vfio bus drivers to all the
> isolation group devices turns the robot lions into Voltron ;)
> 
> > > >     - It does not provide an obvious means to prevent a kernel
> > > > driver unsafely binding to a new device which is hotplugged into an
> > > > iommu group already assigned to a guest.
> > > 
> > > I've added this for PCI, triggered via the device add notifier.
> > 
> > Hrm.  Does the notifier run early enough to guarantee that a driver
> > can't bind before it's complete?
> 
> Yes, like I assume you're planning to do with setting up groups, it runs
> from the device add bus notifier.  Thanks,
> 
> Alex
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

