linux-kernel - Re: [PATCH 1/2] Isolation groups

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1332364378.13855.236.camel@bling.home>
Date:	Wed, 21 Mar 2012 15:12:58 -0600
From:	Alex Williamson <alex.williamson@...hat.com>
To:	David Gibson <david@...son.dropbear.id.au>
Cc:	aik@...abs.ru, dwmw2@...radead.org,
	iommu@...ts.linux-foundation.org, benh@...nel.crashing.org,
	qemu-devel@...gnu.org, joerg.roedel@....com, kvm@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 1/2] Isolation groups

On Sat, 2012-03-17 at 15:57 +1100, David Gibson wrote:
> On Fri, Mar 16, 2012 at 01:31:18PM -0600, Alex Williamson wrote:
> > On Fri, 2012-03-16 at 14:45 +1100, David Gibson wrote:
> > > On Thu, Mar 15, 2012 at 02:15:01PM -0600, Alex Williamson wrote:
> > > > On Wed, 2012-03-14 at 20:58 +1100, David Gibson wrote:
> > > > > On Tue, Mar 13, 2012 at 10:49:47AM -0600, Alex Williamson wrote:
> > > > > > On Wed, 2012-03-14 at 01:33 +1100, David Gibson wrote:
> > > > > > > On Mon, Mar 12, 2012 at 04:32:54PM -0600, Alex Williamson wrote:
> > > > > > > > +/*
> > > > > > > > + * Add a device to an isolation group.  Isolation groups start empty and
> > > > > > > > + * must be told about the devices they contain.  Expect this to be called
> > > > > > > > + * from isolation group providers via notifier.
> > > > > > > > + */
> > > > > > > 
> > > > > > > Doesn't necessarily have to be from a notifier, particularly if the
> > > > > > > provider is integrated into host bridge code.
> > > > > > 
> > > > > > Sure, a provider could do this on it's own if it wants.  This just
> > > > > > provides some infrastructure for a common path.  Also note that this
> > > > > > helps to eliminate all the #ifdef CONFIG_ISOLATION in the provider.  Yet
> > > > > > to be seen whether that can reasonably be the case once isolation groups
> > > > > > are added to streaming DMA paths.
> > > > > 
> > > > > Right, but other than the #ifdef safety, which could be achieved more
> > > > > simply, I'm not seeing what benefit the infrastructure provides over
> > > > > directly calling the bus notifier function.  The infrastructure groups
> > > > > the notifiers by bus type internally, but AFAICT exactly one bus
> > > > > notifier call would become exactly one isolation notifier call, and
> > > > > the notifier callback itself would be almost identical.
> > > > 
> > > > I guess I don't see this as a fundamental design point of the proposal,
> > > > it's just a convenient way to initialize groups as a side-band addition
> > > > until isolation groups become a more fundamental part of the iommu
> > > > infrastructure.  If you want to do that level of integration in your
> > > > provider, do it and make the callbacks w/o the notifier.  If nobody ends
> > > > up using them, we'll remove them.  Maybe it will just end up being a
> > > > bootstrap.  In the typical case, yes, one bus notifier is one isolation
> > > > notifier.  It does however also allow one bus notifier to become
> > > > multiple isolation notifiers, and includes some filtering that would
> > > > just be duplicated if every provider decided to implement their own bus
> > > > notifier.
> > > 
> > > Uh.. I didn't notice any filtering?  That's why I'm asking.
> > 
> > Not much, but a little:
> > 
> > +       switch (action) {
> > +       case BUS_NOTIFY_ADD_DEVICE:
> > +               if (!dev->isolation_group)
> > +                       blocking_notifier_call_chain(&notifier->notifier,
> > +                                       ISOLATION_NOTIFY_ADD_DEVICE, dev);
> > +               break;
> > +       case BUS_NOTIFY_DEL_DEVICE:
> > +               if (dev->isolation_group)
> > +                       blocking_notifier_call_chain(&notifier->notifier,
> > +                                       ISOLATION_NOTIFY_DEL_DEVICE, dev);
> > +               break;
> > +       }
> 
> 
> Ah, I see, fair enough.
> 
> A couple of tangential observations.  First, I suspect using
> BUS_NOTIFY_DEL_DEVICE is a very roundabout way of handling hot-unplug,
> it might be better to have an unplug callback in the group instead.

There's really one already.  Assuming the device is attached to a group
driver, the .remove entry point on the driver will get called first.  It
doesn't get to return error, but it can block.  The DEL_DEVICE will only
happen after the group driver has relinquished the device, or at least
returned from remove().  For a device with no driver, the group would
only learn about it through the notifier, but it probably doesn't need
anything more direct.

> Second, I don't think aborting the call chain early for hot-plug is
> actually a good idea.  I can't see a clear guarantee on the order, so
> individual providers couldn't rely on that short-cut behaviour.  Which
> means that if two providers would have attempted to claim the same
> device, something is seriously wrong and we should probably report
> that.

Yeah, that seems like a reasonable safety check.

> > ...
> > > > > > > So, somewhere, I think we need a fallback path, but I'm not sure
> > > > > > > exactly where.  If an isolation provider doesn't explicitly put a
> > > > > > > device into a group, the device should go into the group of its parent
> > > > > > > bridge.  This covers the case of a bus with IOMMU which has below it a
> > > > > > > bridge to a different type of DMA capable bus (which the IOMMU isn't
> > > > > > > explicitly aware of).  DMAs from devices on the subordinate bus can be
> > > > > > > translated by the top-level IOMMU (assuming it sees them as coming
> > > > > > > from the bridge), but they all need to be treated as one group.
> > > > > > 
> > > > > > Why would the top level IOMMU provider not set the isolation group in
> > > > > > this case.
> > > > > 
> > > > > Because it knows nothing about the subordinate bus.  For example
> > > > > imagine a VT-d system, with a wacky PCI card into which you plug some
> > > > > other type of DMA capable device.  The PCI card is acting as a bridge
> > > > > from PCI to this, let's call it FooBus.  Obviously the VT-d code won't
> > > > > have a FooBus notifier, since it knows nothing about FooBus.  But the
> > > > > FooBus devices still need to end up in the group of the PCI bridge
> > > > > device, since their DMA operations will appear as coming from the PCI
> > > > > bridge card to the IOMMU, and can be isolated from the rest of the
> > > > > system (but not each other) on that basis.
> > > > 
> > > > I guess I was imagining that it's ok to have devices without an
> > > > isolation group.
> > > 
> > > It is, but having NULL isolation group has a pretty specific meaning -
> > > it means it's never safe to give that device to userspace, but it also
> > > means that normal kernel driver operation of that device must not
> > > interfere with anything in any iso group (otherwise we can never no
> > > that those iso groups _are_ safe to hand out).  Likewise userspace
> > > operation of any isolation group can't mess with no-group devices.
> > 
> > This is where wanting to use isolation groups as the working unit for an
> > iommu ops layer and also wanting to use iommu ops to replace dma ops
> > seem to collide a bit.  Do we want two interfaces for dma, one group
> > based and one for non-isolated devices?
> 
> Well, I don't know enough about what dwmw2 had planned to answer
> that.  I was assuming the in-kernel dma reply would remain device/bus
> focussed, and how it looked up and used the groups was an internal
> matter.  I would expect it would probably need a fallback path for "no
> group" for transition purposes, even if it wasn't planned to keep that
> forever.

Hmm, I think the value of a core isolation group layer starts to fall
apart if an iommu driver can't count on a group being present for any
device that might do dma.  Some ideas below.

> >  Isolation providers like
> > intel-iommu would always use one, non-isolation capable dma paths, like
> > swiotlb or non-isolation hardware iommus, would use another.  And how do
> > we factor in the degree of isolation to avoid imposing policy in the
> > kernel?  MSI isolation is an example.  We should allow userspace to set
> > a policy of whether lack of MSI protection is an acceptable risk.  Does
> > that mean we can have isolation groups that provide no isolation and
> > sysfs files indicating capabilities?  Perhaps certain classes of groups
> > don't even allow manager binding?
> 
> Ugh.  I thought about flagging groups with some sort of bitmap of
> isolation strength properties, but I think that way lies madness.  We
> might want a global policy bitmask though which expresses which
> constraints are acceptable risks.  The providers would then have to
> provide groups which are as strong as requested, or none at all.

That's effectively how the current iommu_device_group() interface works;
identify isolate-able groups, or none at all.  I don't think we get that
luxury if we push isolation groups down into the device layer though.
If we want to use them for dma_ops and to solve various device quirks,
they can't only be present on some configurations.  I think you're right
on the global policy though, we just need to apply it differently.  I'm
thinking we need something like "isolation_allow=" allowing options of
"nomsi" and "nop2p" (just for starters).  When a provider creates a
group, they provide a set of flags for the group.  A manager is not
allowed to bind to the group if the global policy doesn't match the
group flags.  We'll need some support functions, maybe even a sysfs
file, so userspace can know in advance and vfio can avoid creating
entries for unusable groups.

> > > None of these conditions is true for the hypothetical Foobus case.
> > > The bus as a whole could be safely given to userspace, the devices on
> > > it *could* mess with an existing isolation group (suppose the group
> > > consisted of a PCI segment with the FooBus bridge plus another PCI
> > > card - FooBus DMAs would be bridged onto the PCI segment and might
> > > target the other card's MMIO space).  And other grouped devices can
> > > certainly mess with the FooBus devices (if userspace grabs the bridge
> > > and manipulates its IOMMU mappings, that would clearly screw up any
> > > kernel drivers doing DMA from FooBus devices behind it).
> > 
> > ---segment-----+---------------+
> >                |               |
> >         [whackyPCIcard]   [other device]
> >                |
> >                +---FooBus---+-------+
> >                             |       |
> >                        [FooDev1] [FooDev2]
> > 
> > This is another example of the quality of the isolation group and what
> > factors we incorporate in judging that.  If the bridge/switch port
> > generating the segment does not support or enable PCI ACS then the IOMMU
> > may be able to differentiate whackyPCIcard from the other device but not
> > prevent peer-to-peer traffic between the two (or between FooDevs and the
> > other device - same difference).  This would suggest that we might have
> > an isolation group at the root of the segment for which the provider can
> > guarantee isolation (includes everything on and below the bus), and we
> > might also have isolation groups at whackyPCIcard and the other device
> > that have a difference quality of isolation.  /me pauses for rant about
> > superior hardware... ;)
> 
> Nested isolation groups? Please no.
> 
> An isolation group at the bridge encompassing that segment was exactly
> what I had in mind, but my point is that all the FooBus devices *also*
> need to be in that group, even though the isolation provider knows
> nothing about FooBus.

If we're able to strictly say "no isolation = no group" and ignore
FooBus for a moment, yes, the group at the bridge makes sense.  But as I
said, I don't think we get that luxury.  There will be different
"acceptable" levels of isolation and we'll need to support both the
model of single group encompassing the segment as well as separate
isolation groups for each device, whackyPCICard/other.

A question then is how do we support both?  Perhaps it's a provider
option whether to only create fully isolated group, which makes it
traverse up to the segment.  The default might be to make separate
groups at whackyPCICard and other, each with the nop2p flag.  It gets a
lot more complicated, but maybe we support both so if system policy
prevents a manager from binding to a sub-group due to nop2p, it can walk
up a level.  I suspect dma_ops always wants the closer group to avoid
traversing levels.

Back to FooBus, dma_ops only knows how to provide dma for certain types
of devices.  intel-iommu only knows about struct pci_dev.  So if FooDev
needed to do dma, it would need some kind of intermediary to do the
mapping via whackyPCICard.  So from that perspective, we could get away
w/o including FooBus devices in the group.  Of course if we want a
manager to attach to the group at whackyPCICard (ignoring nop2p), we
need to put requirements on the state of the FooDevs.  I think that
might actually give some credibility to nested groups, hierarchical
really, ie if we need to walk all the devices below a group, why not
allow groups below a group?  I don't like it either, but I'm not sure
it's avoidable.

> > > Oh.. and I just thought of an existing-in-the-field case with the same
> > > problem.  I'm pretty sure for devices that can appear on PCI but also
> > > have "dumb-bus" versions used on embedded systems, at least some of
> > > the kernel drivers are structured so that there is a separate struct
> > > device for the PCI "wrapper" and the device inside.  If the inner
> > > driver is initiating DMA, as it usually would, it has to be in the
> > > same iso group as it's PCI device parent.
> > 
> > I don't know that we can ever generically identify such devices, but
> > maybe this introduces the idea that we need to be able to blacklist
> > certain devices as multi-homed and tainting any isolation group that
> > contains them.
> 
> Sorry, I was unclear.  These are not multi-homed devices, just
> multiple-variant devices, any given instance will either be PCI or
> not.  But to handle the two variants easily, the drivers have nested
> struct devices.  It's essentially a pure software artifact we're
> dealing with here, but nonetheless we need to get the group-ownership
> of those struct devices correct, whether they're synthetic or not.

So long as the ADD_DEVICE happens with the currently running variant,
which seems like it has to be the case, I'm not sure how this matters.
I'm trying to be very careful to only use struct device for groups.

> > > >  When that happens we can traverse up the bus to find a
> > > > higher level isolation group.
> > > 
> > > Well, that's one possible answer to my "where should the hook be
> > > question": rather than an initialization hook, when we look up a
> > > device's isolation group, if it doesn't say one explicitly, we try
> > > it's bridge parent and so forth up the tree.  I wonder about the
> > > overhead of having to walk all the way up the bus heirarchy before
> > > returning NULL whenever we ask about the group of a device that
> > > doesn't have one.
> > 
> > Yep, that could definitely be a concern for streaming dma.
> 
> Right, which is why I'm suggesting handling at init time instead.
> We'd need something that runs in the generic hot-add path, after bus
> notifiers, which would set the device's group to the same as its
> parent's if a provider hasn't already given it one.

I don't think that works.  Back to the FooBus example, if the isolation
group becomes a fundamental part of the dma_ops path, we may end up with
groups at each FooDev (or maybe one for FooBus) generated by the
intermediary that sets up dma using whackyPCICard.

> > > >  It would probably make sense to have some
> > > > kind of top-most isolation group that contains everything as it's an
> > > > easy proof that if you own everything, you're isolated.
> > > 
> > > Hrm, no.  Things that have no IOMMU above them will have ungated
> > > access to system RAM, and so can never be safely isolated for
> > > userspace purposes, even if userspace owned every _device_ in the
> > > system (not that you could do that in practice, anyway).
> > 
> > RAM is a good point, there are "non-devices" to worry about.
> > Potentially that top level group doesn't allow managers.
> 
> I don't really see the point of a "top-level" group.  Groups are
> exclusive, not heirarchical, and I don't see that there are any paths
> really simplified by having a (not manageable) "leftovers" group.

Yeah, a top level group may not make sense, but if we plan for dma_ops,
we need non-manageable groups, which I think is just the degenerate case
of isolation quality.

> > > >  Potentially
> > > > though, wackyPCIcard can also register isolation groups for each of it's
> > > > FooBus devices if it's able to provide that capability.  Thanks,
> > > 
> > > It could, but why should it.  It doesn't know anything about IOMMUs or
> > > isolation, and it doesn't need to.  Even less so for PCI devices which
> > > create subordinate non-PCI struct devices for internal reasons.
> > 
> > Sorry I wasn't clear.  If wackyPCIcard happens to include an onboard
> > IOMMU of some sort that's capable of isolation, it might be able to
> > register groups for each of the devices below it.  Therefore we could
> > end up with a scenario like above that the segment may not have ACS and
> > therefore not be able to isolate wackyPCIcard from otherdevice, but
> > wackyPCIcard can isolate FooDev1 from FooDev2 and otherdevice.  I think
> > that means we potentially need to check all devices downstream of an
> > isolation group to allow a manager to lock it, as well as checking the
> > path upstream to make sure it isn't used above...  Are you sure we're
> > ready for this kind of infrastructure?  Thanks,
> 
> Hrm, at this stage I think I'm prepared to assume that IOMMUs are
> either strictly independent (e.g. covering different PCI domains) or
> are at least vaguely aware of each other (i.e. platform conventions
> cover the combination).  Truly nested IOMMUs, that are entirely
> unaware of each other is a problem for another day (by which I mean
> probably never).

Ok, but even if there's no iommu onboard whackyPCICard, it can still be
possible to create groups on FooBus just to facilitate dma_ops and
handle quirks.

> That said, again, groups have to be exclusive, not heirarchical, so
> in the above case, I think we'd have a few possible ways to go:

Ownership absolutely has to be exclusive, but it looks like groups
themselves are hierarchical.

> 	1) The FooBus isolation provider could refuse to initialize if
> the FooBus bridge is sharing an iso group with something else before
> it probes the FooBus.

Sharing, as in a parent group is managed?  Yes, something special needs
to happen in that case, I'd probably investigate having it initialized
in a managed state versus not initialized at all.

> 	2) The FooBus isolation provider could be in the FooBus bridge
> driver - meaning that if someone tried to grab the PCI group including
> the FooBridge, the provider driver would be booted out, causing the
> FooBus groups to cease to exist (meaning the PCI group manager would
> never get lock while someone was already using the FooGroups).  In

Right, I don't know that they need to cease to exist, but if a manager
is attached anywhere above or below the group it wants to lock, it has
to be blocked.

> this case, it gets a bit complex.  When the FooBus isolation provider
> is active, the FooBus devices would be in their own groups, not the
> group of the FooBridge and its sibling.  When the FooBus isolation
> provider is removed, it would have to configure the FooBus IOMMU to a
> passthrough mode, and revert the FooBus devices to the parent's
> group.  Hm.  Complicated.

Yep.  I think we're arriving at the same point.  Groups are
hierarchical, but ownership via a manager cannot be nested.  So to
manage a group, we need to walk the entire tree of devices below each
device checking that none of the groups are managed and all the devices
are using the right driver, then walk up from the group to verify no
group of a parent device is managed.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/