Message-Id: <201012211148.43941.pugs@lyon-about.com>
Date: Tue, 21 Dec 2010 11:48:43 -0800
From: Tom Lyon <pugs@...n-about.com>
To: Benjamin Herrenschmidt <benh@...nel.crashing.org>
Cc: linux-pci@...r.kernel.org, mbranton@...il.com,
alexey.zaytsev@...il.com, jbarnes@...tuousgeek.org,
linux-kernel@...r.kernel.org, kvm@...r.kernel.org,
randy.dunlap@...cle.com, arnd@...db.de, joro@...tes.org,
hjk@...utronix.de, avi@...hat.com, gregkh@...e.de,
chrisw@...s-sol.org, alex.williamson@...hat.com, mst@...hat.com
Subject: Re: [ANNOUNCE] VFIO V6 & public VFIO repositories
On Monday, December 20, 2010 09:37:33 pm Benjamin Herrenschmidt wrote:
> Hi Tom, I just wrote this to linux-pci in reply to your VFIO announce,
> but your email bounced. Alex gave me your IEEE one instead, so I'm sending
> this copy to you; please feel free to reply on the list!
>
> Cheers,
> Ben.
>
> On Tue, 2010-12-21 at 16:29 +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2010-11-22 at 15:21 -0800, Tom Lyon wrote:
> > > VFIO "driver" development has moved to a publicly accessible
> > > repository on GitHub:
> > >
> > > git://github.com/pugs/vfio-linux-2.6.git
> > >
> > > This is a clone of the Linux-2.6 tree with all VFIO changes on the vfio
> > > branch (which is the default). There is a tag 'vfio-v6' marking the
> > > latest "release" of VFIO.
> > >
> > > In addition, I am open-sourcing my user-level code which uses VFIO.
> > > It is a simple UDP/IP/Ethernet stack supporting 3 different VFIO-based
> > > hardware drivers. This code is available at:
> > >
> > > git://github.com/pugs/vfio-user-level-drivers.git
> >
> > So I do have some concerns about this...
> >
> > So first, before I go into the meat of my issues, let's just drop a
> > quick one about the interface: why netlink? I find it horrible
> > myself... it just confuses everything and adds overhead. ioctls would
> > have been a better choice IMHO.
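
For illustration only, an ioctl-style equivalent of a DMA map/unmap request
might look like the sketch below; the structure, magic number, and ioctl names
are invented for this example and are not part of the VFIO V6 interface.

/* Hypothetical ioctl-based DMA map/unmap interface, for illustration
 * only -- these names are not part of VFIO V6. */
#include <linux/ioctl.h>
#include <stdint.h>

struct dma_map_req {
	uint64_t vaddr;		/* user virtual address of the buffer */
	uint64_t dmaaddr;	/* bus/DMA address the device will use */
	uint64_t size;		/* length of the mapping in bytes */
	uint32_t flags;		/* e.g. read/write permission bits */
};

/* The 'V' magic and command numbers are arbitrary in this sketch. */
#define EX_DMA_MAP	_IOW('V', 100, struct dma_map_req)
#define EX_DMA_UNMAP	_IOW('V', 101, struct dma_map_req)
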
> >
> > Now, on to my actual issues, which in fact extend to the whole "generic"
> > iommu APIs that have been added to drivers/pci for "domains", and which
> > in turn "stain" VFIO in ways that I'm not sure I can use on POWER...
> >
> > I would appreciate your input on what you think is the best way for me
> > to solve some of these "mismatches" between our HW and this design.
> >
> > Basically, the whole iommu domain stuff has been entirely designed
> > around the idea that you can create those "domains" which are each an
> > entire address space, and put devices in there.
> >
> > This is sadly not how the IBM iommus work on POWER today...
> >
> > I currently have one "shared" DMA address space (per host bridge), but I
> > can assign regions of it to different devices (and I have limited
> > filtering capabilities, so basically a bus per region, a device per
> > region, or a function per region).
> >
> > That means essentially that I cannot just create a mapping for the DMA
> > addresses I want, but instead need to have some kind of "allocator" for
> > DMA translations (which we have in the kernel, i.e., dma_map/unmap use a
> > bitmap allocator).
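
For illustration, a minimal user-space sketch of that kind of bitmap allocator
over one fixed DMA window follows; the window base, size, and all names here
are made up for the example.

/* Illustration only: a tiny bitmap allocator for a fixed DMA window,
 * assuming the window is divided into fixed-size IOMMU pages. */
#include <stdint.h>

#define WINDOW_BASE	0x80000000ULL			/* start of the DMA window */
#define IOMMU_PAGE	4096ULL				/* translation granule */
#define WINDOW_PAGES	((128ULL << 20) / IOMMU_PAGE)	/* e.g. a 128M window */

static uint8_t window_map[WINDOW_PAGES / 8];

/* Allocate npages contiguous DMA pages; returns a DMA address, or 0 if
 * the window is exhausted. */
static uint64_t dma_window_alloc(unsigned long npages)
{
	unsigned long i, run = 0;

	for (i = 0; i < WINDOW_PAGES; i++) {
		if (window_map[i / 8] & (1 << (i % 8))) {
			run = 0;
		} else if (++run == npages) {
			unsigned long start = i - npages + 1, j;

			for (j = start; j <= i; j++)
				window_map[j / 8] |= 1 << (j % 8);
			return WINDOW_BASE + start * IOMMU_PAGE;
		}
	}
	return 0;
}

static void dma_window_free(uint64_t dma_addr, unsigned long npages)
{
	unsigned long start = (dma_addr - WINDOW_BASE) / IOMMU_PAGE, j;

	for (j = start; j < start + npages; j++)
		window_map[j / 8] &= ~(1 << (j % 8));
}

The point being that map/unmap hands back an address chosen from the window,
rather than accepting whatever DMA address the caller asked for.
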
> >
> > I generally have 2 regions per device: one in 32-bit space of quite
> > limited size (sometimes as small as a 128M window) and one in 64-bit
> > space that I can make quite large if I need to (enough to map all of
> > memory if that's really desired, using large pages or something like
> > that).
> >
> > Now that has various consequences for the interfaces between iommu
> > domains, qemu, and VFIO:
> > - I don't quite see how I can translate the concept of domains and
> > attaching devices to such domains. The basic idea won't work. The
> > domains in my case are essentially pre-existing, not created on-the-fly,
> > and may contain multiple devices, though I suppose I can assume for now
> > that we only support KVM pass-through with 1 device == 1 domain.
> >
> > I don't know how to sort that one out if the userspace or kvm code
> > assumes it can put multiple devices in one domain and they start to
> > magically share the translations...
> >
> > Not sure what the right approach here is. I could make the "Linux"
> > domain some artificial SW construct that contains a list of the real
> > iommus it's "bound" to and establish translations in all of them... but
> > that isn't very efficient. If the guest kernel explicitly uses some
> > iommu PV ops targeting a device, I need to only set up translations for
> > -that- device, not everything in the "domain".
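
Just to illustrate that "SW construct" idea, a sketch; none of these
structures exist in the kernel, and the replay loop is exactly the
inefficiency mentioned above.

/* Illustration only: a Linux-level "domain" that is nothing more than a
 * list of the real per-bridge translation tables, with every map call
 * replayed into each of them. */
#include <stdint.h>

struct hw_iommu_table;			/* one real per-bridge table */

struct hw_iommu_ops {
	int (*map)(struct hw_iommu_table *tbl, uint64_t dma_addr,
		   uint64_t phys_addr, uint64_t size, int prot);
};

struct sw_domain {
	const struct hw_iommu_ops *ops;
	struct hw_iommu_table *tables[8];	/* real iommus "bound" to it */
	int ntables;
};

static int sw_domain_map(struct sw_domain *d, uint64_t dma_addr,
			 uint64_t phys_addr, uint64_t size, int prot)
{
	int i, ret;

	/* Replaying into every table is wasteful: a PV guest that targets
	 * one device only needs the single table sitting behind it. */
	for (i = 0; i < d->ntables; i++) {
		ret = d->ops->map(d->tables[i], dma_addr, phys_addr,
				  size, prot);
		if (ret)
			return ret;
	}
	return 0;
}
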
> >
> > - The code in virt/kvm/iommu.c that assumes it can map the entire guest
> > memory 1:1 in the IOMMU is just not usable for us that way. We -might-
> > be able to do that for 64-bit-capable devices, as we can create quite
> > large regions in the 64-bit space, but at the very least we need some
> > kind of offset, and the guest must know about it...
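
To illustrate the offset point, the 1:1 loop would have to turn into something
like the sketch below, with the guest told about window_offset; the helpers
are stand-in stubs, not the actual virt/kvm/iommu.c code.

/* Illustration only: mapping guest memory when the IOVA has to be
 * "guest physical + window offset" instead of 1:1. */
#include <stdint.h>

#define GUEST_PAGE	4096ULL

/* Stub: translate a guest frame number to a host physical address. */
static uint64_t gfn_to_host_phys(uint64_t gfn)
{
	return gfn * GUEST_PAGE;
}

/* Stub: establish one translation in the device's DMA window. */
static int map_one(uint64_t iova, uint64_t host_phys, int writable)
{
	(void)iova; (void)host_phys; (void)writable;
	return 0;
}

static int map_guest_memory(uint64_t guest_pages, uint64_t window_offset)
{
	uint64_t gfn;
	int ret;

	for (gfn = 0; gfn < guest_pages; gfn++) {
		/* The guest has to know about window_offset, since that is
		 * the address its driver must program into the device. */
		ret = map_one(window_offset + gfn * GUEST_PAGE,
			      gfn_to_host_phys(gfn), 1);
		if (ret)
			return ret;
	}
	return 0;
}
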
> >
> > - Similar deal with all the code that currently assumes it can pick up a
> > "virtual" address and create a mapping from that. Either we provide an
> > allocator, or, if we want to keep the flexibility of userspace/kvm
> > choosing the virtual addresses (preferable), we need to convey some
> > "ranges" information down to the user.
> >
> > - Finally, my guests are always paravirt. There are well-defined H-calls
> > for inserting/removing DMA translations, and we're implementing these
> > since existing kernels already know how to use them. That means that,
> > overall, I might simply not need to use any of the above.
> >
> > I.e., I could have my own infrastructure for iommu, with my H-calls
> > populating the target iommu directly from the kernel (kvm) or qemu (via
> > ioctls in the non-kvm case). That might be the best option... but it
> > would mean somewhat disentangling VFIO from uiommu...
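
For illustration, that PV path boils down to roughly the sketch below: one
H_PUT_TCE-style hcall per translation, written straight into that device's
window, with no generic "domain" layer in between. The types and helper here
are stand-ins, not the real powerpc/kvm code.

/* Illustration only: an H_PUT_TCE-style handler writing one translation
 * entry straight into a device's window. */
#include <stdint.h>

#define H_SUCCESS	0
#define H_PARAMETER	(-4L)
#define TCE_SHIFT	12		/* 4K translation entries */

struct tce_table {
	uint64_t *entries;		/* the real per-device window */
	uint64_t nentries;
};

/* One guest hcall == one entry: ioba is the DMA address the guest chose
 * inside its window, tce is the host physical address plus R/W bits. */
static long h_put_tce(struct tce_table *tbl, uint64_t ioba, uint64_t tce)
{
	uint64_t idx = ioba >> TCE_SHIFT;

	if (idx >= tbl->nentries)
		return H_PARAMETER;

	tbl->entries[idx] = tce;
	return H_SUCCESS;
}
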
> >
> > Any suggestions? Great ideas?
Ben - I don't have any good news for you.

DMA remappers like those on Power and SPARC have been around forever; the new
thing about the Intel/AMD iommus is the per-device address spaces and the
protection inherent in having separate mappings for each device. If one is to
trust a user-level app or virtual machine to program DMA registers directly,
then you really need per-device translation.

That said, early versions of VFIO had a mapping mode that used the normal DMA
API instead of the iommu/uiommu API and assumed that the user was trusted, but
that wasn't interesting for the long term.

So if you want safe device assignment, you're going to need hardware help.
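
For illustration only, that trusted mode amounts to roughly the sketch below:
user pages pushed through the normal DMA API, with the resulting bus address
handed straight back to the (trusted) user to program into the device. This is
a kernel-style sketch, not the actual early-VFIO code; pinning and cleanup are
elided.

/* Illustration only: a "trusted user" mapping path using the normal DMA
 * API.  Whatever bus address dma_map_page() returns goes straight back
 * to the user, so there is no per-device isolation -- which is exactly
 * the problem. */
#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Map one already-pinned user page for the device; returns the bus
 * address to hand back to the user, or 0 on failure. */
static dma_addr_t trusted_map_page(struct device *dev, struct page *page)
{
	dma_addr_t bus;

	bus = dma_map_page(dev, page, 0, PAGE_SIZE, DMA_BIDIRECTIONAL);
	if (dma_mapping_error(dev, bus))
		return 0;

	return bus;
}
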
> >
> > Cheers,
> > Ben.
> >
> >