Message-ID: <20110510130755.GD19909@8bytes.org>
Date: Tue, 10 May 2011 15:07:55 +0200
From: Joerg Roedel <joro@...tes.org>
To: Marek Szyprowski <m.szyprowski@...sung.com>
Cc: linux-kernel@...r.kernel.org, Arnd Bergmann <arnd@...db.de>,
Russell King <rmk@....linux.org.uk>,
FUJITA Tomonori <fujita.tomonori@....ntt.co.jp>,
David Woodhouse <dwmw2@...radead.org>,
Muli Ben-Yehuda <muli@...ibm.com>, x86@...nel.org,
linux-arm-kernel@...ts.infradead.org,
Joerg Roedel <joerg.roedel@....com>,
Kyungmin Park <kyungmin.park@...sung.com>
Subject: Re: [RFC] Generic dma_ops using iommu-api - some thoughts
On Tue, May 10, 2011 at 09:42:18AM +0200, Marek Szyprowski wrote:
> On 2011-05-09 15:24, Joerg Roedel wrote:
>> Type 2: Full-isolation capable IOMMUs. There are only two of them known
>> to me: VT-d and AMD-Vi. These IOMMUs support a full 64 bit
>> device address space and have support for full-isolation. This
>> means that they can configure a separate address space for each
>> device.
>> These IOMMUs may also have support for interrupt remapping, but
>> that feature is outside the scope of the IOMMU-API.
> I think that most IOMMUs on SoCs can also be put into the type 2
> category, at least the one that I'm working with fits there.
Fine, it will probably turn out that we only care about type 2; good to
know that the user base there is growing.
>> The DMA-API manages the address allocator, so it needs to keep track of
>> the allocator state for each domain. This can be solved by storing a
>> private pointer into a domain.
> Embedding the address allocator into the iommu domain seems reasonable
> to me. In my initial POC implementation of the IOMMU driver for the
> Samsung ARM platform I put the default domain and the address allocator
> directly into archdata.
Seems like I was a bit unclear :) What I meant was that the IOMMU driver
is responsible for allocating a default domain for each device and for
providing it via the IOMMU-API. Storing the dev-->domain relation in
dev->archdata certainly makes sense.
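To make this concrete, here is roughly the layout I have in mind (just a
sketch in plain C; all structure and field names are made up for this
mail, they are not existing kernel interfaces):

#include <stddef.h>

/* per-domain address allocator state, owned by the DMA-API layer */
struct iova_allocator {
	unsigned long	*bitmap;	/* one bit per IO page in the aperture */
	unsigned long	 next_bit;	/* next-fit pointer, see below */
	size_t		 nr_pages;	/* current aperture size in IO pages */
};

/* what the IOMMU-API hands out; priv is where the DMA-API parks its state */
struct example_domain {
	void *priv;			/* -> struct iova_allocator */
	/* ... page-table root and other hardware specific bits ... */
};

/* the dev->domain relation lives in the per-device arch data */
struct example_archdata {
	struct example_domain *domain;	/* default domain for this device */
};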
>> Also, the IOMMU driver may need to put multiple devices into the same
>> domain. This is necessary for type 2 IOMMUs too because the hardware
>> may not be able to distinguish between all devices (so it is usually
>> not possible to distinguish between different 32-bit PCI devices on the
>> same bus). Support for different domains is even more limited on type 1
>> IOMMUs. The AMD NB-GART supports only one domain for all devices.
>> Therefore it may be helpful to find the domain associated with one
>> device. This is also needed for the DMA-API to get a pointer to the
>> default domain for each device.
> I wonder if the device's default domain is really a property of the
> IOMMU driver. IMHO it is related more to the specific architecture
> configuration than to the iommu chip itself, especially in the
> embedded world. On the Samsung Exynos4 platform we have separate
> iommu blocks for each multimedia device block. The iommu controllers
> are exactly the same, but the multimedia devices they control have
> different memory requirements in terms of supported address space
> limits or alignment.
> That's why I would prefer to put the device's default iommu domain
> (with the address space allocator and restrictions) into dev->archdata
> instead of extending the iommu api.
If the devices only differ in their address limits and alignment
requirements, then this is fine. Things like the dma_mask and the
coherent_dma_mask
are part of the device structure anyway and a DMA-API implementation
needs to take care of them. As long as all IOMMUs look the same the
IOMMU driver can handle them. Things get more difficult if you want to
manage different types of IOMMUs at the same time. Does this requirement
exist?
>> DMA-API Considerations
>> -----------------------------------------------------------------------
>>
>> The question here is which address allocator should be implemented.
>> Almost all IOMMU drivers today implement a bitmap based allocator. This
>> one has advantages because it is very simple, has proven existing code
>> which can be reused and allows neat optimizations in IOMMU TLB flushing.
>> Flushing the TLB of an IOMMU is usually an expensive operation.
>>
>> On the other hand the bitmap allocator does not scale very well with
>> the size of the remappable area. Therefore the VT-d driver implements
>> a tree-based allocator which can handle a large address space
>> efficiently, but does not allow optimizing IO/TLB flushing.
> How can the IO/TLB flush operation be optimized with a bitmap-based
> allocator? Creating a bitmap for the whole 32-bit area (4GiB) is a
> waste of memory imho, but with such a large address space the size of
> the bitmap can be reduced by using a lower granularity than the page
> size - for example 64KiB, which will reduce the size of the bitmap by
> a factor of 16.
Interesting point. We need to make the IO page size part of the API. But
to answer your question, the bitmap allocator allows optimizing IO-TLB
flushes as follows:
The allocator needs to keep track of a next_bit field which points to
the end of the last allocation made. After every successful allocation
the next_bit pointer is increased. New allocations start searching the
bit-field from the next_bit position instead of 0. When doing this, the
IO-TLB only needs to be flushed in two cases:
a) When next_bit wraps around, because then old mappings may be
reused
b) When addresses beyond the next_bit pointer are freed
Otherwise we would need to do an IO-TLB flush every time we free
addresses, which is more expensive.
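In (simplified) code the two cases look like this, building on the
structures from the sketch earlier in this mail (find_free_range() is a
naive scan, flush_io_tlb() stands for whatever the hardware needs and is
not shown, and the bit set/clear bookkeeping is only hinted at):

/* hardware specific IO-TLB flush for the whole domain, not shown here */
extern void flush_io_tlb(struct example_domain *dom);

#define BITS_PER_ULONG	(8 * sizeof(unsigned long))

/* naive scan for 'pages' consecutive free bits, starting at 'start' */
static long find_free_range(struct iova_allocator *a, unsigned long start,
			    size_t pages)
{
	unsigned long i;
	size_t run = 0;

	for (i = start; i < a->nr_pages; i++) {
		if (a->bitmap[i / BITS_PER_ULONG] & (1UL << (i % BITS_PER_ULONG)))
			run = 0;		/* bit set -> page in use */
		else if (++run == pages)
			return i - pages + 1;	/* found a large enough hole */
	}

	return -1;
}

static long alloc_io_pages(struct example_domain *dom, size_t pages)
{
	struct iova_allocator *a = dom->priv;
	long idx;

	/* next-fit: continue searching where the last allocation ended */
	idx = find_free_range(a, a->next_bit, pages);
	if (idx < 0) {
		/* case a): wrap around, old mappings may be reused -> flush */
		flush_io_tlb(dom);
		idx = find_free_range(a, 0, pages);
		if (idx < 0)
			return -1;	/* aperture exhausted */
	}

	/* ... set the bits for [idx, idx + pages) in a->bitmap ... */
	a->next_bit = idx + pages;

	return idx;
}

static void free_io_pages(struct example_domain *dom, long idx, size_t pages)
{
	struct iova_allocator *a = dom->priv;

	/* ... clear the bits for [idx, idx + pages) in a->bitmap ... */

	/*
	 * Addresses freed behind next_bit are not handed out again before
	 * the next wrap-around, which flushes anyway.  Addresses at or
	 * beyond next_bit could be reallocated right away, so this is
	 * case b) and we have to flush now.
	 */
	if ((unsigned long)idx + pages > a->next_bit)
		flush_io_tlb(dom);
}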
Also, you don't need to keep a bitmap for all 4GB at the same time. The
AMD IOMMU driver also supports aperture sizes up to 4GB, but it manages
the aperture in 128MB chunks. This means that initially the aperture is
only 128MB in size. If an allocation within this range fails, the
aperture is extended by another 128MB chunk, and so on. The aperture can
grow up to
4GB. So this can be optimized too. You can have a look into the AMD
IOMMU driver which implements this (and also the optimized flushing I
explained above).
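Sketched on top of the allocator above (the bitmap and page-table
resizing in extend_aperture() is only hinted at, and the chunk/limit
constants are just examples assuming 4k IO pages):

#define IO_PAGE_SHIFT	12
#define CHUNK_PAGES	((128UL * 1024 * 1024) >> IO_PAGE_SHIFT)	/* 128MB */
#define MAX_PAGES	((4ULL * 1024 * 1024 * 1024) >> IO_PAGE_SHIFT)	/* 4GB */

/* grow the aperture by one 128MB chunk, returns 0 on success */
static int extend_aperture(struct example_domain *dom)
{
	struct iova_allocator *a = dom->priv;

	if (a->nr_pages + CHUNK_PAGES > MAX_PAGES)
		return -1;		/* already at the 4GB maximum */

	/* ... extend a->bitmap and the IO page table by one chunk ... */
	a->nr_pages += CHUNK_PAGES;

	return 0;
}

static long alloc_io_pages_growing(struct example_domain *dom, size_t pages)
{
	long idx;

	/* start small; only grow the aperture when an allocation fails */
	while ((idx = alloc_io_pages(dom, pages)) < 0) {
		if (extend_aperture(dom) < 0)
			return -1;
	}

	return idx;
}

The real code in the AMD IOMMU driver is of course more involved, but
this is the idea.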
Another option is that you only emulate a 128MB aperture and direct-map
everything outside of the aperture. This brings the best performance
because you don't need to remap every time. The downside is that you
lose device isolation.
A combination of both approaches is the third option. You still have the
aperture and the direct-mapping. But instead of direct-mapping
everything in advance you just direct-map the areas the device asks
for. This gives you device-isolation back. It still lowers pressure on
the address allocator but is certainly more expensive and more difficult
to implement (you need to reference-count mappings for example).
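The reference counting for those on-demand direct mappings could look
roughly like this (again only a sketch; one counter per IO page frame,
and identity_map()/identity_unmap() stand for the actual page-table
updates, which are not shown):

/* install/remove an identity mapping for one IO page frame, not shown */
extern void identity_map(struct example_domain *dom, unsigned long pfn);
extern void identity_unmap(struct example_domain *dom, unsigned long pfn);

struct direct_map_state {
	unsigned int *refs;	/* one reference count per IO page frame */
};

static void direct_map_get(struct example_domain *dom,
			   struct direct_map_state *dm,
			   unsigned long pfn, size_t pages)
{
	unsigned long i;

	for (i = pfn; i < pfn + pages; i++)
		if (dm->refs[i]++ == 0)		/* first user: map the page */
			identity_map(dom, i);
}

static void direct_map_put(struct example_domain *dom,
			   struct direct_map_state *dm,
			   unsigned long pfn, size_t pages)
{
	unsigned long i;

	for (i = pfn; i < pfn + pages; i++)
		if (--dm->refs[i] == 0)		/* last user: tear it down */
			identity_unmap(dom, i);
}

In the map path you would then try the aperture first and fall back to
direct_map_get() for everything that does not need remapping.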
>> It remains to be determined which allocator algorithm fits best.
> Bitmap allocators usually use a first-fit algorithm. IMHO the
> allocation algorithm matters only if the address space is small (like
> in the GART case); in other cases there is usually not enough memory
> in the system to cause that much fragmentation of the virtual address
> space.
Yes, fragmentation is not a big issue, even with small address spaces.
Most mappings fit into one page, so they cannot cause fragmentation.
Regards,
Joerg
--