Message-ID: <CAAFQd5C8asfo8wSa=jKvp4Vmg6A83R-vG7vQXtsHyOTADLo+9g@mail.gmail.com>
Date: Thu, 14 Jan 2016 02:33:00 +0900
From: Tomasz Figa <tfiga@...omium.org>
To: Robin Murphy <robin.murphy@....com>
Cc: Douglas Anderson <dianders@...omium.org>,
Russell King <linux@....linux.org.uk>,
Mauro Carvalho Chehab <mchehab@....samsung.com>,
Marek Szyprowski <m.szyprowski@...sung.com>,
Pawel Osciak <pawel@...iak.com>,
Dmitry Torokhov <dmitry.torokhov@...il.com>,
Will Deacon <will.deacon@....com>,
Andrew Morton <akpm@...ux-foundation.org>,
dan.j.williams@...el.com, Carlo Caione <carlo@...one.org>,
Laurent Pinchart <laurent.pinchart+renesas@...asonboard.com>,
mike.looijmans@...ic.nl, Lorenzo Nava <lorenx4@...il.com>,
"linux-arm-kernel@...ts.infradead.org"
<linux-arm-kernel@...ts.infradead.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation

On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@....com> wrote:
> Hi Doug,
>
>
> On 08/01/16 23:05, Douglas Anderson wrote:
>>
>> __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers. In simple video tests I saw it trying to
>> allocate 4,194,304 bytes. The function tries to allocate large chunks
>> in order to optimize IOMMU TLB usage.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks. Imagine memory so fragmented that 4M is free, but with no
>> contiguous chunk bigger than a single page. Further imagine
>> allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>> - Try to allocate order 10 (no retry)
>> - Try to allocate order 9 (no retry)
>> - ...
>> - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>> - Try to allocate order 9 (no retry)
>> - Try to allocate order 8 (no retry)
>> - ...
>> - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of alloc() calls for this case is:
>> sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>> => 9228
>>
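
If I read the old loop right, the progression described above boils
down to something like the sketch below (not the actual
__iommu_alloc_buffer() code; the IOVA and scatterlist handling is left
out and the function name is made up):

#include <linux/bitops.h>
#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/mm.h>

static int old_alloc_sketch(struct page **pages, size_t count, gfp_t gfp)
{
        size_t i = 0;

        while (count) {
                /* Restart from the largest order that still fits... */
                unsigned int order = min_t(unsigned int, __fls(count),
                                           MAX_ORDER - 1);
                struct page *page = NULL;
                unsigned int j;

                /* ...and walk down; orders above 0 get one try, no retry. */
                while (!page && order) {
                        page = alloc_pages(gfp | __GFP_NORETRY | __GFP_NOWARN,
                                           order);
                        if (!page)
                                order--;
                }
                /* Only the final order-0 attempt is allowed to retry. */
                if (!page)
                        page = alloc_pages(gfp, 0);
                if (!page)
                        return -ENOMEM;

                /* Hand the chunk out as order-0 pages. */
                if (order)
                        split_page(page, order);
                for (j = 0; j < (1U << order); j++)
                        pages[i + j] = page + j;

                i += 1U << order;
                count -= 1U << order;
        }
        return 0;
}
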
>> The above is obviously the worst case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases. I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
>> was done on kernel 3.14, so possibly mainline would behave
>> differently).
>>
>> A second problem is that allocating big chunks under memory pressure
>> is just not a great idea unless we really need them. We can make do
>> pretty well with smaller chunks, so it's
>> probably wise to leave bigger chunks for other users once memory
>> pressure is on.
>>
>> Let's adjust the allocation like this:
>>
>> 1. If a big chunk fails, stop trying so hard and bump down to lower
>> order allocations.
>> 2. Don't try useless orders. The whole point of big chunks is to
>> optimize the TLB and it can really only make use of 2M, 1M, 64K and
>> 4K sizes.
>>
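
For reference, points 1 and 2 could take roughly the shape below
(again only a sketch, assuming 4K pages so that 2M/1M/64K/4K map to
orders 9/8/4/0, and reusing the includes and helpers from the sketch
further up; the actual patch may differ in the details):

/* Only the chunk sizes the IOMMU TLB can use: 2M, 1M, 64K and 4K. */
static const unsigned int order_ladder[] = { 9, 8, 4, 0 };

static int new_alloc_sketch(struct page **pages, size_t count, gfp_t gfp)
{
        unsigned int idx = 0;
        size_t i = 0;

        while (count) {
                unsigned int order, j;
                struct page *page;

                /* Never try an order bigger than what is still needed. */
                while (order_ladder[idx] > __fls(count))
                        idx++;
                order = order_ladder[idx];

                /* Point 2: only TLB-relevant orders; order 0 may retry. */
                page = alloc_pages(order ?
                                   gfp | __GFP_NORETRY | __GFP_NOWARN : gfp,
                                   order);
                if (!page) {
                        if (!order)
                                return -ENOMEM;
                        /* Point 1: a failed order is never tried again. */
                        idx++;
                        continue;
                }

                if (order)
                        split_page(page, order);
                for (j = 0; j < (1U << order); j++)
                        pages[i + j] = page + j;

                i += 1U << order;
                count -= 1U << order;
        }
        return 0;
}
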
>> We'll still tend to eat up a bunch of big chunks, but that might be the
>> right answer for some users. A future patch could possibly add a new
>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>> important and that we should use smaller chunks. Presumably this would
>> be a sane strategy for some callers.
>
>
> Now that I've had time to think about it properly:
>
> Reviewed-by: Robin Murphy <robin.murphy@....com>
>
> I just had an absolutely disgusting idea of how to get the same progression
> with just a single variable and no static array, but I'll keep that firmly
> to myself as it's almost IOCCC-grade WTF :D

Just out of curiosity, a bitmap and loop with fls(), clearing the bit
on failure, or something more freaky? :)
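
i.e. something along these lines (completely untested, with try_order()
just standing in for one allocation attempt and the surrounding
bookkeeping omitted):

        /* Orders 9/8/4/0 (2M/1M/64K/4K with 4K pages) in one word. */
        unsigned long orders = BIT(9) | BIT(8) | BIT(4) | BIT(0);
        unsigned int order;
        ...
        order = fls(orders) - 1;        /* highest order not given up on */
        if (!try_order(order))          /* made-up helper, one attempt */
                orders &= ~BIT(order);  /* clear on failure, never retried */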

Anyway:

Reviewed-by: Tomasz Figa <tfiga@...omium.org>

Best regards,
Tomasz