Message-Id: <YX153R.0PENWW3ING7F1@crapouillou.net>
Date: Thu, 25 Nov 2021 17:29:58 +0000
From: Paul Cercueil <paul@...pouillou.net>
To: Jonathan Cameron <jic23@...nel.org>
Cc: Alexandru Ardelean <ardeleanalex@...il.com>,
Lars-Peter Clausen <lars@...afoo.de>,
Michael Hennerich <Michael.Hennerich@...log.com>,
Sumit Semwal <sumit.semwal@...aro.org>,
Christian König <christian.koenig@....com>,
linux-iio@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-media@...r.kernel.org, dri-devel@...ts.freedesktop.org,
linaro-mm-sig@...ts.linaro.org
Subject: Re: [PATCH 11/15] iio: buffer-dma: Boost performance using
write-combine cache setting
Hi Jonathan,
On Sun, Nov 21 2021 at 17:43:20 +0000, Paul Cercueil
<paul@...pouillou.net> wrote:
> Hi Jonathan,
>
> On Sun, Nov 21 2021 at 15:00:37 +0000, Jonathan Cameron
> <jic23@...nel.org> wrote:
>> On Mon, 15 Nov 2021 14:19:21 +0000
>> Paul Cercueil <paul@...pouillou.net> wrote:
>>
>>> We can be certain that the input buffers will only be accessed by
>>> userspace for reading, and output buffers will mostly be accessed by
>>> userspace for writing.
>>
>> Mostly? Perhaps a little more info on why that's not 'only'.
>
> Just like with a framebuffer, it really depends on what the
> application does. In most cases it will just read an input buffer
> sequentially, or write an output buffer sequentially. But then you
> get the exotic application that tries to do something like alpha
> blending, which means read+write. Hence "mostly".
>
>>>
>>> Therefore, it makes more sense to use only fully cached input
>>> buffers, and to use the write-combine cache coherency setting for
>>> output buffers.
>>>
>>> This boosts performance, as the data written to the output buffers
>>> does not have to be sync'd for coherency. It will halve performance
>>> if the userspace application tries to read from the output buffer,
>>> but this should never happen.
>>>
>>> Since we don't need to sync the cache when disabling CPU access
>>> either for input buffers or output buffers, the .end_cpu_access()
>>> callback can be dropped completely.
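To make the idea concrete, here is a rough sketch of the allocation
split (not the actual patch code; the function name and layout are made
up for this email):

    #include <linux/dma-mapping.h>

    /* Sketch only: allocate a block either write-combined (output) or
     * fully cached (input). */
    static void *sketch_alloc_block(struct device *dev, size_t size,
                                    dma_addr_t *dma, bool output,
                                    gfp_t gfp)
    {
            if (output)
                    /* Output: write-combine mapping; no cache maintenance
                     * needed as long as userspace only writes to it. */
                    return dma_alloc_wc(dev, size, dma, gfp);

            /* Input: fully cached; the CPU cache has to be invalidated
             * before reading back what the device wrote. */
            return dma_alloc_noncoherent(dev, size, dma,
                                         DMA_FROM_DEVICE, gfp);
    }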
>>
>> We have an odd mix of coherent and non-coherent DMA in here as you
>> noted, but are you sure this is safe on all platforms?
>
> The mix isn't safe, but using only coherent or only non-coherent
> should be safe, yes.
>
>>
>>>
>>> Signed-off-by: Paul Cercueil <paul@...pouillou.net>
>>
>> Any numbers to support this patch? The mapping types are performance
>> optimisations, so it would be nice to know how much of a difference
>> they make.
>
> Output buffers are definitely faster in write-combine mode. On a
> ZedBoard with an AD9361 transceiver set to 66 MSPS, and buffer/size
> set to 8192, I got about 185 MiB/s before and 197 MiB/s after.
>
> Input buffers... early results are mixed. On ARM32 it does look like
> it is slightly faster to read from *uncached* memory than from cached
> memory. The cache sync does take a long time.
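(To be clear about what "cache sync" means here: for a capture buffer it
is the per-block cache invalidate done before the CPU reads the data the
device just wrote, roughly along these lines -- simplified, not the
actual driver code:)

    #include <linux/dma-mapping.h>
    #include <linux/iio/buffer-dma.h>

    /* Roughly what runs for each completed capture block on a
     * non-coherent system: invalidate the CPU cache lines covering the
     * block so the CPU sees the freshly DMA'd samples. On ARM32 this
     * walks the whole buffer cache line by cache line. */
    static void sketch_sync_block_for_cpu(struct device *dev,
                                          struct iio_dma_buffer_block *block)
    {
            dma_sync_single_for_cpu(dev, block->phys_addr, block->size,
                                    DMA_FROM_DEVICE);
    }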
>
> Other architectures might behave differently; on MIPS, for instance,
> invalidating the cache is a very fast operation, so using cached
> buffers would be a huge performance win.
>
> Setups where the DMA operations are coherent also wouldn't require
> any cache sync, so this patch would be a huge performance win there.
>
> I'll run some more tests next week to have some fresh numbers.
I think I mixed things up before, because I am getting different results
now. Here are some fresh benchmarks, triple-checked, using libiio's
iio_readdev and iio_writedev tools, with 64K-sample buffers at 61.44
MSPS (max. theoretical throughput: 234 MiB/s):
  iio_readdev -b 65536 cf-ad9361-lpc voltage0 voltage1 | pv > /dev/null
  pv /dev/zero | iio_writedev -b 65536 cf-ad9361-dds-core-lpc \
      voltage0 voltage1

Mapping                                   fileio (R/W)   dmabuf (R/W)
Coherent                                  125 / 141      171 / 210
Coherent reads + WC writes                125 / 141      171 / 210
Non-coherent                              119 / 124      159 / 124
Non-coherent reads + WC writes            119 / 140      159 / 210
Non-coherent, no cache sync               156 / 123      234* / 182
Non-coherent reads (no sync) + WC writes  156 / 140      234* / 210

(R/W = read/write throughput in MiB/s; WC = write-combine; * = capped by
the sample rate)

A few things we can deduce from this:
* Write-combine is not available on Zynq/ARM? If it were working, it
should give better performance than the coherent mapping, but it
doesn't seem to do anything at all. At least it doesn't harm
performance.
* Non-coherent + cache invalidation is definitely a good deal slower
than using a coherent mapping, at least on ARM32. However, when the
cache sync is disabled (e.g. if the DMA operations are coherent), the
reads are much faster.
* The new dma-buf based API is a great deal faster than the fileio API.
So in the future we could use coherent reads + write-combine writes,
unless we know the DMA operations are coherent, in which case we'd use
non-coherent reads + write-combine writes.
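As a rough sketch of what that selection could look like (illustrative
only -- dev_is_dma_coherent() is normally reserved for DMA API
internals, so the real code would need a proper way to query this):

    #include <linux/dma-map-ops.h>    /* dev_is_dma_coherent(), sketch only */
    #include <linux/dma-mapping.h>

    static void *sketch_alloc_input(struct device *dev, size_t size,
                                    dma_addr_t *dma, gfp_t gfp)
    {
            if (dev_is_dma_coherent(dev))
                    /* DMA already snoops the caches: cached buffers, no
                     * sync needed, the fastest read case measured above. */
                    return dma_alloc_noncoherent(dev, size, dma,
                                                 DMA_FROM_DEVICE, gfp);

            /* Otherwise use a coherent (uncached) mapping, which beat
             * cached + invalidate on this ARM32 setup. */
            return dma_alloc_coherent(dev, size, dma, gfp);
    }

    /* Output buffers would use dma_alloc_wc() in both cases. */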
Regarding this patch: unfortunately I cannot prove that write-combine
is faster, so I'll just drop it for now.
Cheers,
-Paul