Message-ID: <ad7e9aa7-74a3-449d-8ed9-cb270fd5c718@linaro.org>
Date: Thu, 12 Jun 2025 15:14:32 +0100
From: James Clark <james.clark@...aro.org>
To: Vladimir Oltean <vladimir.oltean@....com>, Arnd Bergmann <arnd@...db.de>
Cc: Frank Li <Frank.li@....com>, Vladimir Oltean <olteanv@...il.com>,
Mark Brown <broonie@...nel.org>, linux-spi@...r.kernel.org,
imx@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/4] spi: spi-fsl-dspi: Use non-coherent memory for DMA
On 12/06/2025 12:15 pm, Vladimir Oltean wrote:
> On Thu, Jun 12, 2025 at 12:05:26PM +0100, James Clark wrote:
>> (No idea why it goes faster when it's under load, but I hope that can be
>> ignored for this test)
>
> Might be because of dynamic CPU frequency scaling as done by the governor.
> If the CPU utilization of spidev_test isn't high enough, the governor
> will prefer lower CPU frequencies. You can try to repeat the test with
> the "performance" governor and/or setting the min frequency equal to the
> max one.
>
That doesn't seem to make a difference; I get the same results with
those settings. Even for the "fixed" DMA test results below there is a
similar small performance increase when stressing the system:
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_min_freq
1300000
...
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_max_freq
1300000
...
# cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
performance
...
> That's why I don't like the DMA mode in DSPI, it's still CPU-bound,
> because the DMA buffers are very small (you can only provide one TX FIFO
> worth of data per DMA transfer, rather than the whole buffer).
Is that right? The FIFO size isn't used in any of the DMA codepaths;
it looks like the whole DMA buffer is filled before initiating the
transfer. And we increase the buffer to 4k in this patchset to fully
use the existing allocation.
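Roughly the pattern I mean, as a sketch (the names here are made up and
this isn't the actual spi-fsl-dspi code, just an illustration of "fill
the whole bounce buffer, then one dmaengine submit"):

#include <linux/dmaengine.h>
#include <linux/minmax.h>
#include <linux/string.h>

struct demo_dma {
	struct dma_chan *chan;
	void *tx_buf;		/* bounce buffer, 4k after this series */
	dma_addr_t tx_dma;
	size_t bufsize;
};

static int demo_dma_submit_tx(struct demo_dma *dma, const void *src,
			      size_t len)
{
	struct dma_async_tx_descriptor *desc;
	size_t chunk = min(len, dma->bufsize);
	dma_cookie_t cookie;

	/* Copy as much of the message as fits into the bounce buffer... */
	memcpy(dma->tx_buf, src, chunk);

	/* ...then kick off a single DMA transfer covering all of it. */
	desc = dmaengine_prep_slave_single(dma->chan, dma->tx_dma, chunk,
					   DMA_MEM_TO_DEV,
					   DMA_PREP_INTERRUPT | DMA_CTRL_ACK);
	if (!desc)
		return -EIO;

	cookie = dmaengine_submit(desc);
	if (dma_submit_error(cookie))
		return -EINVAL;

	dma_async_issue_pending(dma->chan);

	/* The real code would wait for a completion callback here. */
	return chunk;
}

i.e. the per-transfer granularity is the bounce buffer size rather than
one TX FIFO's worth of data.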
>
> FWIW, the XSPI FIFO performance should be higher.
This leads me to realise a mistake in my original figures. My head was
stuck in target mode, where we use DMA, so I forgot to force DMA in
host mode when running the performance tests. The previous figures were
all XSPI mode, so perhaps the small difference in performance was just
down to the layout of the code changing.
Changing it to DMA mode gives figures that make much more sense:
Coherent (4096 byte transfers): 6534 kbps
Non-coherent: 7347 kbps
Coherent (16 byte transfers): 447 kbps
Non-coherent: 448 kbps
Just for comparison, running the same test in XSPI mode:
4096 byte transfers: 2143 kbps
16 byte transfers: 637 kbps
So for small transfers XSPI is slightly better, but for large ones DMA
is much better, with non-coherent memory giving a further ~800 kbps
gain.
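Presumably that gain is the usual cached vs. uncached story: with
non-coherent memory the buffer fill goes through the cache and we only
pay for an explicit sync per transfer, instead of uncached writes for
every byte. Something along these lines (simplified sketch with made-up
names, not the exact code from the series):

#include <linux/dma-mapping.h>

#define DEMO_BUFSIZE	4096

/* Cacheable buffer; needs explicit cache maintenance around DMA. */
static void *demo_alloc_tx(struct device *dev, dma_addr_t *dma_handle)
{
	/*
	 * The coherent equivalent would be:
	 *	dma_alloc_coherent(dev, DEMO_BUFSIZE, dma_handle, GFP_KERNEL);
	 * which is typically uncached on these platforms, so the memcpy()
	 * that fills the buffer is slow.
	 */
	return dma_alloc_noncoherent(dev, DEMO_BUFSIZE, dma_handle,
				     DMA_TO_DEVICE, GFP_KERNEL);
}

static void demo_flush_tx(struct device *dev, dma_addr_t dma_handle,
			  size_t len)
{
	/* Make the CPU's cached writes visible to the DMA engine. */
	dma_sync_single_for_device(dev, dma_handle, len, DMA_TO_DEVICE);
}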
Perhaps we could find the crossover point and then auto-select the
mode depending on the transfer size, but maybe there is latency to
consider too, which could be important.
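If we did go that way, it could be as simple as a per-transfer length
check, e.g. (threshold made up, it would need measuring on real
hardware):

#include <linux/spi/spi.h>

#define DEMO_DMA_MIN_LEN	256	/* hypothetical crossover point */

static bool demo_want_dma(struct spi_transfer *xfer)
{
	/* Small transfers: XSPI FIFO wins; large ones: DMA wins. */
	return xfer->len >= DEMO_DMA_MIN_LEN;
}

Though a pure throughput threshold ignores the latency point above, so
the right number would depend on the workload.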