Message-ID: <c4c44f66-fb9c-4d71-ac35-f3fef75832bd@arm.com>
Date: Mon, 24 Jun 2024 14:32:12 +0100
From: Robin Murphy <robin.murphy@....com>
To: "zhangzekun (A)" <zhangzekun11@...wei.com>, joro@...tes.org,
will@...nel.org, john.g.garry@...cle.com
Cc: iommu@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] iommu/iova: Bettering utilizing cpu_rcaches in no-strict
mode
On 2024-06-24 2:13 pm, zhangzekun (A) wrote:
>
> On 2024/6/24 18:56, Robin Murphy wrote:
>> On 2024-06-24 9:39 am, Zhang Zekun wrote:
>>> Currently, when the iommu is working in non-strict mode,
>>> fq_flush_timeout() always tries to free IOVAs on one cpu. Freeing
>>> the IOVAs from all cpus on one cpu is not cache-friendly to the
>>> iova_rcache, because it first fills up that cpu's cpu_rcache and
>>> then pushes full magazines to the depot; IOVAs in the depot will
>>> finally go to the underlying rbtree once the depot_size grows
>>> greater than num_online_cpus().
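For readers following the layering being described: a freed IOVA first
lands in the caller's per-cpu cache, and only spills further down as
each level fills up. Below is a minimal user-space model of that free
path, with the kernel's locking and per-cpu plumbing stripped out and
IOVA_MAG_SIZE assumed from drivers/iommu/iova.c:

#include <stdbool.h>

#define IOVA_MAG_SIZE	127	/* assumed from drivers/iommu/iova.c */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_cpu_rcache {
	struct iova_magazine *loaded;	/* magazine currently in use */
	struct iova_magazine *prev;	/* one spare magazine */
};

/*
 * Model of the free fast path: the pfn goes into the caller's own
 * loaded magazine; only when that is full must the caller swap in
 * ->prev or push a full magazine to the shared depot, and only an
 * overflowing depot sends IOVAs back to the rbtree slow path.
 */
static bool model_free_iova_fast(struct iova_cpu_rcache *cpu_rcache,
				 unsigned long pfn)
{
	struct iova_magazine *mag = cpu_rcache->loaded;

	if (mag->size < IOVA_MAG_SIZE) {
		mag->pfns[mag->size++] = pfn;	/* cache-hot fast path */
		return true;
	}
	return false;	/* slow path: swap magazines or hit the depot */
}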
>>
>> That is the design intent - if the excess magazines sit in the depot
>> long enough to be reclaimed then other CPUs didn't want them either.
>> We're trying to minimise the amount of unused cached IOVAs sitting
>> around wasting memory, since IOVA memory consumption has proven to be
>> quite significant on large systems.
>
> Hi, Robin,
>
> It does waste some memory after this change, but since we have been
> freeing IOVAs on each cpu in strict mode for years, I think it is
> reasonable to make the IOVA-freeing logic identical to strict mode.
> This patch tries to make the speed of allocating and freeing IOVAs
> roughly the same by better utilizing the iova_rcache; otherwise we
> are more likely to enter the slowpath. The saving in memory
> consumption actually comes at the cost of performance, and I am not
> sure we need such an optimization in non-strict mode, which is
> mainly used for performance reasons.
>
>>
>> As alluded to in the original cover letter, 100ms for IOVA_DEPOT_DELAY
>> was just my arbitrary value of "long enough" to keep the initial
>> implementation straightforward - I do expect that certain workloads
>> might benefit from tuning it differently, but without proof of what
>> they are and what they want, there's little justification for
>> introducing extra complexity and potential user ABI yet.
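(For context, the constant and the reclaim it drives look roughly like
this in current iova.c - an abridged sketch from memory of that code,
not a verbatim quote:)

#define IOVA_DEPOT_DELAY msecs_to_jiffies(100)

static void iova_depot_work_func(struct work_struct *work)
{
	struct iova_rcache *rcache = container_of(work, typeof(*rcache),
						  work.work);
	struct iova_magazine *mag = NULL;
	unsigned long flags;

	spin_lock_irqsave(&rcache->lock, flags);
	/* Only trim the depot while it holds more magazines than CPUs */
	if (rcache->depot_size > num_online_cpus())
		mag = iova_depot_pop(rcache);
	spin_unlock_irqrestore(&rcache->lock, flags);

	if (mag) {
		/* Give the excess IOVAs back and re-arm the timer */
		iova_magazine_free_pfns(mag, rcache->iovad);
		iova_magazine_free(mag);
		schedule_delayed_work(&rcache->work, IOVA_DEPOT_DELAY);
	}
}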
>>
>>> Letting fq_flush_timeout()
>>> free IOVAs on the cpus that called the dma_unmap* APIs can decrease
>>> the overall time spent in fq_flush_timeout() by better utilizing the
>>> iova_rcache and minimizing contention on the iova_rbtree_lock.
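(Purely to illustrate the claim above, not the actual patch: one way to
free on the unmapping cpus would be to kick each cpu to drain its own
flush queue; fq_flush_timeout_percpu() and fq_ring_free_percpu() below
are hypothetical names:)

/* Hypothetical sketch only: drain each cpu's flush queue on that
 * cpu, so the freed IOVAs refill that cpu's own rcache. */
static void fq_flush_timeout_percpu(struct iommu_dma_cookie *cookie)
{
	int cpu;

	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, fq_ring_free_percpu,
					 cookie, 0);	/* don't wait */
}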
>>
>> I would have assumed that a single CPU simply throwing magazines into
>> the depot list from its own percpu cache is quicker, or at least no
>> slower, than doing the same while causing additional
>> contention/sharing by interfering with other percpu caches as well.
>> And where does the rbtree come into that either way? If an observable
>> performance issue actually exists here, I'd like a more detailed
>> breakdown to understand it.
>>
>> Thanks,
>> Robin.
>>
>
> This patch was primarily intended to minimize the chance of the
> softlockup issue in fq_flush_timeout(), which was already described
> earlier in [1] and has been applied in a commercial kernel [2] for
> years.
>
> However, later tests show that this single patch is not enough to
> fix the softlockup issue, since the root cause of the softlockup is
> the underlying iova_rbtree_lock. In our softlockup scenarios, the
> average time spent waiting for this spinlock is about 6ms.
That should already be fixed, though. The only reason for
fq_flush_timeout() to interact with the rbtree at all was due to the
notion of a fixed-size depot which could become full. That no longer
exists since 911aa1245da8 ("iommu/iova: Make the rcache depot scale
better").
Thanks,
Robin.