Message-ID: <c4c44f66-fb9c-4d71-ac35-f3fef75832bd@arm.com>
Date: Mon, 24 Jun 2024 14:32:12 +0100
From: Robin Murphy <robin.murphy@....com>
To: "zhangzekun (A)" <zhangzekun11@...wei.com>, joro@...tes.org,
will@...nel.org, john.g.garry@...cle.com
Cc: iommu@...ts.linux.dev, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] iommu/iova: Bettering utilizing cpu_rcaches in no-strict
mode
On 2024-06-24 2:13 pm, zhangzekun (A) wrote:
>
> On 2024/6/24 18:56, Robin Murphy wrote:
>> On 2024-06-24 9:39 am, Zhang Zekun wrote:
>>> Currently, when the iommu is working in non-strict mode,
>>> fq_flush_timeout() always tries to free IOVAs on one cpu. Freeing
>>> the IOVAs from all cpus on one cpu is not cache-friendly to the
>>> iova_rcache, because it first fills up that cpu's cpu_rcache and
>>> then pushes full magazines to the depot; IOVAs in the depot will
>>> finally go to the underlying rbtree once the depot_size grows
>>> greater than num_online_cpus().
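For readers following the layering being described: a freed IOVA first
lands in the caller's per-cpu cache, and only spills further down as
each level fills up. Below is a minimal user-space model of that free
path, with the kernel's locking and per-cpu plumbing stripped out and
IOVA_MAG_SIZE assumed from drivers/iommu/iova.c:

#include <stdbool.h>

#define IOVA_MAG_SIZE	127	/* assumed from drivers/iommu/iova.c */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_cpu_rcache {
	struct iova_magazine *loaded;	/* magazine currently in use */
	struct iova_magazine *prev;	/* one spare magazine */
};

/*
 * Model of the free fast path: the pfn goes into the caller's own
 * loaded magazine; only when that is full must the caller swap in
 * ->prev or push a full magazine to the shared depot, and only an
 * overflowing depot sends IOVAs back to the rbtree slow path.
 */
static bool model_free_iova_fast(struct iova_cpu_rcache *cpu_rcache,
				 unsigned long pfn)
{
	struct iova_magazine *mag = cpu_rcache->loaded;

	if (mag->size < IOVA_MAG_SIZE) {
		mag->pfns[mag->size++] = pfn;	/* cache-hot fast path */
		return true;
	}
	return false;	/* slow path: swap magazines or hit the depot */
}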
>>
>> That is the design intent - if the excess magazines sit in the depot
>> long enough to be reclaimed then other CPUs didn't want them either.
>> We're trying to minimise the amount of unused cached IOVAs sitting
>> around wasting memory, since IOVA memory consumption has proven to be
>> quite significant on large systems.
>
> Hi, Robin,
>
> It does waste some memory after this change, but since we have been
> freeing IOVAs on each cpu in strict mode for years, I think it is
> reasonable to make the IOVA-freeing logic identical to strict mode.
> This patch tries to make the speed of allocating and freeing IOVAs
> roughly the same by better utilizing the iova_rcache; otherwise we
> are more likely to enter the slowpath. The saving in memory
> consumption actually comes at the cost of performance, and I am not
> sure we need such an optimization in non-strict mode, which is
> mainly used for performance reasons.
>
>>
>> As alluded to in the original cover letter, 100ms for IOVA_DEPOT_DELAY
>> was just my arbitrary value of "long enough" to keep the initial
>> implementation straightforward - I do expect that certain workloads
>> might benefit from tuning it differently, but without proof of what
>> they are and what they want, there's little justification for
>> introducing extra complexity and potential user ABI yet.
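(For context, the constant and the reclaim it drives look roughly like
this in current iova.c - an abridged sketch from memory of that code,
not a verbatim quote:)

#define IOVA_DEPOT_DELAY msecs_to_jiffies(100)

static void iova_depot_work_func(struct work_struct *work)
{
	struct iova_rcache *rcache = container_of(work, typeof(*rcache),
						  work.work);
	struct iova_magazine *mag = NULL;
	unsigned long flags;

	spin_lock_irqsave(&rcache->lock, flags);
	/* Only trim the depot while it holds more magazines than CPUs */
	if (rcache->depot_size > num_online_cpus())
		mag = iova_depot_pop(rcache);
	spin_unlock_irqrestore(&rcache->lock, flags);

	if (mag) {
		/* Give the excess IOVAs back and re-arm the timer */
		iova_magazine_free_pfns(mag, rcache->iovad);
		iova_magazine_free(mag);
		schedule_delayed_work(&rcache->work, IOVA_DEPOT_DELAY);
	}
}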
>>
>>> Letting fq_flush_timeout()
>>> free IOVAs on the cpus that called the dma_unmap* APIs can decrease
>>> the overall time spent in fq_flush_timeout() by better utilizing the
>>> iova_rcache and minimizing contention on the iova_rbtree_lock.
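(Purely to illustrate the claim above, not the actual patch: one way to
free on the unmapping cpus would be to kick each cpu to drain its own
flush queue; fq_flush_timeout_percpu() and fq_ring_free_percpu() below
are hypothetical names:)

/* Hypothetical sketch only: drain each cpu's flush queue on that
 * cpu, so the freed IOVAs refill that cpu's own rcache. */
static void fq_flush_timeout_percpu(struct iommu_dma_cookie *cookie)
{
	int cpu;

	for_each_online_cpu(cpu)
		smp_call_function_single(cpu, fq_ring_free_percpu,
					 cookie, 0);	/* don't wait */
}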
>>
>> I would have assumed that a single CPU simply throwing magazines into
>> the depot list from its own percpu cache is quicker, or at least no
>> slower, than doing the same while causing additional
>> contention/sharing by interfering with other percpu caches as well.
>> And where does the rbtree come into that either way? If an observable
>> performance issue actually exists here, I'd like a more detailed
>> breakdown to understand it.
>>
>> Thanks,
>> Robin.
>>
>
> This patch was primarily intended to minimize the chance of the
> softlockup issue in fq_flush_timeout(), which was already described
> earlier in [1] and has been applied in a commercial kernel [2] for
> years.
>
> However, later tests show that this single patch is not enough to
> fix the softlockup issue, since the root cause of the softlockup is
> the underlying iova_rbtree_lock. In our softlockup scenarios, the
> average time spent waiting for this spinlock is about 6ms.
That should already be fixed, though. The only reason for
fq_flush_timeout() to interact with the rbtree at all was due to the
notion of a fixed-size depot which could become full. That no longer
exists since 911aa1245da8 ("iommu/iova: Make the rcache depot scale
better").
Thanks,
Robin.