Message-ID: <9aba4d8b-6a46-6c08-2568-2e7490723526@oracle.com>
Date:   Mon, 21 Aug 2023 12:35:14 +0100
From:   John Garry <john.g.garry@...cle.com>
To:     Robin Murphy <robin.murphy@....com>, joro@...tes.org
Cc:     will@...nel.org, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, zhangzekun11@...wei.com
Subject: Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

On 16/08/2023 16:10, Robin Murphy wrote:
> On 15/08/2023 2:35 pm, John Garry wrote:
>> On 15/08/2023 12:11, Robin Murphy wrote:
>>>>
>>>> This threshold is the number of online CPUs, right?
>>>
>>> Yes, that's nominally half of the current fixed size (based on all 
>>> the performance figures from the original series seemingly coming 
>>> from a 16-thread machine, 
>>
>> If you are talking about 
>> https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com/,
> 
> No, I mean the *original* rcache patch submission, and its associated 
> paper:
> 
> https://lore.kernel.org/linux-iommu/cover.1461135861.git.mad@cs.technion.ac.il/

oh, that one :)

>> then I think it's a 256-CPU system and the DMA controller has 16 HW 
>> queues. The 16 HW queues are relevant as the per-completion-queue 
>> interrupt handler runs on a fixed CPU from the set of 16 CPUs in the 
>> HW queue interrupt handler affinity mask. What this means is that 
>> while any CPU may alloc an IOVA, only those 16 CPUs handling each HW 
>> queue interrupt will be freeing IOVAs.
>>
>>> but seemed like a fair compromise. I am of course keen to see how 
>>> real-world testing actually pans out.
>>>
>>>>> it's enough of a challenge to get my 4-core dev board with spinning 
>>>>> disk
>>>>> and gigabit ethernet to push anything into a depot at all 😄
>>>>>
>>>>
>>>> I have to admit that I was hoping to also see a more aggressive 
>>>> reclaim strategy, where we also trim the per-CPU rcaches when not in 
>>>> use. Leizhen proposed something like this a long time ago.
>>>
>>> Don't think I haven't been having various elaborate ideas for making 
>>> it cleverer with multiple thresholds and self-tuning, however I have 
>>> managed to restrain myself 😉
>>>
>>
>> OK, understood. My main issue WRT scalability is that the total 
>> number of cacheable IOVAs (per-CPU and depot rcaches) scales up with 
>> the number of CPUs, while many DMA controllers have a fixed maximum 
>> number of in-flight requests.
>>
>> Consider a SCSI storage controller on a 256-CPU system. The in-flight 
>> limit for this example controller is 4096, which would typically 
>> never be fully used, or might not even be usable.
>>
>> For this device, we would need 4096 * 6 [IOVA rcache ranges] = ~24K 
>> cached IOVAs if we were to pre-allocate them all - obviously I am 
>> ignoring that we have the per-CPU rcaches for speed and that it would 
>> not make sense to share one set. However, according to the current 
>> IOVA code, we can in theory cache up to ((256 [CPUs] * 2 [loaded + 
>> prev]) + 32 [depot size]) * 6 [rcache ranges] * 128 [IOVAs per mag] 
>> = ~420K IOVAs. That's ~17x what we would ever need.
>>
>> Something like NVMe is different, as its total requests can scale up 
>> with the CPU count, but only to a limit. I am not sure about network 
>> controllers.
> 
> Remember that this threshold only represents a point at which we 
> consider the cache to have grown "big enough" to start background 
> reclaim - over the short term it is neither an upper nor a lower limit 
> on the cache capacity itself. Indeed it will be larger than the working 
> set of some workloads, but then it still wants to be enough of a buffer 
> to be useful for others which do make big bursts of allocations only 
> periodically.
> 
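
Right, and just to confirm I'm reading that model the same way - the 
threshold only arms background reclaim rather than capping the depot - 
here is a minimal userspace sketch of the trigger-then-trim idea. The 
structure and names below are made up for illustration and are not the 
actual iommu/iova code:

/*
 * Toy model: pushing past the threshold is allowed; it only arms a
 * background reclaim which later trims the depot back down.
 */
#include <stdio.h>

struct depot {
	int nr_mags;	/* magazines currently parked in the depot */
	int threshold;	/* e.g. number of online CPUs */
};

/* Returns nonzero when the caller should schedule background reclaim. */
static int depot_push(struct depot *d)
{
	d->nr_mags++;
	return d->nr_mags > d->threshold;
}

/*
 * Background work: trim back toward the threshold, one magazine at a
 * time, so short bursts above it are tolerated.
 */
static void depot_reclaim(struct depot *d)
{
	while (d->nr_mags > d->threshold)
		d->nr_mags--;	/* real code would free the magazine's IOVAs */
}

int main(void)
{
	struct depot d = { .nr_mags = 0, .threshold = 4 };

	for (int i = 0; i < 10; i++)
		if (depot_push(&d))
			printf("push %d: over threshold, reclaim armed\n", i);

	depot_reclaim(&d);
	printf("after reclaim: %d magazines (threshold %d)\n",
	       d.nr_mags, d.threshold);
	return 0;
}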

It would be interesting to see what zhangzekun finds for this series. He 
was testing on a 5.10-based kernel - things have changed a lot since 
then and I am not really sure what the problem could have been there.
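
And just so the arithmetic in my quoted example further up is easy to 
check, here is a throwaway userspace calculation reproducing the 
numbers (the 256-CPU / 4096-request SCSI controller is of course 
hypothetical, and the constants simply mirror the figures quoted 
above):

/* Back-of-envelope numbers for the hypothetical 256-CPU SCSI example. */
#include <stdio.h>

int main(void)
{
	const long cpus = 256;
	const long rcache_ranges = 6;	/* IOVA rcache ranges */
	const long mag_size = 128;	/* IOVAs per magazine */
	const long depot_mags = 32;	/* current fixed depot size */
	const long inflight = 4096;	/* controller queue depth */

	long needed = inflight * rcache_ranges;
	long cacheable = ((cpus * 2 /* loaded + prev */) + depot_mags) *
			 rcache_ranges * mag_size;

	printf("needed    ~%ld\n", needed);	/* 24576, ~24K */
	printf("cacheable ~%ld\n", cacheable);	/* 417792, ~420K, ~17x */
	return 0;
}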

>> Anyway, this is just something which I think should be considered - 
>> which I guess already has been.
> 
> Indeed, I would tend to assume that machines with hundreds of CPUs are 
> less likely to be constrained on overall memory and/or IOVA space, 

Cheers,
John
