linux-kernel - Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

Message-ID: <1aa1ecad-bdf0-84c8-a37f-94e1d0fb8a03@oracle.com>
Date:   Tue, 15 Aug 2023 14:35:37 +0100
From:   John Garry <john.g.garry@...cle.com>
To:     Robin Murphy <robin.murphy@....com>, joro@...tes.org
Cc:     will@...nel.org, iommu@...ts.linux.dev,
        linux-kernel@...r.kernel.org, zhangzekun11@...wei.com
Subject: Re: [PATCH 0/2] iommu/iova: Make the rcache depot properly flexible

On 15/08/2023 12:11, Robin Murphy wrote:
>>
>> This threshold is the number of online CPUs, right?
> 
> Yes, that's nominally half of the current fixed size (based on all the 
> performance figures from the original series seemingly coming from a 
> 16-thread machine, 

If you are talking about 
https://lore.kernel.org/linux-iommu/20230811130246.42719-1-zhangzekun11@huawei.com/, 
then I think it's a 256-CPU system and the DMA controller has 16 HW 
queues. The 16 HW queues are relevant because the per-completion-queue 
interrupt handler runs on a fixed CPU from the set of 16 CPUs in the HW 
queue interrupt handler affinity mask. This means that while any CPU 
may alloc an IOVA, only the 16 CPUs handling the HW queue interrupts 
will be freeing IOVAs.

> but seemed like a fair compromise. I am of course 
> keen to see how real-world testing actually pans out.
> 
>>> it's enough of a challenge to get my 4-core dev board with spinning disk
>>> and gigabit ethernet to push anything into a depot at all 😄
>>>
>>
>> I have to admit that I was hoping to also see a more aggressive 
>> reclaim strategy, where we also trim the per-CPU rcaches when not in 
>> use. Leizhen proposed something like this a long time ago.
> 
> Don't think I haven't been having various elaborate ideas for making it 
> cleverer with multiple thresholds and self-tuning, however I have 
> managed to restrain myself 😉
> 

OK, understood. My main issue WRT scalability is that the total 
cacheable IOVAs (CPU and depot rcache) scales up with the number of 
CPUs, but many DMA controllers have a fixed number of max in-flight 
requests.

Consider a SCSI storage controller on a 256-CPU system. The in-flight 
limit for this example controller is 4096, which would typically never 
be fully used, and may not even be usable.

For this device, we need 4096 * 6 [IOVA rcache range] = ~24K cached 
IOVAs if we were to pre-allocate them all - obviously I am ignoring that 
we have the per-CPU rcache for speed and it would not make sense to 
share one set. However, according to the current IOVA driver, we can in 
theory cache up to ((256 [CPUs] * 2 [loaded + prev]) + 32 [depot size]) * 
6 [rcache range] * 128 [IOVAs per mag] = ~420K IOVAs. That's ~17x what we 
would ever need.
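
The arithmetic above can be checked with a short sketch. The constants are the figures quoted in this message (6 rcache ranges, 128 IOVAs per magazine, a 32-magazine depot, 2 magazines per CPU); the constant and function names are illustrative only and are not identifiers from the kernel's IOVA code.

```python
# Figures as quoted in this message; names are illustrative assumptions,
# not identifiers from the kernel IOVA driver.
RCACHE_RANGES = 6   # IOVA rcache size ranges
MAG_SIZE = 128      # IOVAs per magazine
DEPOT_MAGS = 32     # fixed depot size, in magazines, per rcache range
MAGS_PER_CPU = 2    # "loaded" + "prev" magazine per CPU, per rcache range


def needed_iovas(max_in_flight):
    """IOVAs needed if every in-flight request had one cached IOVA per range."""
    return max_in_flight * RCACHE_RANGES


def max_cached_iovas(num_cpus):
    """Theoretical upper bound on cached IOVAs (per-CPU magazines plus depot)."""
    mags_per_range = num_cpus * MAGS_PER_CPU + DEPOT_MAGS
    return mags_per_range * RCACHE_RANGES * MAG_SIZE


if __name__ == "__main__":
    needed = needed_iovas(4096)
    cached = max_cached_iovas(256)
    print(f"needed: {needed}, cacheable: {cached}, ratio: {cached / needed:.1f}x")
```

With a fixed in-flight limit, the first figure stays constant while the second grows linearly with the CPU count, which is the scalability concern described above.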

Something like NVMe is different, as its total requests can scale up 
with the CPU count, but only to a limit. I am not sure about network 
controllers.

Anyway, this is just something which I think should be considered - 
which I guess already has been.

> At this point I'm just looking to confirm whether the fundamental 
> concepts are sound, and at least no worse than the current behaviour 
> (hence keeping it split into 2 distinct patches for the sake of review 
> and debugging). If it proves solid then we can absolutely come back and 
> go to town on enhancements later.

Thanks,
John