lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Date: Tue, 21 Jan 2020 09:21:50 -0800 From: Cong Wang <xiyou.wangcong@...il.com> To: Robin Murphy <robin.murphy@....com> Cc: iommu@...ts.linux-foundation.org, LKML <linux-kernel@...r.kernel.org>, Joerg Roedel <joro@...tes.org>, John Garry <john.garry@...wei.com> Subject: Re: [Patch v3 1/3] iommu: avoid unnecessary magazine allocations On Tue, Jan 21, 2020 at 3:11 AM Robin Murphy <robin.murphy@....com> wrote: > > On 18/12/2019 4:39 am, Cong Wang wrote: > > The IOVA cache algorithm implemented in IOMMU code does not > > exactly match the original algorithm described in the paper > > "Magazines and Vmem: Extending the Slab Allocator to Many > > CPUs and Arbitrary Resources". > > > > Particularly, it doesn't need to free the loaded empty magazine > > when trying to put it back to global depot. To make it work, we > > have to pre-allocate magazines in the depot and only recycle them > > when all of them are full. > > > > Before this patch, rcache->depot[] contains either full or > > freed entries, after this patch, it contains either full or > > empty (but allocated) entries. > > How much additional memory overhead does this impose (particularly on > systems that may have many domains mostly used for large, long-term > mappings)? I'm wary that trying to micro-optimise for the "churn network > packets as fast as possible" case may penalise every other case, > potentially quite badly. Lower-end embedded systems are using IOMMUs in > front of their GPUs, video codecs, etc. precisely because they *don't* > have much memory to spare (and thus need to scrape together large > buffers out of whatever pages they can find). The calculation is not complicated: 32 * 6 * 129 * 8 = 198144 bytes, which is roughly 192K, per domain. > > But on the other hand, if we were to go down this route, then why is > there any dynamic allocation/freeing left at all? Once both the depot > and the rcaches are preallocated, then AFAICS it would make more sense > to rework the overflow case in __iova_rcache_insert() to just free the > IOVAs and swap the empty mag around rather than destroying and > recreating it entirely. It's due to the algorithm requires a swap(), which can't be done with statically allocated magzine. I had the same thought initially but gave it up quickly when realized this. If you are suggesting to change the algorithm, it is not a goal of this patchset. I do have plan to search for a better algorithm as the IOMMU performance still sucks (comparing to no IOMMU) after this patchset, but once again, I do not want to change it in this patchset. (My ultimate goal is to find a spinlock-free algorithm, otherwise there is no way to make it close to no-IOMMU performance.) > > Perhaps there's a reasonable compromise wherein we don't preallocate, > but still 'free' empty magazines back to the depot, such that busy > domains will quickly reach a steady-state. In fact, having now dug up > the paper at this point of writing this reply, that appears to be what > fig. 3.1b describes anyway - I don't see any mention of preallocating > the depot. That paper missed a lot of things, it doesn't even recommend a size of a depot or percpu cache. For implementation, we still have to think about those details, including whether to preallocate memory. Thanks.
Powered by blists - more mailing lists