Date:   Thu, 30 Mar 2023 14:04:22 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     Joerg Roedel <joro@...tes.org>, Jakub Kicinski <kuba@...nel.org>,
        Vasant Hegde <vasant.hegde@....com>
Cc:     Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
        iommu@...ts.linux.dev,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Willem de Bruijn <willemb@...gle.com>,
        Saeed Mahameed <saeed@...nel.org>
Subject: Re: AMD IOMMU problem after NIC uses multi-page allocation

On 2023-03-30 08:41, Joerg Roedel wrote:
> Also adding Vasant and Robin.
> 
> Vasant, Robin, any idea?
> 
> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is MTU rounded up
>> to page size. If I run it with 1500B MTU or 9000 MTU everything is
>> fine, slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
>> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?

There is that old issue already mentioned where there seems to be some 
interplay between the IOVA caching and the lazy flush queue, which we 
never really managed to get to the bottom of. IIRC my hunch was that 
with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms 
the rcache depot and gets into a pathological state where it then 
continually thrashes the IOVA rbtree in a fight with the caching system.
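
(To put a rough shape on that: below is a userspace toy of the depot
behaviour I have in mind, not the real iova_rcache code - the magazine
and depot sizes are made up. Once the fixed-size depot is full, every
further bulk free has to fall back to the shared slow path, which in
the real allocator means hitting the rbtree.)

/*
 * Hypothetical userspace toy, not kernel code: a per-CPU magazine
 * cache backed by a fixed-size depot.  MAG_SIZE and DEPOT_SIZE are
 * made-up numbers.  Once the depot is full, every further bulk free
 * has to take the slow path (the rbtree, in the real allocator).
 */
#include <stdio.h>
#include <stdlib.h>

#define MAG_SIZE        128     /* IOVAs per magazine (assumption) */
#define DEPOT_SIZE      32      /* magazines the depot holds (assumption) */

struct magazine { unsigned long pfns[MAG_SIZE]; int nr; };

static struct magazine *depot[DEPOT_SIZE];
static int depot_nr;
static unsigned long slow_path_frees;   /* stand-in for rbtree work */

/* Free one IOVA into the current magazine; spill full magazines to the depot */
static void free_iova_cached(struct magazine **cur, unsigned long pfn)
{
        if ((*cur)->nr == MAG_SIZE) {
                if (depot_nr < DEPOT_SIZE) {
                        depot[depot_nr++] = *cur;       /* fast: park it in the depot */
                        *cur = calloc(1, sizeof(**cur));
                } else {
                        slow_path_frees += MAG_SIZE;    /* depot full: "give back" to the rbtree */
                        (*cur)->nr = 0;
                }
        }
        (*cur)->pfns[(*cur)->nr++] = pfn;
}

int main(void)
{
        struct magazine *cur = calloc(1, sizeof(*cur));

        /* One big flush-queue timeout dumping a million IOVAs at once */
        for (unsigned long pfn = 0; pfn < 1000000; pfn++)
                free_iova_cached(&cur, pfn);

        printf("magazines in depot: %d, IOVAs pushed to slow path: %lu\n",
               depot_nr, slow_path_frees);
        return 0;
}

With enough CPUs all dumping their flush queues at once, practically
everything ends up on that slow path, which is the sort of pathological
state I mean.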

Another (simpler) possibility which comes to mind is if the 9K MTU
(which I guess means 16KB IOVA allocations) puts you up against the
threshold of available 32-bit IOVA space - if you keep using the 16KB
entries then you'll mostly be recycling them out of the IOVA caches,
which is nice and fast. However, once you switch back to 1500 and so
need 2KB IOVAs, you've now got a load of 32-bit IOVA space hogged by
all the 16KB entries still hanging around in the caches, which could
push you into the case where the optimistic 32-bit allocation starts
to fail (but because it *can* fall back to a 64-bit allocation, it's
not going to purge those unused 16KB entries to free up more 32-bit
space). If the 32-bit space then *stays* full, alloc_iova should stay
in fail-fast mode, but if some 2KB allocations land below 32 bits and
eventually get freed back to the tree, then subsequent attempts are
liable to spend ages doing their best to scrape up all the available
32-bit space until it's definitely full again. For that case, [1]
should help.
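
(Again purely to illustrate, here's a toy model of the optimistic
32-bit allocation with 64-bit fallback and a fail-fast marker - the
max32_failed flag below just stands in for the tracking the real
allocator keeps, and none of the sizes or names are the actual iova.c
ones.)

/*
 * Hypothetical userspace toy, not the real iova.c: optimistic
 * allocation below 4GB with a 64-bit fallback and a fail-fast marker.
 * The max32_failed flag stands in for the tracking the real allocator
 * keeps; all sizes and names here are made up.
 */
#include <stdbool.h>
#include <stdio.h>

#define IOVA_32BIT_LIMIT        (1UL << 32)     /* assumes a 64-bit build */

static unsigned long next_free = 0x1000;        /* toy bump-allocator cursor */
static bool max32_failed;                       /* fail-fast marker */

static unsigned long alloc_iova_sketch(unsigned long size, unsigned long limit)
{
        /* Optimistic pass: only search below 4GB if we haven't already failed there */
        if (limit > IOVA_32BIT_LIMIT && !max32_failed) {
                if (next_free + size <= IOVA_32BIT_LIMIT) {
                        unsigned long iova = next_free;
                        next_free += size;
                        return iova;
                }
                /*
                 * 32-bit space exhausted: remember it so later callers can
                 * fail fast.  In a fuller model, freeing an IOVA below 4GB
                 * would clear this again and reopen the expensive search.
                 */
                max32_failed = true;
        }

        /* Fall back to the full 64-bit range */
        if (next_free + size <= limit) {
                unsigned long iova = next_free;
                next_free += size;
                return iova;
        }
        return 0;       /* allocation failure */
}

int main(void)
{
        /* Lots of 16KB mappings until the 32-bit space is used up... */
        while (!max32_failed)
                alloc_iova_sketch(16 * 1024, ~0UL);

        /* ...then a 2KB mapping skips straight to the 64-bit fallback */
        unsigned long iova = alloc_iova_sketch(2 * 1024, ~0UL);
        printf("2KB mapping placed at %#lx (max32_failed=%d)\n",
               iova, max32_failed);
        return 0;
}

The "freed back to the tree" case above is what keeps knocking the
real allocator out of that fail-fast state, so it goes back to
scraping over the whole 32-bit range on each attempt.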

Even in the second case, though, I think hitting the rbtree much at all 
still implies that the caches might not be well-matched to the 
workload's map/unmap pattern, and maybe scaling up the depot size could 
still be the biggest win.

Thanks,
Robin.

[1] 
https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/
