Message-ID: <202ea27f-aa79-66b6-0c80-ba0459eef5bd@arm.com>
Date: Thu, 30 Mar 2023 14:04:22 +0100
From: Robin Murphy <robin.murphy@....com>
To: Joerg Roedel <joro@...tes.org>, Jakub Kicinski <kuba@...nel.org>,
Vasant Hegde <vasant.hegde@....com>
Cc: Suravee Suthikulpanit <suravee.suthikulpanit@....com>,
iommu@...ts.linux.dev,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Willem de Bruijn <willemb@...gle.com>,
Saeed Mahameed <saeed@...nel.org>
Subject: Re: AMD IOMMU problem after NIC uses multi-page allocation

On 2023-03-30 08:41, Joerg Roedel wrote:
> Also adding Vasant and Robin.
>
> Vasant, Robin, any idea?
>
> On Wed, Mar 29, 2023 at 06:14:07PM -0700, Jakub Kicinski wrote:
>> Hi Joerg, Suravee,
>>
>> I see an odd NIC behavior with AMD IOMMU in lazy mode (on 5.19).
>>
>> The NIC allocates a buffer for Rx packets which is MTU rounded up
>> to page size. If I run it with 1500B MTU or 9000 MTU everything is
>> fine, slight but manageable perf hit.
>>
>> But if I flip the MTU to 9k, run some traffic and then go back to 1.5k
>> - 70%+ of CPU cycles are spent in alloc_iova (and children).
>>
>> Does this ring any bells?

There is that old issue already mentioned where there seems to be some
interplay between the IOVA caching and the lazy flush queue, which we
never really managed to get to the bottom of. IIRC my hunch was that
with a sufficiently large number of CPUs, fq_flush_timeout() overwhelms
the rcache depot and gets into a pathological state where it then
continually thrashes the IOVA rbtree in a fight with the caching system.
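
To make that hunch a bit more concrete, here's a rough userspace model of
the rcache free path as I remember it from drivers/iommu/iova.c - the
names and constants are approximations of the real thing (and I've dropped
the 'prev' magazine and all the locking), so treat it as a sketch of the
shape rather than the actual code. The last branch is the interesting one:

#include <stddef.h>
#include <stdlib.h>

#define IOVA_MAG_SIZE	128	/* PFNs per magazine */
#define MAX_GLOBAL_MAGS	32	/* global depot is a fixed-size array */

struct iova_magazine {
	size_t size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_rcache {
	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
	size_t depot_size;
};

/* Stand-in for the slow path: per-PFN removal/insertion in the rbtree,
 * under the rbtree lock. */
static void free_pfns_to_rbtree(struct iova_magazine *mag)
{
	mag->size = 0;
}

/* Free one IOVA PFN back into the caching layer ('loaded' is the per-CPU
 * magazine; the real code also keeps a 'prev' magazine, omitted here). */
static void rcache_insert(struct iova_rcache *rc,
			  struct iova_magazine **loaded, unsigned long pfn)
{
	struct iova_magazine *mag = *loaded;

	if (mag->size == IOVA_MAG_SIZE) {
		struct iova_magazine *new_mag = calloc(1, sizeof(*new_mag));

		if (new_mag && rc->depot_size < MAX_GLOBAL_MAGS) {
			/* Park the full magazine in the global depot. */
			rc->depot[rc->depot_size++] = mag;
			*loaded = mag = new_mag;
		} else {
			/*
			 * Depot full (or no memory): dump all 128 PFNs
			 * straight back into the rbtree and reuse the
			 * magazine. A fq_flush_timeout() burst on a machine
			 * with lots of CPUs can hit this branch over and
			 * over - that's the thrashing I mean above.
			 */
			free(new_mag);
			free_pfns_to_rbtree(mag);
		}
	}
	mag->pfns[mag->size++] = pfn;
}
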
Another (simpler) possibility which comes to mind is if the 9K MTU
(which I guess means 16KB IOVA allocations) puts you up against the
threshold of available 32-bit IOVA space - if you keep using the 16KB
entries then you'll mostly be recycling them out of the IOVA caches,
which is nice and fast. However, once you switch back to 1500 and thus
need 2KB IOVAs, you've now got a load of IOVA space hogged by all the 16KB
entries that are now hanging around in caches, which could push you into
the case where the optimistic 32-bit allocation starts to fail (but
because it *can* fall back to a 64-bit allocation, it's not going to
purge those unused 16KB entries to free up more 32-bit space). If the
32-bit space then *stays* full, alloc_iova should stay in fail-fast
mode, but if some 2KB allocations land below 32 bits and eventually get
freed back to the tree, then subsequent attempts are liable to spend
ages doing their best to scrape up all the available 32-bit space
until it's definitely full again. For that case, [1] should help.
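
For reference, the allocation path I'm talking about looks roughly like
this (paraphrased from memory out of drivers/iommu/dma-iommu.c, so don't
hold me to the exact conditions):

/* Try to get PCI devices a SAC address: the optimistic attempt below
 * 4GB passes flush_rcache == false, i.e. on failure it will NOT purge
 * cached entries (like those stale 16KB ones) out of 32-bit space...
 */
if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
	iova = alloc_iova_fast(iovad, iova_len,
			       DMA_BIT_MASK(32) >> shift, false);

/* ...because it can always fall back to the full mask. Only this second
 * attempt passes flush_rcache == true, and since an allocation with
 * effectively no upper limit essentially never fails, the flush never
 * happens and the stale entries stay where they are.
 */
if (!iova)
	iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift, true);
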
Even in the second case, though, I think hitting the rbtree much at all
still implies that the caches might not be well-matched to the
workload's map/unmap pattern, and maybe scaling up the depot size could
still be the biggest win.
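
If anyone fancies a quick experiment on that front, the depot is just a
fixed-size array at the moment, so a hack along these lines (against
drivers/iommu/iova.c; exact context quoted from memory) should show
whether a bigger depot keeps things out of the rbtree:

--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@
-#define MAX_GLOBAL_MAGS	32	/* magazines per bin */
+#define MAX_GLOBAL_MAGS	128	/* magazines per bin */
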
Thanks,
Robin.
[1]
https://lore.kernel.org/linux-iommu/e9abc601b00e26fd15a583fcd55f2a8227903077.1674061620.git.robin.murphy@arm.com/