Message-ID: <c434b9b4-5a8c-a7a9-4e01-b6d8bd40b918@arm.com>
Date: Thu, 3 Jun 2021 14:09:46 +0100
From: Robin Murphy <robin.murphy@....com>
To: Jussi Maki <joamaki@...il.com>
Cc: jroedel@...e.de, Daniel Borkmann <daniel@...earbox.net>,
netdev@...r.kernel.org, jesse.brandeburg@...el.com, hch@....de,
iommu@...ts.linux-foundation.org, intel-wired-lan@...ts.osuosl.org,
gregkh@...uxfoundation.org, anthony.l.nguyen@...el.com,
bpf <bpf@...r.kernel.org>, davem@...emloft.net
Subject: Re: Regression 5.12.0-rc4 net: ice: significant throughput drop
On 2021-06-03 13:32, Jussi Maki wrote:
> On Wed, Jun 2, 2021 at 2:49 PM Robin Murphy <robin.murphy@....com> wrote:
>>>> Thanks for the quick response & patch. I tried it out and indeed it
>>>> does solve the issue:
>>
>> Cool, thanks Jussi. May I infer a Tested-by tag from that?
>
> Of course!
>
>> Given that the race looks to have been pretty theoretical until now, I'm
>> not convinced it's worth the bother of digging through the long history
>> of default domain and DMA ops movement to figure where it started, much
>> less attempt invasive backports. The flush queue change which made it
>> apparent only landed in 5.13-rc1, so as long as we can get this in as a
>> fix in the current cycle we should be golden - in the meantime, note
>> that booting with "iommu.strict=0" should also restore the expected
>> behaviour.
>>
>> FWIW I do still plan to resend the patch "properly" soon (in all honesty
>> it wasn't even compile-tested!)
>
> BTW, even with the patch there's quite a bit of spinlock contention
> coming from ice_xmit_xdp_ring->dma_map_page_attrs->...->alloc_iova.
> CPU load drops from 85% to 20% (~80Mpps, 64-byte UDP) when the IOMMU
> is disabled. Is this type of overhead to be expected?
Yes, IOVA allocation can still be a bottleneck - the percpu caching
system mostly alleviates it, but certain workloads can still defeat
that, and if you're spending significant time in alloc_iova() rather
than alloc_iova_fast() then it sounds like yours is one of them.
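
(For context, the split being described looks roughly like this in
drivers/iommu/iova.c as of v5.13 - a simplified sketch, not verbatim
source, with error handling elided:)

unsigned long
alloc_iova_fast(struct iova_domain *iovad, unsigned long size,
		unsigned long limit_pfn, bool flush_rcache)
{
	unsigned long iova_pfn;
	struct iova *new_iova;

	/* Fast path: pop a PFN from the per-CPU magazine cache;
	 * no global lock taken in the common case. */
	iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
	if (iova_pfn)
		return iova_pfn;

retry:
	/* Slow path: rbtree walk under iovad->iova_rbtree_lock -
	 * this is the alloc_iova() contention showing up in profiles. */
	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	if (!new_iova) {
		unsigned int cpu;

		if (!flush_rcache)
			return 0;

		/* Last resort: dump every CPU's cached IOVAs back
		 * into the tree and retry once. */
		flush_rcache = false;
		for_each_online_cpu(cpu)
			free_cpu_cached_iovas(cpu, iovad);
		goto retry;
	}

	return new_iova->pfn_lo;
}
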
If you're using small IOVA sizes which *should* be cached, then you
might be running into a pathological case of thrashing the global depot.
I've ranted before about the fixed MAX_GLOBAL_MAGS probably being too
small for systems with more than 16 CPUs, which I imagine you may well
have on a modern AMD system.
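
(For anyone following along, the depot is the fixed-size global layer
sitting between the per-CPU magazines and the rbtree - roughly these
structures in drivers/iommu/iova.c as of v5.13, simplified:)

#define IOVA_MAG_SIZE		128
#define MAX_GLOBAL_MAGS		32	/* magazines per bin */

struct iova_magazine {
	unsigned long size;
	unsigned long pfns[IOVA_MAG_SIZE];
};

struct iova_cpu_rcache {
	spinlock_t lock;
	struct iova_magazine *loaded;
	struct iova_magazine *prev;
};

struct iova_rcache {
	spinlock_t lock;
	unsigned long depot_size;
	struct iova_magazine *depot[MAX_GLOBAL_MAGS];
	struct iova_cpu_rcache __percpu *cpu_rcaches;
};

/* When a CPU's 'loaded' and 'prev' magazines are both empty (alloc)
 * or both full (free), it takes rcache->lock to exchange one with
 * the depot. With enough CPUs hammering the same size class, the
 * depot fills up and whole magazines get spilled straight back into
 * the rbtree - the thrashing referred to above. */
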
If on the other hand your workload is making larger mappings above the
IOVA caching threshold, then please take a look at John's series for
making that tuneable:
https://lore.kernel.org/linux-iommu/1622557781-211697-1-git-send-email-john.garry@huawei.com/
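
(The threshold itself is IOVA_RANGE_CACHE_MAX_SIZE in
drivers/iommu/iova.c - roughly the following as of v5.13, simplified,
meaning with 4K pages anything larger than 32 pages / 128K bypasses
the caches entirely:)

/* log of max cached IOVA range size (in pages) */
#define IOVA_RANGE_CACHE_MAX_SIZE 6

static unsigned long iova_rcache_get(struct iova_domain *iovad,
				     unsigned long size,
				     unsigned long limit_pfn)
{
	unsigned int log_size = order_base_2(size);

	/* Sizes above 2^5 = 32 pages never hit the caches, so every
	 * such mapping pays the full rbtree cost. */
	if (log_size >= IOVA_RANGE_CACHE_MAX_SIZE)
		return 0;

	return __iova_rcache_get(&iovad->rcaches[log_size],
				 limit_pfn - size);
}
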
Cheers,
Robin.