Message-ID: <705f90c3-b933-8863-2124-3fea7fdbd81a@arm.com>
Date:   Wed, 2 Jun 2021 13:48:54 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     Daniel Borkmann <daniel@...earbox.net>,
        Jussi Maki <joamaki@...il.com>
Cc:     jroedel@...e.de, netdev@...r.kernel.org, bpf <bpf@...r.kernel.org>,
        intel-wired-lan@...ts.osuosl.org, davem@...emloft.net,
        anthony.l.nguyen@...el.com, jesse.brandeburg@...el.com, hch@....de,
        iommu@...ts.linux-foundation.org, suravee.suthikulpanit@....com,
        gregkh@...uxfoundation.org
Subject: Re: Regression 5.12.0-rc4 net: ice: significant throughput drop

On 2021-06-02 09:09, Daniel Borkmann wrote:
> On 6/1/21 7:42 PM, Jussi Maki wrote:
>> Hi Robin,
>>
>> On Tue, Jun 1, 2021 at 2:39 PM Robin Murphy <robin.murphy@....com> wrote:
>>>>> The regression shows as a significant drop in throughput as measured
>>>>> with "super_netperf" [0], with bandwidth of ~95Gbps before and
>>>>> ~35Gbps after:
>>>
>>> I guess that must be the difference between using the flush queue
>>> vs. strict invalidation. On closer inspection, it seems to me that
>>> there's a subtle pre-existing bug in the AMD IOMMU driver, in that
>>> amd_iommu_init_dma_ops() actually runs *after* amd_iommu_init_api()
>>> has called bus_set_iommu(). Does the patch below work?
>>
>> Thanks for the quick response & patch. I tried it out and indeed it
>> does solve the issue:

Cool, thanks Jussi. May I infer a Tested-by tag from that?
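
To make the ordering problem above concrete, here is a minimal sketch. The
function names (amd_iommu_init_api(), amd_iommu_init_dma_ops(),
bus_set_iommu()) are the ones discussed in the thread; the body is
simplified and illustrative, not the actual patch, and the real init path
registers more than just the PCI bus:

/*
 * Sketch only: the race is that bus_set_iommu() makes the AMD IOMMU ops
 * visible to devices (so default domains get set up) before
 * amd_iommu_init_dma_ops() has configured the strict-vs-flush-queue
 * behaviour. One obvious shape for a fix is to do the DMA ops setup first.
 */
int __init amd_iommu_init_api(void)
{
	int err;

	/* Configure DMA ops, including the flush-queue policy, before
	 * any device can be attached. */
	amd_iommu_init_dma_ops();

	/* Only now publish the ops; from this point devices may attach
	 * and have their default domains set up. */
	err = bus_set_iommu(&pci_bus_type, &amd_iommu_ops);
	if (err)
		return err;

	return 0;
}

The point is simply that everything a device's default-domain setup depends
on, including the flush-queue policy, needs to be in place before the ops
are published to the bus.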

>> # uname -a
>> Linux zh-lab-node-3 5.13.0-rc3-amd-iommu+ #31 SMP Tue Jun 1 17:12:57
>> UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>> root@...lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
>> 95341.2
>>
>> root@...lab-node-3:~# uname -a
>> Linux zh-lab-node-3 5.13.0-rc3-amd-iommu-unpatched #32 SMP Tue Jun 1
>> 17:29:34 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
>> root@...lab-node-3:~# ./super_netperf 32 -H 172.18.0.2
>> 33989.5
> 
> Robin, probably goes without saying, but please make sure to include ...
> 
> Fixes: a250c23f15c2 ("iommu: remove DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE")
> 
> ... to your fix in [0], maybe along with another Fixes tag pointing to
> the original commit adding this issue. But certainly a250c23f15c2 would
> be good given the regression was uncovered on that one first, so that
> Greg et al have a chance to pick this fix up for stable kernels.

Given that the race looks to have been pretty theoretical until now, I'm 
not convinced it's worth the bother of digging through the long history 
of default domain and DMA ops movement to figure out where it started, much 
less attempt invasive backports. The flush queue change which made it 
apparent only landed in 5.13-rc1, so as long as we can get this in as a 
fix in the current cycle we should be golden - in the meantime, note 
that booting with "iommu.strict=0" should also restore the expected 
behaviour.
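
To spell out that workaround (an example, not from the thread; details vary
by distro), iommu.strict=0 just needs to go on the kernel command line,
e.g. on a GRUB-based system:

# /etc/default/grub -- append to the existing kernel command line:
GRUB_CMDLINE_LINUX="... iommu.strict=0"

# then regenerate the GRUB config (e.g. update-grub) and reboot;
# confirm the parameter took effect with:
#   cat /proc/cmdline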

FWIW I do still plan to resend the patch "properly" soon (in all honesty 
it wasn't even compile-tested!)

Cheers,
Robin.
