linux-kernel - Re: AMD graphics performance regression in 4.15 and later

Message-ID: <d136f940-d807-d2f7-30a7-8617f130912d@amd.com>
Date:   Fri, 6 Apr 2018 19:20:46 +0200
From:   Christian König <christian.koenig@....com>
To:     Jean-Marc Valin <jmvalin@...illa.com>, airlied@...ux.ie,
        alexander.deucher@....com, Felix.Kuehling@....com,
        labbott@...hat.com, akpm@...ux-foundation.org,
        michel.daenzer@....com, dri-devel@...ts.freedesktop.org,
        linux-kernel@...r.kernel.org
Subject: Re: AMD graphics performance regression in 4.15 and later

Am 06.04.2018 um 18:42 schrieb Jean-Marc Valin:
> Hi Christian,
>
> On 04/09/2018 07:48 AM, Christian König wrote:
>> Am 06.04.2018 um 17:30 schrieb Jean-Marc Valin:
>>> Hi Christian,
>>>
>>> Is there a way to turn off these huge pages at boot-time/run-time?
>> Only at compile time by not setting CONFIG_TRANSPARENT_HUGEPAGE.
> Any reason why
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
> doesn't solve the problem?

Because we unfortunately still try to allocate huge pages anyway; the 
allocation just fails in 100% of cases.

That basically gives you both the extra allocation overhead and the 
still-poor throughput.
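
For reference, a minimal sketch of how to check which transparent hugepage mode is currently active. It assumes the usual sysfs format in which the active value is shown in square brackets (e.g. "always madvise [never]"). The helper names are hypothetical. As noted above, seeing "never" here does not stop the driver from attempting the allocation.

```python
import re

# Path taken from the command quoted above.
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def parse_thp_mode(text):
    """Return the active mode, i.e. the word inside square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the active mode from sysfs; None if the file is not readable."""
    try:
        with open(path) as f:
            return parse_thp_mode(f.read())
    except OSError:
        return None


if __name__ == "__main__":
    print("active THP mode:", read_thp_mode())
```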

> Also, I assume that disabling CONFIG_TRANSPARENT_HUGEPAGE will disable
> them for everything and not just what your patch added, right?

Correct, that's why I wrote that disabling SWIOTLBs might be better.

>>> I'm not sure what you mean by "We mitigated the problem by avoiding the
>>> slow coherent DMA code path on almost all platforms on newer kernels". I
>>> tested up to 4.16 and the performance regression is just as bad as it is
>>> for 4.15.
>> Indeed 4.16 still doesn't have that. You could use the
>> amd-staging-drm-next branch or wait for 4.17.
> Is there a way to pull just that change or are there too many
> interactions with other changes?

It adds new detection of whether a memory allocation needs to be coherent 
or not; that is not something you can easily pull into older versions.

>> That isn't related to the GFX hardware, but to your CPU/motherboard and
>> whatever else you have in the system.
> Well, I have an nvidia GPU in the same system (normally only used for
> CUDA) and if I use it instead of my RX 560 then I'm not seeing any
> performance issue with 4.15.

That's because you are probably using the Nvidia binary driver which has 
a completely separate code base.

>> Some part of your system needs SWIOTLB and that makes allocating memory
>> much slower.
> What would that part be? FTR, I have a complete description of my system
> at https://jmvalin.dreamwidth.org/15583.html
>
> I don't know if it's related, but I can maybe see one thing in common
> between my machine and the Core 2 Quad from the other bug report and
> that's the "NUMA part". I have a dual-socket Xeon and (AFAIK) the Core 2
> Quad is made of two two-core CPUs glued together with little
> communication between them.

Yeah, that is probably the reason.

>> Intel doesn't use TTM because they don't have dedicated VRAM, but the
>> open source nvidia driver should be affected as well.
> I'm using the proprietary nvidia driver (because CUDA). Is that supposed
> to be affected as well?

No.

>> We already mitigated that problem and I don't see any solution which
>> will arrive faster than 4.17.
> Is that supposed to make the slowdown unnoticeable or just slightly better?

It completely goes away. The issue with the coherent path is that it 
always tries to allocate the lowest possible memory to make sure that it 
fits into the DMA constraints of all devices in the system.

But since AMD GPUs can handle 40 bits of address, you would need at least 
1TB of memory in the system to trigger that (or a NUMA system where some 
memory is in a low area and some in a high area).
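
The arithmetic behind the 1TB figure, as a rough sketch: a device that can address 40 bits can reach any physical address below 2^40 bytes. The function names are hypothetical and this is a simplified model, not the kernel's actual DMA mask handling.

```python
GPU_ADDRESS_BITS = 40  # from the text above: the GPU handles 40 bits of address


def dma_limit(bits):
    """First physical address that a device with `bits` address bits cannot reach."""
    return 1 << bits


def fits_dma_mask(phys_addr, bits):
    """True if the device can address phys_addr directly (simplified model)."""
    return 0 <= phys_addr < dma_limit(bits)


limit = dma_limit(GPU_ADDRESS_BITS)
print(limit)            # 1099511627776 bytes
print(limit // 2**30)   # 1024 GiB, i.e. 1 TiB

# Memory placed at or above the limit is out of reach without bouncing.
print(fits_dma_mask(limit - 1, GPU_ADDRESS_BITS))  # True
print(fits_dma_mask(limit, GPU_ADDRESS_BITS))      # False
```

Only memory at or above that limit is out of the device's direct reach, which is why the case needs either 1TB of RAM or a memory layout with part of it mapped high.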

Christian.

>> The only quick workaround I can see is to avoid firefox, chrome for
>> example is reported to work perfectly fine.
> Or use an unaffected GPU/driver ;-)
>
> Cheers,
>
> 	Jean-Marc
>