Message-ID: <35c599a3-0042-4f00-52e4-9d17164b93b1@amd.com>
Date:   Fri, 20 Apr 2018 15:40:20 -0400
From:   Felix Kuehling <felix.kuehling@....com>
To:     Michel Dänzer <michel@...nzer.net>,
        Christian König <christian.koenig@....com>,
        Gabriel C <nix.or.die@...il.com>,
        Philip Yang <Philip.Yang@....com>
Cc:     Jean-Marc Valin <jmvalin@...illa.com>,
        Dave Airlie <airlied@...ux.ie>,
        LKML <linux-kernel@...r.kernel.org>,
        dri-devel@...ts.freedesktop.org, alexander.deucher@....com,
        Andrew Morton <akpm@...ux-foundation.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: AMD graphics performance regression in 4.15 and later

[+Philip]

On 2018-04-20 10:47 AM, Michel Dänzer wrote:
> On 2018-04-11 11:37 AM, Christian König wrote:
>> Am 11.04.2018 um 06:00 schrieb Gabriel C:
>>> 2018-04-09 11:42 GMT+02:00 Christian König
>>> <ckoenig.leichtzumerken@...il.com>:
>>>> Am 07.04.2018 um 00:00 schrieb Jean-Marc Valin:
>>>>> Hi Christian,
>>>>>
>>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>>> Feel free to comment since you have a better understanding of what's
>>>>> going on.
>>>>>
>>>>> One last question: right now I'm running 4.15.0 with the "offending"
>>>>> patch reverted. Is that safe to run, or are there possible bad
>>>>> interactions with other changes?
>>>> That should work without problems.
>>>>
>>>> But I just had another idea as well: if you want, you could still test
>>>> the new code path which will be used in 4.17.
>>>>
>>> While Firefox may do some strange things, this is not only about Firefox.
>>>
>>> With your patches my EPYC box is unusable with 4.15 and later kernels.
>>> The whole desktop is acting weird. This one is using
>>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>>
>>> Box is 2 * EPYC 7281 with 128 GB ECC RAM.
>>>
>>> Also a 14C Xeon box with a HD7700 is broken the same way.
>> The hardware is irrelevant for this. We need to know what software stack
>> you use on top of it.
>>
>> E.g. desktop environment/Mesa and DDX version etc...
>>
>>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>>
>>>
>>> Reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
>>> 648bc3574716400acc06f99915815f80d9563783
>>> from a 4.15 kernel makes things work again.
>>>
>>>
>>>> Backporting all the detection logic is too invasive, but you could
>>>> just go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcibly use
>>>> the other code path.
>>>>
>>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>>
>>> Well, you can't really be serious about these suggestions, can you?
>>>
>>> Telling people to #if 0 random code is not a solution.
>> That is for testing and not a permanent solution.
>>
>>> You broke existing working userland with your patches, so please at
>>> least fix that for 4.16.
>>>
>>> I can help test code for 4.17 and later if you wish, but that is a
>>> *different* story.
>> Please test Alex's amd-staging-drm-next branch from
>> git://people.freedesktop.org/~agd5f/linux.
> I think we're still missing something here.
>
> I'm currently running 4.16.2 + the DRM subsystem changes which are going
> into 4.17 (so I have the changes Christian is referring to) with a
> Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
> Some observations:
>
> Firefox, Thunderbird, or worst of all, gnome-shell, can freeze for up to
> about a minute, during which the kernel is spending most of one
> core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
> precise), called from ttm_alloc_new_pages.
Philip debugged a similar problem with a KFD memory stress test about
two weeks ago, where the kernel was seemingly stuck in an infinite loop
trying to allocate huge pages. I'm pasting his analysis for the record:

> [...] it passes huge_flags with GFP_TRANSHUGE to alloc_pages(). This
> seems to hit a corner case inside __alloc_pages_slowpath(): it never
> exits but takes the retry path every time. It can reclaim pages, so
> did_some_progress is set (as a result, no_progress_loops is reset to 0
> on every loop and never reaches MAX_RECLAIM_RETRIES), but it cannot
> complete the huge page allocation under this specific memory pressure.
As a workaround to unblock our release branch testing we removed
transparent huge page allocation from ttm_get_pages. We're seeing this
as far back as 4.13 on our release branch.
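
To make the failure mode in Philip's analysis concrete, here is a minimal
user-space model of the retry logic. It is only a sketch and not the kernel
code: model_slowpath(), struct model_result and the iteration cap are
hypothetical names made up for illustration, and the value 16 for
MAX_RECLAIM_RETRIES is an assumption. The model assumes the huge page
allocation itself never succeeds, and only shows how resetting
no_progress_loops on every pass keeps the give-up limit out of reach.

```c
#include <assert.h>
#include <stdbool.h>

/* Same name as the kernel constant; the value 16 is assumed here. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical result type for this model only. */
struct model_result {
    bool gave_up;    /* true if the no-progress limit ended the loop */
    int  iterations; /* number of retry-loop passes executed */
};

/*
 * Model of the retry loop: the huge page allocation always fails.
 * If reclaim frees some pages on each pass, the no-progress counter is
 * reset and the limit is never reached; iteration_cap only exists so
 * that the model itself terminates.
 */
struct model_result model_slowpath(bool reclaim_makes_progress,
                                   int iteration_cap)
{
    struct model_result r = { false, 0 };
    int no_progress_loops = 0;

    assert(iteration_cap > 0);

    while (r.iterations < iteration_cap) {
        r.iterations++;

        if (reclaim_makes_progress)
            no_progress_loops = 0;   /* progress: counter starts over */
        else
            no_progress_loops++;     /* no progress: move toward the limit */

        if (no_progress_loops > MAX_RECLAIM_RETRIES) {
            r.gave_up = true;        /* the only exit besides the cap */
            break;
        }
    }
    return r;
}
```

In this model, model_slowpath(true, 1000) runs all 1000 passes without ever
giving up, while model_slowpath(false, 1000) gives up after 17 passes.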

If we're really talking about the same problem, I don't think it's
caused by recent page allocator changes, but rather exposed by recent
TTM changes.

Regards,
  Felix

>
> At least in the case of Firefox, this happens due to Mesa internal BO
> allocations for glTex(Sub)Image, so it's not obvious that Firefox is
> doing something wrong.
>
> I never noticed this before this week. Before, I was running 4.15.y +
> DRM subsystem changes from 4.16. Maybe something has changed in core
> code, trying harder to allocate huge pages.
>
>
> Maybe TTM should only try to use any huge pages that happen to be
> available, not spend any (/ "too much"?) additional effort trying to
> free up huge pages?
>
>
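
The idea in the last quoted paragraph (use huge pages only when they happen
to be available, and otherwise fall back to small pages without extra
effort) can be sketched as a small user-space model. This is not TTM code:
struct pool_model, struct alloc_stats, alloc_opportunistic() and
PAGES_PER_HUGE are hypothetical names for illustration. In the kernel, such
a policy would presumably be expressed through GFP flags that avoid reclaim
(for example GFP_TRANSHUGE_LIGHT or __GFP_NORETRY), but that mapping is an
assumption and not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Small pages per huge page: 512 matches 2 MiB / 4 KiB; assumed here. */
#define PAGES_PER_HUGE 512

/* Hypothetical snapshot of what is free right now, with no reclaim. */
struct pool_model {
    size_t free_huge;   /* huge pages immediately available */
    size_t free_small;  /* small pages immediately available */
};

/* How a request ended up being satisfied. */
struct alloc_stats {
    size_t huge_used;
    size_t small_used;
};

/*
 * Satisfy a request for npages small pages. Huge pages are taken only if
 * they are already free; nothing is done to create more of them. The
 * remainder comes from small pages. Returns 0 on success, or -1 (leaving
 * the pool untouched) if the small pages cannot cover the remainder.
 */
int alloc_opportunistic(struct pool_model *pool, size_t npages,
                        struct alloc_stats *out)
{
    size_t want_huge, huge, small;

    assert(pool != NULL && out != NULL);

    want_huge = npages / PAGES_PER_HUGE;
    huge = want_huge < pool->free_huge ? want_huge : pool->free_huge;
    small = npages - huge * PAGES_PER_HUGE;

    if (small > pool->free_small)
        return -1;

    pool->free_huge -= huge;
    pool->free_small -= small;
    out->huge_used = huge;
    out->small_used = small;
    return 0;
}
```

The point of the model is that the cost of a request is bounded: a lack of
free huge pages changes how the request is split, not how long it takes.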
