linux-kernel - Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume


Message-Id: <DADO8D07ZTFD.1A1L9QSSMDTXR@kode54.net>
Date: Wed, 04 Jun 2025 03:19:35 -0700
From: "Christopher Snowhill" <chris@...e54.net>
To: "Philipp Reisner" <philipp.reisner@...bit.com>
Cc: Christian König <christian.koenig@....com>, "Philipp
 Stanner" <pstanner@...hat.com>, <dri-devel@...ts.freedesktop.org>,
 <linux-kernel@...r.kernel.org>, "Simona Vetter" <simona@...ll.ch>, "Danilo
 Krummrich" <dakr@...nel.org>, "Philipp Stanner" <phasta@...nel.org>,
 "dri-devel" <dri-devel-bounces@...ts.freedesktop.org>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

On Mon Jun 2, 2025 at 3:25 AM PDT, Philipp Reisner wrote:
> Hi Christopher,
>
> Thanks for following up. The bug still annoys me from time to time.
> It triggered last on May 8, May 12, and May 18.
> The crash on May 18 was already with the 6.14.5 kernel.
>
>> Could this sleep wake issue also be caused by a similar thing to the
>> panics and SMU hangs I was experiencing with my own issue? It's an issue
>> known to have the same workaround for both 6000 and 7000 series users. A
>> specific kernel commit seems to affect it as well.
>>
>
> I posted the stack trace earlier in the thread. The question is, what
> was the stack trace of the issue you are referring to?
>
>>
>> If you could test whether you can still reproduce the error after
>> disabling GFXOFF states with the following kernel commandline override:
>>
>> amdgpu.ppfeaturemask=0xfff73fff
>>
>
> That disables PP_OVERDRIVE_MASK, PP_GFXOFF_MASK,
> and PP_GFX_DCS_MASK.
>
> IMHO, that looks like a mitigation for something different than the non-ready
> compute schedulers that seem to be the root cause for the NULL pointer derefs
> in my case.

Indeed, it's mitigating something that leads to SMU firmware hangs. I
made a guess, and probably a poor one, that your compute units may be
failing to wake up due to an SMU hang. But you have no SMU hang log
notices, so it's probably not that. Oh well.
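
For reference, the decoding of the mask quoted above can be checked with a
few lines of Python. This is only a sketch: the bit values are what I
believe enum PP_FEATURE_MASK in drivers/gpu/drm/amd/include/amd_shared.h
assigns to these three names, and should be verified against the kernel
tree in use. The helper names are made up for illustration.

```python
# Assumed bit values for three PP_FEATURE_MASK flags; check them against
# drivers/gpu/drm/amd/include/amd_shared.h in your kernel tree.
PP_FEATURE_BITS = {
    "PP_OVERDRIVE_MASK": 0x4000,
    "PP_GFXOFF_MASK": 0x8000,
    "PP_GFX_DCS_MASK": 0x80000,
}


def cleared_bits(mask: int) -> int:
    """Return the bits that are zero in a 32-bit ppfeaturemask value."""
    return ~mask & 0xFFFFFFFF


def cleared_features(mask: int) -> list[str]:
    """Return the names of the known features whose bit is cleared in mask."""
    return [name for name, bit in PP_FEATURE_BITS.items() if not mask & bit]


if __name__ == "__main__":
    mask = 0xFFF73FFF  # value passed as amdgpu.ppfeaturemask=0xfff73fff
    print(hex(cleared_bits(mask)))
    for name in cleared_features(mask):
        print("disabled:", name)
```

Under those assumed values, the cleared bits of 0xfff73fff are exactly the
three flags named in the quoted reply, with no other feature bit touched.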

>
> Anyhow, I will give it a try, and will report back if my workstation
> does not deref NULL pointers for more than three weeks with that
> amdgpu.ppfeaturemask set.
>
> Best regards,
>  Philipp