linux-kernel - Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

Message-ID: <ede4dabb-d3e9-45bf-8e56-aebbb8a37ae5@amd.com>
Date: Mon, 13 Jan 2025 09:32:13 +0100
From: Christian König <christian.koenig@....com>
To: Alex Deucher <alexdeucher@...il.com>
Cc: Philipp Reisner <philipp.reisner@...bit.com>,
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

On 10.01.25 at 16:10, Alex Deucher wrote:
> On Fri, Jan 10, 2025 at 9:48 AM Christian König
> <christian.koenig@....com> wrote:
>> On 10.01.25 at 15:32, Philipp Reisner wrote:
>>> [...]
>>>> Take a look at those messages right before the crash:
>>>>
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
>>>> skipping
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
>>>> skipping
>>>>
>>>> That is basically 100% certain confirmation that an application tries to
>>>> use the device before those compute queues are resumed.
>>>>
>>>> Can I have a full dmesg? Maybe the resume is canceled or aborted for
>>>> some reason.
>>>>
>>> Yes, of course. I have made the files available here:
>>> https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>> Ah! That suddenly makes much more sense.
>>
>> Here is the root cause:
>>
>> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
>> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
>> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
>> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
>> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
>> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
>> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
>> [111315.308270] [drm] VCE initialized successfully.
>> [111315.447494] PM: resume devices took 2.306 seconds
>> [111315.447865] OOM killer enabled.
>>
>> I'm surprised that this works at all. For some reason the graphics queue
>> works, but the compute queues fail to resume.
>>
>> @Alex what do we do about that? We could return an error when not all
>> rings come up again after resume, but that will probably result in a
>> number of complaints.
> Maybe return an error if all of the rings of a particular type fail,
> but if only some do, we should be able to deal with that.  We
> currently set up 8 compute rings.  We probably don't need that many.
> Maybe just two (high and low priority).

Reducing the number of queues would make the problem even more severe
instead of helping, since you then have an even smaller chance of
successfully resuming.

Currently we don't abort resume when the compute queues don't resume, 
but this leads to a crash later on.

The issue is that when we start to abort resume, the end user experience
doesn't really improve; we just avoid the crash.

Either we need to tell Mesa to stop using the compute queues by default
(what is that good for anyway?) or we need to get the compute queues
working reliably after a resume.

Christian.
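
[Editorial sketch, not part of the original message.] The policy Alex
proposes in the quoted text (fail the resume only when every ring of a
given type fails its test, and otherwise just mark the failed rings as
not ready) could look roughly like the standalone C below. All names
here (ring_type, ring_result, check_rings_after_resume) are
hypothetical and chosen for illustration; this is not amdgpu code and
does not use any kernel API. The -110 in the log excerpts corresponds
to -ETIMEDOUT on Linux.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ring classes for illustration; not the amdgpu enum. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_TYPE_COUNT };

/* Outcome of one per-ring test after resume (hypothetical struct). */
struct ring_result {
	enum ring_type type;
	int test_ret; /* 0 on success, negative errno on failure */
};

/*
 * Apply the "all rings of a type failed" rule.
 *
 * If ready is non-NULL, ready[i] is set to true only for rings whose
 * test passed, so later submissions could skip the others.
 *
 * Returns 0 if every ring type that has at least one ring also has at
 * least one working ring, and -ENODEV otherwise.
 */
int check_rings_after_resume(const struct ring_result *rings, size_t n,
			     bool *ready)
{
	size_t total[RING_TYPE_COUNT] = { 0 };
	size_t ok[RING_TYPE_COUNT] = { 0 };

	for (size_t i = 0; i < n; i++) {
		bool good = rings[i].test_ret == 0;

		total[rings[i].type]++;
		if (good)
			ok[rings[i].type]++;
		if (ready)
			ready[i] = good;
	}

	/* A type with rings but no survivor is treated as fatal. */
	for (int t = 0; t < RING_TYPE_COUNT; t++) {
		if (total[t] > 0 && ok[t] == 0)
			return -ENODEV;
	}
	return 0;
}
```

Under this rule, a resume in which only some compute rings time out
would still succeed with those rings marked not ready, while a resume
in which no compute ring passes would be reported as an error. As
noted above, that only avoids the later crash; it does not by itself
make the compute queues come back.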

>
> Alex
>
>> Regards,
>> Christian.
>>
>>
>>> best regards,
>>>    Philipp