Message-ID: <3c6e53ae-6998-47f8-ae37-9e68553ad918@amd.com>
Date: Fri, 10 Jan 2025 15:47:55 +0100
From: Christian König <christian.koenig@....com>
To: Philipp Reisner <philipp.reisner@...bit.com>
Cc: Alex Deucher <alexdeucher@...il.com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
On 10.01.25 at 15:32, Philipp Reisner wrote:
> [...]
>> Take a look at those messages right before the crash:
>>
>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
>> skipping
>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
>> skipping
>>
>> That is basically 100% confirmation that an application tries to
>> use the device before those compute queues are resumed.
>>
>> Can I have a full dmesg? Maybe the resume is canceled or aborted for
>> some reason.
>>
> Yes, of course. I have made the files available here:
> https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
Ah! That suddenly makes much more sense.
Here is the root cause:
[111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
[111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
[111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
[111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
[111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
[111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
[111315.207293] [drm] UVD and UVD ENC initialized successfully.
[111315.308270] [drm] VCE initialized successfully.
[111315.447494] PM: resume devices took 2.306 seconds
[111315.447865] OOM killer enabled.
I'm surprised that this works at all. For some reason the graphics queue
works, but the compute queues fail to resume.
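For anyone who wants to check their own dmesg for the same pattern, here is a rough sketch in Python. It only looks for the "ring ... test failed" messages quoted above and decodes the error code; on Linux, errno 110 is ETIMEDOUT. The helper name failed_rings and the small errno table are made up for this illustration and are not part of amdgpu or any kernel tooling.

```python
import re

# Linux errno value seen in the log above; 110 is ETIMEDOUT on Linux.
# (Illustrative table only, not a complete errno list.)
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches e.g. "ring comp_1.1.0 test failed (-110)"
RING_FAIL_RE = re.compile(r"ring (\S+) test failed \((-?\d+)\)")


def failed_rings(dmesg_text):
    """Return a list of (ring_name, errno_name) for every failed ring test."""
    # A mail client may wrap one kernel message over two lines,
    # so collapse all whitespace before matching.
    flat = " ".join(dmesg_text.split())
    results = []
    for ring, code in RING_FAIL_RE.findall(flat):
        err = abs(int(code))
        results.append((ring, LINUX_ERRNO_NAMES.get(err, f"errno {err}")))
    return results


sample = """\
[111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
[111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
[amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
"""

for ring, err in failed_rings(sample):
    print(f"{ring}: {err}")
```

In the log above, every failing ring is a comp_* ring and every code is -110, i.e. the ring tests timed out during resume.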
@Alex what do we do about that? We could return an error when not all
rings come up again after resume, but that will probably result in a
number of complaints.
Regards,
Christian.
>
> best regards,
> Philipp