Message-ID: <ede4dabb-d3e9-45bf-8e56-aebbb8a37ae5@amd.com>
Date: Mon, 13 Jan 2025 09:32:13 +0100
From: Christian König <christian.koenig@....com>
To: Alex Deucher <alexdeucher@...il.com>
Cc: Philipp Reisner <philipp.reisner@...bit.com>,
 dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
 Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

On 10.01.25 at 16:10, Alex Deucher wrote:
> On Fri, Jan 10, 2025 at 9:48 AM Christian König
> <christian.koenig@....com> wrote:
>> On 10.01.25 at 15:32, Philipp Reisner wrote:
>>> [...]
>>>> Take a look at those messages right before the crash:
>>>>
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
>>>> skipping
>>>> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
>>>> skipping
>>>>
>>>> That is basically 100% confirmation that an application tries to
>>>> use the device before those compute queues are resumed.
>>>>
>>>> Can I have a full dmesg? Maybe the resume is canceled or aborted for
>>>> some reason.
>>>>
>>> Yes, of course. I have made the files available here:
>>> https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>> Ah! That suddenly makes much more sense.
>>
>> Here is the root cause:
>>
>> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
>> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
>> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
>> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
>> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
>> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
>> [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
>> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
>> [111315.308270] [drm] VCE initialized successfully.
>> [111315.447494] PM: resume devices took 2.306 seconds
>> [111315.447865] OOM killer enabled.
>>
>> I'm surprised that this works at all. For some reason the graphics queue
>> works, but the compute queues fail to resume.
>>
>> @Alex what do we do about that? We could return an error when not all
>> rings come up again after resume, but that will probably result in a
>> number of complaints.
> Maybe return an error if all of the rings of a particular type fail,
> but if only some do, we should be able to deal with that.  We
> currently set up 8 compute rings.  We probably don't need that many.
> Maybe just two (high and low priority).

Reducing the number of queues would make the problem worse instead of
helping, since you then have an even smaller chance of successfully
resuming.

Currently we don't abort resume when the compute queues don't resume, 
but this leads to a crash later on.

The issue is that if we start aborting the resume, the end-user
experience doesn't really improve; we just avoid the crash.

Either we need to tell Mesa to stop using the compute queues by default
(what is that good for anyway?) or we need to get the compute queues
working reliably after a resume.

Christian.

>
> Alex
>
>> Regards,
>> Christian.
>>
>>
>>> best regards,
>>>    Philipp
