linux-kernel - Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CADnq5_Md7XWOHSru-0gR8xys3nwG4whSng-LcxafG3SiC8G0qw@mail.gmail.com>
Date: Fri, 10 Jan 2025 10:10:01 -0500
From: Alex Deucher <alexdeucher@...il.com>
To: Christian König <christian.koenig@....com>
Cc: Philipp Reisner <philipp.reisner@...bit.com>, dri-devel@...ts.freedesktop.org, 
	linux-kernel@...r.kernel.org, Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

On Fri, Jan 10, 2025 at 9:48 AM Christian König
<christian.koenig@....com> wrote:
>
> Am 10.01.25 um 15:32 schrieb Philipp Reisner:
> > [...]
> >> Take a look at those messages right before the crash:
> >>
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready,
> >> skipping
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready,
> >> skipping
> >>
> >> That is basically a 100% certain confirm that an application tries to
> >> use the device before before those compute queues are resumed.
> >>
> >> Can I have a full dmesg? Maybe the resume is canceled or aborted for
> >> some reason.
> >>
> > Yes, of course. I have made the files available here:
> > https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>
> Ah! That suddenly makes much more sense.
>
> Here is the root cause:
>
> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper
> [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
> [111315.308270] [drm] VCE initialized successfully.
> [111315.447494] PM: resume devices took 2.306 seconds
> [111315.447865] OOM killer enabled.
>
> I'm surprised that this works at all. For some reason the graphics queue
> works, but the compute queues fail to resume.
>
> @Alex what do we do about that? We could return an error when not all
> rings come up again after resume, but that will probably result in a
> number of complains.

Maybe return an error if all of the rings of a particular type fail,
but if only some do, we should be able to deal with that.  We
currently set up 8 compute rings.  We probably don't need that many.
Maybe just two (high and low priority).

Alex

>
> Regards,
> Christian.
>
>
> >
> > best regards,
> >   Philipp
>