linux-kernel - Re: Radeon regression in 6.6 kernel

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <05007cb0-871e-4dc7-af58-1351f4ba43e2@gmail.com>
Date:   Wed, 29 Nov 2023 11:20:58 -0500
From:   Luben Tuikov <ltuikov89@...il.com>
To:     Alex Deucher <alexdeucher@...il.com>
Cc:     Phillip Susi <phill@...susis.net>,
        Linux regressions mailing list <regressions@...ts.linux.dev>,
        Christian König <ckoenig.leichtzumerken@...il.com>,
        linux-kernel@...r.kernel.org,
        "amd-gfx@...ts.freedesktop.org" <amd-gfx@...ts.freedesktop.org>,
        dri-devel@...ts.freedesktop.org,
        Alex Deucher <alexander.deucher@....com>,
        Christian König <christian.koenig@....com>,
        Danilo Krummrich <dakr@...hat.com>
Subject: Re: Radeon regression in 6.6 kernel

On 2023-11-29 08:50, Alex Deucher wrote:
> On Tue, Nov 28, 2023 at 11:45 PM Luben Tuikov <ltuikov89@...il.com> wrote:
>>
>> On 2023-11-28 17:13, Alex Deucher wrote:
>>> On Mon, Nov 27, 2023 at 6:24 PM Phillip Susi <phill@...susis.net> wrote:
>>>>
>>>> Alex Deucher <alexdeucher@...il.com> writes:
>>>>
>>>>>> In that case those are the already known problems with the scheduler
>>>>>> changes, aren't they?
>>>>>
>>>>> Yes.  Those changes went into 6.7 though, not 6.6 AFAIK.  Maybe I'm
>>>>> misunderstanding what the original report was actually testing.  If it
>>>>> was 6.7, then try reverting:
>>>>> 56e449603f0ac580700621a356d35d5716a62ce5
>>>>> b70438004a14f4d0f9890b3297cd66248728546c
>>>>
>>>> At some point it was suggested that I file a gitlab issue, but I took
>>>> this to mean it was already known and being worked on.  -rc3 came out
>>>> today and still has the problem.  Is there a known issue I could track?
>>>>
>>>
>>> At this point, unless there are any objections, I think we should just
>>> revert the two patches
>> Uhm, no.
>>
>> Why "the two" patches?
>>
>> This email, part of this thread,
>>
>> https://lore.kernel.org/all/87r0kircdo.fsf@vps.thesusis.net/
>>
>> clearly states that reverting *only* this commit,
>> 56e449603f0ac5 drm/sched: Convert the GPU scheduler to variable number of run-queues
>> *does not* mitigate the failed suspend. (Furthermore, this commit doesn't really change
>> anything operational, other than using an allocated array, instead of a static one, in DRM,
>> while the 2nd patch is solely contained within the amdgpu driver code.)
>>
>> Leaving us with only this change,
>> b70438004a14f4 drm/amdgpu: move buffer funcs setting up a level
>> to be at fault, as the kernel log attached in the linked email above shows.
>>
>> The conclusion is that only b70438004a14f4 needs reverting.
> 
> b70438004a14f4 was a fix for 56e449603f0ac5.  Without b70438004a14f4,
> 56e449603f0ac5 breaks amdgpu.

It doesn't "break" it, amdgpu just needs to be fixed.

I know we put in a Fixes tag in 
b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level"
pointing to 56e449603f0ac5 "drm/sched: Convert the GPU scheduler to variable number of run-queues",
but given the testing Phillip has done, the culprit is wholly contained in
the amdgpu driver code.

No other driver has this problem since commit 56e449603f0ac5.

The Fixes tag in b70438004a14f4 "drm/amdgpu: move buffer funcs setting up a level" should've ideally
pointed to an amdgpu-driver code commit only (perhaps an old-old commit), and I was a bit uncomfortable
putting in a Fixes tag which pointed to drm code, but we did it so that the amdgpu commit follows
the changes in DRM. In retrospect, the Fixes tag should've pointed to and amdgpu-driver commit when
that the amdgpu code was originally written.

I remember that the problem was really that amdgpu called drm_sched_entity_init(),
in amdgpu_ttm_set_buffer_funcs_status() without actually having initialized the scheduler
used therein. For instance, the code before commit b70438004a14f4, looked like this:

void amdgpu_ttm_set_buffer_funcs_status(struct amdgpu_device *adev, bool enable)
{
	struct ttm_resource_manager *man = ttm_manager_type(&adev->mman.bdev, TTM_PL_VRAM);
	uint64_t size;
	int r;

	if (!adev->mman.initialized || amdgpu_in_reset(adev) ||
	    adev->mman.buffer_funcs_enabled == enable)
		return;

	if (enable) {
		struct amdgpu_ring *ring;
		struct drm_gpu_scheduler *sched;

		ring = adev->mman.buffer_funcs_ring;
		sched = &ring->sched;                             <-- LT: No one has initialized this scheduler
		r = drm_sched_entity_init(&adev->mman.entity,     <-- Oopses, now that sched->sched_rq is not a static array
					  DRM_SCHED_PRIORITY_KERNEL, &sched,
					  1, NULL);
		if (r) {
			DRM_ERROR("Failed setting up TTM BO move entity (%d)\n",
				  r);
			return;
		}


Before commit 56e449603f0ac5, amdgpu was getting away with this, because the sched->sched_rq
was a static array.

Ideally, amdgpu code would be fixed.
-- 
Regards,
Luben

Download attachment "OpenPGP_0x4C15479431A334AF.asc" of type "application/pgp-keys" (665 bytes)

Download attachment "OpenPGP_signature.asc" of type "application/pgp-signature" (237 bytes)