[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <eb5f3198-7625-40f4-bc23-cac969664e85@amd.com>
Date: Wed, 8 Jan 2025 09:19:07 +0100
From: Christian König <christian.koenig@....com>
To: Philipp Reisner <philipp.reisner@...bit.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume
Am 07.01.25 um 16:21 schrieb Philipp Reisner:
> [...]
>>> The OOPS happens because the rq member of entity is NULL in
>>> drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
>>>
>>> In drm_sched_entity_select_rq(), the code considers that
>>> drb_sched_pick_best() might return a NULL value. When NULL, it assigns
>>> NULL to entity->rq even if it had a non-NULL value before.
>>>
>>> drm_sched_job_arm() does not deal with entities having a rq of NULL.
>>>
>>> Fix this by leaving the entity on the engine it was instead of
>>> assigning a NULL to its run queue member.
>> Well that is clearly not the correct approach to fixing this. So clearly
>> a NAK from my side.
>>
>> The real question is why is amdgpu_cs_ioctl() called when all of
>> userspace should be frozen?
>>
>> Regards,
>> Christian.
>>
> Could the OOPS happen at resume time? Might it be that the kernel
> activates user-space
> before all the components of the GPU finished their wakeup?
>
> Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?
Yeah that is exactly what I meant. It looks like either the suspend or
the resume order is somehow messed up.
In other words either some application tries to submit GPU work while it
should already been stopped, or it tries to submit GPU work before it is
started.
> Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
> suggested change improves the code. drm_sched_pick_best() can return NULL.
> drm_sched_entity_select_rq() can handle the NULL (partially).
>
> drm_sched_job_arm() crashes on an entity that has rq set to NULL.
Which is actually not the worst outcome :)
With your patch applied we don't immediately crash any more in the
submission path, but the whole system could then later deadlock because
the core memory management waits for a GPU submission which never returns.
That is an even worse situation because you then can't pinpoint any more
where that is coming from.
> The handling of NULL values is half-baked.
>
> In my opinion, you should define if drm_sched_pick_best() may put a NULL into
> rq. If your answer is yes, it might put a NULL there; then, there should be a
> BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
> If your answer is no, the BUG_ON() should be in drm_sched_pick_best().
Yeah good point.
We might not want a BUG_ON(), that is only justified when we prevent
further damage (e.g. random data corruption or similar).
I suggest using a WARN(!shed, "Submission without activated sheduler!").
This way the system has at least a chance of survival should the
scheduler become ready later on.
On the other hand the BUG_ON() or the NULL pointer deref should only
kill the application thread which is submitting something before the
driver is resumed. So that might help to pinpoint where the actually
issue is.
Regards,
Christian.
>
> That helps guys with zero domain knowledge, like me, to figure out how
> this is all
> supposed to work.
>
> best regards,
> Philipp
Powered by blists - more mailing lists