lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADGDV=U_7CdkdEiLX9kj9yHsXhwb5zP_eGXpwmrj20cmgzMAtA@mail.gmail.com>
Date: Tue, 7 Jan 2025 16:21:57 +0100
From: Philipp Reisner <philipp.reisner@...bit.com>
To: Christian König <christian.koenig@....com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, 
	Simona Vetter <simona@...ll.ch>
Subject: Re: [PATCH] drm/sched: Fix amdgpu crash upon suspend/resume

[...]
> > The OOPS happens because the rq member of entity is NULL in
> > drm_sched_job_arm() after the call to drm_sched_entity_select_rq().
> >
> > In drm_sched_entity_select_rq(), the code considers that
> > drb_sched_pick_best() might return a NULL value. When NULL, it assigns
> > NULL to entity->rq even if it had a non-NULL value before.
> >
> > drm_sched_job_arm() does not deal with entities having a rq of NULL.
> >
> > Fix this by leaving the entity on the engine it was instead of
> > assigning a NULL to its run queue member.
>
> Well that is clearly not the correct approach to fixing this. So clearly
> a NAK from my side.
>
> The real question is why is amdgpu_cs_ioctl() called when all of
> userspace should be frozen?
>
> Regards,
> Christian.
>

Could the OOPS happen at resume time? Might it be that the kernel
activates user-space
before all the components of the GPU finished their wakeup?

Maybe drm_sched_pick_best() returns NULL since no scheduler is ready yet?

Apart from whether amdgpu_cs_ioctl() should run at this point, I still think the
suggested change improves the code. drm_sched_pick_best() can return NULL.
drm_sched_entity_select_rq() can handle the NULL (partially).

drm_sched_job_arm() crashes on an entity that has rq set to NULL.

The handling of NULL values is half-baked.

In my opinion, you should define if drm_sched_pick_best() may put a NULL into
rq. If your answer is yes, it might put a NULL there; then, there should be a
BUG_ON(!entity->rq) after the invocation of drm_sched_entity_select_rq().
If your answer is no, the BUG_ON() should be in drm_sched_pick_best().

That helps guys with zero domain knowledge, like me, to figure out how
this is all
supposed to work.

best regards,
 Philipp

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ