Message-ID: <e9c02871-fa80-46c7-8b96-bad3a6a2c5b9@ursulin.net>
Date: Wed, 12 Nov 2025 09:42:13 +0000
From: Tvrtko Ursulin <tursulin@...ulin.net>
To: Philipp Stanner <phasta@...nel.org>,
 Matthew Brost <matthew.brost@...el.com>,
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Document racy behavior of
 drm_sched_entity_push_job()


On 12/11/2025 07:31, Philipp Stanner wrote:
> drm_sched_entity_push_job() uses the unlocked spsc_queue. It takes a
> reference to that queue's tip at the start, and some time later removes
> that entry from that list, without locking or protection against
> preemption.

I couldn't figure out what you are referring to with taking a reference
to the queue's tip at the start and removing that entry later. Are you
talking about the top-level view from drm_sched_entity_push_job(), or
where exactly?
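
For reference, this is roughly how I read the single-producer /
single-consumer split of the spsc_queue, just to pin down which side the
commit message means. It is only a sketch of my mental model, not actual
scheduler code; producer_side()/consumer_side() are made-up names.

#include <drm/spsc_queue.h>

static struct spsc_queue q;

static void setup(void)
{
	spsc_queue_init(&q);
}

static void producer_side(struct spsc_node *node)
{
	/* the push_job path appends at the tail, lock-free */
	spsc_queue_push(&q, node);
}

static void consumer_side(void)
{
	/* only the single consumer looks at and removes the head */
	struct spsc_node *node = spsc_queue_peek(&q);

	if (node)
		spsc_queue_pop(&q);
}
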
> This is by design, since the spsc_queue demands single producer and
> single consumer. It was, however, never documented.
> 
> Document that you must not call drm_sched_entity_push_job() in parallel
> for the same entity.
> 
> Signed-off-by: Philipp Stanner <phasta@...nel.org>
> ---
>   drivers/gpu/drm/scheduler/sched_entity.c | 3 +++
>   1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> index 5a4697f636f2..b31e8d14aa20 100644
> --- a/drivers/gpu/drm/scheduler/sched_entity.c
> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> @@ -562,6 +562,9 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
>    * drm_sched_entity_push_job - Submit a job to the entity's job queue
>    * @sched_job: job to submit
>    *
> + * It is illegal to call this function in parallel, at least for jobs belonging
> + * to the same entity. Doing so leads to undefined behavior.

One thing that is documented in the very next paragraph is that the 
design implies a lock held between arm and push, at least to ensure the 
fence seqno order matches the queue order.
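
To spell out the pattern that kerneldoc paragraph implies, a driver's
submit path would look roughly like this. The my_ctx/my_job/submit_lock
names are hypothetical, a sketch of the locking rule rather than code
from any real driver:

#include <linux/mutex.h>
#include <drm/gpu_scheduler.h>

struct my_ctx {
	struct mutex submit_lock;	/* the "common lock" for this entity */
	struct drm_sched_entity entity;
};

struct my_job {
	struct drm_sched_job base;
};

static void my_submit(struct my_ctx *ctx, struct my_job *job)
{
	mutex_lock(&ctx->submit_lock);
	drm_sched_job_arm(&job->base);		/* assigns the fence seqno */
	drm_sched_entity_push_job(&job->base);	/* single producer per entity */
	mutex_unlock(&ctx->submit_lock);
}
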

I did not get what other breakage you found, but I also previously 
found something other than that. Hm... if only I could remember what it 
was. It probably involved drm_sched_entity_select_rq(), 
drm_sched_entity_modify_sched() and (theoretical) multi-threaded 
userspace submit on the same entity. Luckily it seems no one does that.

Is the issue you found separate, i.e. not even theoretically fixed by 
this hypothetical common lock held over arm and push?

Regards,

Tvrtko

> + *
>    * Note: To guarantee that the order of insertion to queue matches the job's
>    * fence sequence number this function should be called with drm_sched_job_arm()
>    * under common lock for the struct drm_sched_entity that was set up for

