Message-ID: <cd7f6684f1d8bfca606c4a6ba75c130d07e3a7fe.camel@mailbox.org>
Date: Wed, 12 Nov 2025 14:31:19 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Tvrtko Ursulin <tursulin@...ulin.net>, phasta@...nel.org, Matthew Brost
<matthew.brost@...el.com>, Christian König
<ckoenig.leichtzumerken@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Document racy behavior of
drm_sched_entity_push_job()
On Wed, 2025-11-12 at 13:13 +0000, Tvrtko Ursulin wrote:
>
> On 12/11/2025 12:15, Philipp Stanner wrote:
> > On Wed, 2025-11-12 at 09:42 +0000, Tvrtko Ursulin wrote:
> > >
> > > On 12/11/2025 07:31, Philipp Stanner wrote:
> > > > drm_sched_entity_push_job() uses the unlocked spsc_queue. It takes a
> > > > reference to that queue's tip at the start, and some time later removes
> > > > that entry from that list, without locking or protection against
> > > > preemption.
> > >
> > > I couldn't figure out what you refer to by tip reference at the start,
> > > and later removes it. Are you talking about the top level view from
> > > drm_sched_entity_push_job() or where exactly?
> > > >
> > > > This is by design, since the spsc_queue demands single producer and
> > > > single consumer. It was, however, never documented.
> > > >
> > > > Document that you must not call drm_sched_entity_push_job() in parallel
> > > > for the same entity.
> > > >
> > > > Signed-off-by: Philipp Stanner <phasta@...nel.org>
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index 5a4697f636f2..b31e8d14aa20 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -562,6 +562,9 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > > > * drm_sched_entity_push_job - Submit a job to the entity's job queue
> > > > * @sched_job: job to submit
> > > > *
> > > > + * It is illegal to call this function in parallel, at least for jobs belonging
> > > > + * to the same entity. Doing so leads to undefined behavior.
> > >
> > > One thing that is documented in the very next paragraph is that the
> > > design implies a lock held between arm and push. At least to ensure
> > > seqno order matches the queue order.
> > >
> > > I did not get what other breakage you found, but I also previously did
> > > find something other than that. Hm.. if I could only remember what it
> > > was. Probably mine was something involving drm_sched_entity_select_rq(),
> > > drm_sched_entity_modify_sched() and (theoretical) multi-threaded
> > > userspace submit on the same entity. Luckily it seems no one does that.
> > >
> > > The issue you found is separate and not theoretically fixed by this
> > > hypothetical common lock held over arm and push?
> >
> > Well, if 2 CPUs should ever run in parallel in
> > drm_sched_entity_push_job() the spsc_queue will just explode. Most
> > notably, one CPU could get the job at the tip (the oldest job), then be
> > preempted, and then another CPU takes the same job and pops it.
>
> Ah, you are talking about the dequeue/pop side. First paragraph of the
> commit message can be clarified in that case.
>
> Pop is serialised by the worker so I don't think two simultaneous
> dequeues on the same scheduler are possible. How did you trigger it?
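
Purely schematically, this is what I had in mind on the dequeue side. A
simplified sketch, not the actual spsc_queue code:

    /* Simplified sketch, not the real spsc_queue implementation */
    node = READ_ONCE(queue->head);       /* CPU A reads job X at the head */
                                         /* A gets preempted; CPU B reads */
                                         /* the same head and pops X      */
    WRITE_ONCE(queue->head, node->next); /* A resumes and "pops" X again  */

Two unsynchronised consumers can both observe the same head before either
advances it, so the same job would be handed out twice. As you say above,
the worker serialises the pop side today, so this is about documenting the
contract, not about a bug that actually triggers.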
> > The API contract should be that the user doesn't have to know whether
> > there's a linked list or the magic spsc_queue. Luckily we moved the
> > peek/pop helpers to sched_internal.h.
>
> Btw I thought you gave up on the scheduler and are working on the simple
> rust queue for firmware schedulers so how come you are finding subtle
> bugs in this code?
I'm still a maintainer, for a variety of reasons. The fact that we're
working on something for FW with a clean locking design doesn't mean we
no longer care about the existing infrastructure. And I firmly believe in
documentation. People should know the rules from the API documentation,
not from looking into the implementation.
It's kind of crucial design info that you must only push into a given
entity sequentially, no?
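
For illustration, this is the submission pattern that the existing docs
around arm + push already imply. Names like my_entity_lock and my_job are
made up, and drm_sched_job_init() is assumed to have been called earlier:

    /*
     * Minimal sketch: one driver-side lock held over arm + push, so the
     * fence seqno order matches the queue order and the entity's
     * spsc_queue only ever sees a single producer.
     */
    mutex_lock(&ctx->my_entity_lock);
    drm_sched_job_arm(&my_job->base);
    drm_sched_entity_push_job(&my_job->base);
    mutex_unlock(&ctx->my_entity_lock);

Any driver already holding such a per-entity lock gets the "no parallel
push per entity" rule for free.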
This doesn't fix a bug, obviously, since it's just a line of docu.
Regardless, pushing into the spsc queue in parallel would explode.
Emphasizing that costs us nothing.
P.