Message-ID: <cd7f6684f1d8bfca606c4a6ba75c130d07e3a7fe.camel@mailbox.org>
Date: Wed, 12 Nov 2025 14:31:19 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Tvrtko Ursulin <tursulin@...ulin.net>, phasta@...nel.org, Matthew Brost
<matthew.brost@...el.com>, Christian König
<ckoenig.leichtzumerken@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] drm/sched: Document racy behavior of
drm_sched_entity_push_job()
On Wed, 2025-11-12 at 13:13 +0000, Tvrtko Ursulin wrote:
>
> On 12/11/2025 12:15, Philipp Stanner wrote:
> > On Wed, 2025-11-12 at 09:42 +0000, Tvrtko Ursulin wrote:
> > >
> > > On 12/11/2025 07:31, Philipp Stanner wrote:
> > > > drm_sched_entity_push_job() uses the unlocked spsc_queue. It takes a
> > > > reference to that queue's tip at the start, and some time later removes
> > > > that entry from that list, without locking or protection against
> > > > preemption.
> > >
> > > I couldn't figure out what you refer to by tip reference at the start,
> > > and later removes it. Are you talking about the top level view from
> > > drm_sched_entity_push_job() or where exactly?
> > > >
> > > > This is by design, since the spsc_queue demands single producer and
> > > > single consumer. It was, however, never documented.
> > > >
> > > > Document that you must not call drm_sched_entity_push_job() in parallel
> > > > for the same entity.
> > > >
> > > > Signed-off-by: Philipp Stanner <phasta@...nel.org>
> > > > ---
> > > > drivers/gpu/drm/scheduler/sched_entity.c | 3 +++
> > > > 1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > index 5a4697f636f2..b31e8d14aa20 100644
> > > > --- a/drivers/gpu/drm/scheduler/sched_entity.c
> > > > +++ b/drivers/gpu/drm/scheduler/sched_entity.c
> > > > @@ -562,6 +562,9 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
> > > > * drm_sched_entity_push_job - Submit a job to the entity's job queue
> > > > * @sched_job: job to submit
> > > > *
> > > > + * It is illegal to call this function in parallel, at least for jobs belonging
> > > > + * to the same entity. Doing so leads to undefined behavior.
> > >
> > > One thing that is documented in the very next paragraph is that the
> > > design implies a lock held between arm and push. At least to ensure
> > > seqno order matches the queue order.
> > >
> > > I did not get what other breakage you found, but I also previously did
> > > find something other than that. Hm.. if I could only remember what it
> > > was. Probably mine was something involving drm_sched_entity_select_rq(),
> > > drm_sched_entity_modify_sched() and (theoretical) multi-threaded
> > > userspace submit on the same entity. Luckily it seems no one does that.
> > >
> > > The issue you found is separate and not theoretically fixed by this
> > > hypothetical common lock held over arm and push?
> >
> > Well, if 2 CPUs should ever run in parallel in
> > drm_sched_entity_push_job() the spsc_queue will just explode. Most
> > notably, one CPU could get the job at the tip (the oldest job), then be
> > preempted, and then another CPU takes the same job and pops it.
>
> Ah, you are talking about the dequeue/pop side. First paragraph of the
> commit message can be clarified in that case.
>
> Pop is serialised by the worker so I don't think two simultaneous
> dequeues on the same scheduler are possible. How did you trigger it?
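
Purely schematically, this is what I had in mind on the dequeue side. A
simplified sketch, not the actual spsc_queue code:

    /* Simplified sketch, not the real spsc_queue implementation */
    node = READ_ONCE(queue->head);       /* CPU A reads job X at the head */
                                         /* A gets preempted; CPU B reads */
                                         /* the same head and pops X      */
    WRITE_ONCE(queue->head, node->next); /* A resumes and "pops" X again  */

Two unsynchronised consumers can both observe the same head before either
advances it, so the same job would be handed out twice. As you say above,
the worker serialises the pop side today, so this is about documenting the
contract, not about a bug that actually triggers.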
> > The API contract should be that the user doesn't have to know whether
> > there's a linked list or the magic spsc_queue. Luckily we moved the
> > peek/pop helpers to sched_internal.h.
>
> Btw I thought you gave up on the scheduler and are working on the simple
> rust queue for firmware schedulers so how come you are finding subtle
> bugs in this code?
I'm still a maintainer, for a variety of reasons. The fact that we're
working on something for FW with a clean locking design doesn't mean we
no longer care about the existing infrastructure. And I firmly believe in
documentation. People should know the rules from the API documentation,
not from looking into the implementation.
It's kind of crucial design info that you must only push into a given
entity sequentially, no?
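
For illustration, this is the submission pattern that the existing docs
around arm + push already imply. Names like my_entity_lock and my_job are
made up, and drm_sched_job_init() is assumed to have been called earlier:

    /*
     * Minimal sketch: one driver-side lock held over arm + push, so the
     * fence seqno order matches the queue order and the entity's
     * spsc_queue only ever sees a single producer.
     */
    mutex_lock(&ctx->my_entity_lock);
    drm_sched_job_arm(&my_job->base);
    drm_sched_entity_push_job(&my_job->base);
    mutex_unlock(&ctx->my_entity_lock);

Any driver already holding such a per-entity lock gets the "no parallel
push per entity" rule for free.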
This doesn't fix a bug, obviously, since it's just a line of docu.
Regardless, pushing into the spsc queue in parallel would explode.
Emphasizing that costs us nothing.
P.