Message-ID: <7404e73561283f806456d1ecab92fbc17dfdd57a.camel@mailbox.org>
Date: Thu, 14 Aug 2025 13:49:50 +0200
From: Philipp Stanner <phasta@...lbox.org>
To: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>, phasta@...nel.org, James
Flowers <bold.zone2373@...tmail.com>, matthew.brost@...el.com,
dakr@...nel.org, ckoenig.leichtzumerken@...il.com,
maarten.lankhorst@...ux.intel.com, mripard@...nel.org,
tzimmermann@...e.de, airlied@...il.com, simona@...ll.ch,
skhan@...uxfoundation.org
Cc: dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org,
linux-kernel-mentees@...ts.linux.dev
Subject: Re: [PATCH] drm/sched: Prevent stopped entities from being added to
the run queue.
On Thu, 2025-08-14 at 12:45 +0100, Tvrtko Ursulin wrote:
>
> On 14/08/2025 11:42, Tvrtko Ursulin wrote:
> >
> > On 21/07/2025 08:52, Philipp Stanner wrote:
> > > +Cc Tvrtko, who's currently reworking FIFO and RR.
> > >
> > > On Sun, 2025-07-20 at 16:56 -0700, James Flowers wrote:
> > > > Fixes an issue where entities are added to the run queue in
> > > > drm_sched_rq_update_fifo_locked after being killed, causing a
> > > > slab-use-after-free error.
> > > >
> > > > Signed-off-by: James Flowers <bold.zone2373@...tmail.com>
> > > > ---
> > > > This issue was detected by syzkaller running on a Steam Deck OLED.
> > > > Unfortunately I don't have a reproducer for it. I've
> > >
> > > Well, now that's kind of an issue – if you don't have a reproducer, how
> > > can you know that your patch is correct? How can we?
> > >
> > > It would certainly be good to know what the fuzz testing framework
> > > does.
> > >
> > > > included the KASAN reports below:
> > >
> > >
> > > Anyways, KASAN reports look interesting. But those might be many
> > > different issues. Again, would be good to know what the fuzzer has been
> > > testing. Can you maybe split this fuzz test into sub-tests? I suspect
> > > those might be different faults.
> > >
> > >
> > > Anyways, taking a first look…
> > >
> > >
[SNIP]
> > > >
> > > > ==================================================================
> > >
> > > If this is a race, then this patch here is broken, too, because you're
> > > checking the 'stopped' boolean just as the callers of that function do
> > > – only later. :O
> > >
> > > Could still race, just less likely.
> > >
> > > The proper way to fix it would then be to address the issue where the
> > > locking is supposed to happen. Let's look at, for example,
> > > drm_sched_entity_push_job():
> > >
> > >
> > > void drm_sched_entity_push_job(struct drm_sched_job *sched_job)
> > > {
> > >         (Bla bla bla)
> > >
> > >         …………
> > >
> > >         /* first job wakes up scheduler */
> > >         if (first) {
> > >                 struct drm_gpu_scheduler *sched;
> > >                 struct drm_sched_rq *rq;
> > >
> > >                 /* Add the entity to the run queue */
> > >                 spin_lock(&entity->lock);
> > >                 if (entity->stopped) {                  <---- Aha!
> > >                         spin_unlock(&entity->lock);
> > >
> > >                         DRM_ERROR("Trying to push to a killed entity\n");
> > >                         return;
> > >                 }
> > >
> > >                 rq = entity->rq;
> > >                 sched = rq->sched;
> > >
> > >                 spin_lock(&rq->lock);
> > >                 drm_sched_rq_add_entity(rq, entity);
> > >
> > >                 if (drm_sched_policy == DRM_SCHED_POLICY_FIFO)
> > >                         drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);   <---- bumm!
> > >
> > >                 spin_unlock(&rq->lock);
> > >                 spin_unlock(&entity->lock);
> > >
> > > But the locks are still being held. So that "shouldn't be happening"(tm).
> > >
> > > Interesting. AFAICS only drm_sched_entity_kill() and drm_sched_fini()
> > > stop entities. The former holds appropriate locks, but drm_sched_fini()
> > > doesn't. So that looks like a hot candidate to me. Opinions?
> > >
> > > On the other hand, aren't drivers prohibited from calling
> > > drm_sched_entity_push_job() after calling drm_sched_fini()? If the
> > > fuzzer does that, then it's not the scheduler's fault.
> > >
> > > Could you test adding spin_lock(&entity->lock) to drm_sched_fini()?
> > >
> > > Would be cool if Tvrtko and Christian take a look. Maybe we even have a
> > > fundamental design issue.
> >
> > It would be nice to have a reproducer and from this thread I did not
> > manage to figure out if the syzkaller snippet James posted was it, or
> > not quite it.
> >
> > In either case, I think one race I see relates to the early-exit
> > !entity->rq check before setting entity->stopped in drm_sched_entity_kill().
> >
> > If the entity was not submitted at all yet (at the time of process
> > exit / entity kill), entity->stopped will therefore not be set. A
> > parallel job submit can then re-add the entity to the tree, as process
> > exit / file close / entity kill is continuing and is about to kfree the
> > entity (in the case of the amdgpu report there are two entities embedded in
> > file_priv).
> >
> > One way to make this more robust is to make the entity->rq check in
> > drm_sched_entity_kill() stronger. Or actually to remove it altogether.
> > But I think it also requires checking for entity->stopped in
> > drm_sched_entity_select_rq() and propagating the error code all the way
> > out from drm_sched_job_arm().
> >
> > That way entity->stopped is properly serialized and acted upon early
> > enough to avoid dereferencing a freed entity and to avoid creating jobs
> > not attached to anything (rather than only getting a warning from push job).
> >
> > Disclaimer: I haven't tried to experiment with this yet, so I may be
> > missing something. At least writing a reproducer for the race I
> > described sounds easy, so unless someone shouts that I am talking
> > nonsense I can do that and also sketch out a fix. *If* the theory holds
> > water after I write the test case.
>
> Nah, I was talking nonsense. I forgot that entity->rq is assigned at entity
> init and jobs cannot be created unless it is set.
>
> Okay, I have no theories as to what bug syzkaller found.
I was just about to answer.

I agree that the rq check should be fine.

As you can see in the thread, I suspect that this is a race between
drm_sched_entity_push_job() and drm_sched_fini().
See here:
https://lore.kernel.org/dri-devel/20250813085654.102504-2-phasta@kernel.org/
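
Roughly, the interleaving would be something like this (a simplified
sketch, not the exact upstream code; it just mirrors the push_job snippet
quoted above and assumes drm_sched_fini() marks entities as stopped
without taking entity->lock):

        /* CPU 0, drm_sched_entity_push_job(): */
        spin_lock(&entity->lock);
        if (entity->stopped)            /* (1) false, nobody stopped it yet */
                return;

        /* CPU 1, drm_sched_fini(): (2) sets entity->stopped = true without
         * holding entity->lock and proceeds to tear down the scheduler,
         * freeing its run queues.
         */

        /* CPU 0, continued: (3) re-adds the entity and dereferences rq /
         * scheduler memory which drm_sched_fini() is freeing or has
         * already freed.
         */
        spin_lock(&rq->lock);
        drm_sched_rq_add_entity(rq, entity);
        drm_sched_rq_update_fifo_locked(entity, rq, submit_ts);
        spin_unlock(&rq->lock);
        spin_unlock(&entity->lock);
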
I think as long as there's no reproducer there is not much we can do
here. A long-term goal, though, is to enforce the lifetime rules:
entities must be torn down before their scheduler. Checking this for
all drivers will be quite some work, though.
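
In practice that rule would boil down to something like this in a driver's
teardown path (hypothetical foo_* driver, purely to illustrate the
ordering):

        static void foo_device_fini(struct foo_device *fdev)
        {
                /* Tear down all entities feeding the scheduler first... */
                drm_sched_entity_destroy(&fdev->entity);

                /* ...and only then the scheduler they were attached to. */
                drm_sched_fini(&fdev->sched);
        }
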
P.
>
> Regards,
>
> Tvrtko
>
> >
> > >
> > > > }
> > > > /**
> > >
> >
>