Message-ID: <9566a923b606e16f7725c79bd8e0a2c7c80daf9e.camel@mailbox.org>
Date: Tue, 22 Apr 2025 14:21:07 +0200
From: Philipp Stanner <phasta@...lbox.org>
To: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>, Danilo Krummrich
<dakr@...nel.org>
Cc: phasta@...nel.org, Lyude Paul <lyude@...hat.com>, David Airlie
<airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Matthew Brost
<matthew.brost@...el.com>, Christian König
<ckoenig.leichtzumerken@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, dri-devel@...ts.freedesktop.org,
nouveau@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
On Tue, 2025-04-22 at 13:07 +0100, Tvrtko Ursulin wrote:
>
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > The question I raised is whether there are other drivers which
> > > manage to clean up everything correctly (like the mock scheduler
> > > does), but trigger that warning. Maybe there are not, and maybe
> > > the mock scheduler is the only false positive.
> >
> > So far the scheduler simply does not give any guideline on how to
> > address the problem, hence every driver simply does something (or
> > nothing, effectively ignoring the problem). This is what we want to
> > fix.
> >
> > The mock scheduler keeps its own list of pending jobs and on
> > teardown stops the scheduler's workqueues, traverses its own list
> > and eventually frees the pending jobs without updating the
> > scheduler's internal pending list.
> >
> > So yes, it does avoid memory leaks, but it also leaves the
> > scheduler's internal structures in an invalid state, i.e. the
> > scheduler's pending list contains pointers to already freed memory.
> >
> > What if drm_sched_fini() starts touching the pending list? Then
> > you'd end up with UAF bugs with this implementation. We cannot
> > invalidate the scheduler's internal structures and yet subsequently
> > call scheduler functions - e.g. drm_sched_fini().
> >
> > Hence, the current implementation of the mock scheduler is
> > fundamentally flawed. And so would be *every* driver that still has
> > entries within the scheduler's pending list.
> >
> > This is not a false positive; it already caught a real bug -- in
> > the mock scheduler.
>
> To avoid further splitting hairs on whether real bugs need to be
> able to manifest or not, let's move past this with the conclusion
> that there are two potential things to do here:
>
> The first one is to either send separately, or include in this
> series, something like:
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  		drm_mock_sched_job_complete(job);
>  	spin_unlock_irqrestore(&sched->lock, flags);
>  
> +	drm_sched_fini(&sched->base);
> +
>  	/*
>  	 * Free completed jobs and jobs not yet processed by the DRM scheduler
>  	 * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  
>  	list_for_each_entry_safe(job, next, &list, link)
>  		mock_sched_free_job(&job->base);
> -
> -	drm_sched_fini(&sched->base);
>  }
>  
>  /**
> 
> That should satisfy the requirement to "clear" memory about to be
> freed and be 100% compliant with the drm_sched_fini() kerneldoc
> (guideline b).
>
> But the new warning from 3/5 here will still be there AFAICT. Would
> you then agree it is a false positive?
>
> The second is that the series should modify all drivers (including
> the unit tests) which are known to trigger this false positive.
There is no false positive, as I just stated in my other answer.
That said, I agree that the unit tests should be a second user of this
code in addition to Nouveau. They should then also provide the
fence_context_kill() callback, though.
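Roughly what I have in mind for the mock scheduler (a sketch only; the
callback's exact signature and the field names here are assumptions,
not the final API of the series):

	static void mock_sched_fence_context_kill(struct drm_gpu_scheduler *gsched)
	{
		struct drm_mock_scheduler *sched =
			container_of(gsched, struct drm_mock_scheduler, base);
		struct drm_mock_sched_job *job;
		unsigned long flags;

		/* Signal all fences of jobs the mock scheduler still tracks. */
		spin_lock_irqsave(&sched->lock, flags);
		list_for_each_entry(job, &sched->job_list, link)
			drm_mock_sched_job_complete(job);
		spin_unlock_irqrestore(&sched->lock, flags);
	}

	static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
		.run_job		= mock_sched_run_job,
		.timedout_job		= mock_sched_timedout_job,
		.free_job		= mock_sched_free_job,
		.fence_context_kill	= mock_sched_fence_context_kill,
	};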
And it seems we also agree that the memory leaks must be solved
centrally in drm_sched_fini().
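I.e. something along these lines in drm_sched_fini() (a sketch of the
direction only, not the actual patch):

	/*
	 * After the workqueues are stopped and all hardware fences have
	 * been signalled (e.g. through the fence_context_kill()
	 * callback), nothing touches the pending list anymore, so the
	 * remaining jobs can be freed through the driver's free_job()
	 * callback instead of being leaked.
	 */
	list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
		list_del_init(&job->list);
		sched->ops->free_job(job);
	}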
P.
>
> Regards,
>
> Tvrtko
>