Message-ID: <9566a923b606e16f7725c79bd8e0a2c7c80daf9e.camel@mailbox.org>
Date: Tue, 22 Apr 2025 14:21:07 +0200
From: Philipp Stanner <phasta@...lbox.org>
To: Tvrtko Ursulin <tvrtko.ursulin@...lia.com>, Danilo Krummrich
<dakr@...nel.org>
Cc: phasta@...nel.org, Lyude Paul <lyude@...hat.com>, David Airlie
<airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Matthew Brost
<matthew.brost@...el.com>, Christian König
<ckoenig.leichtzumerken@...il.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, dri-devel@...ts.freedesktop.org,
nouveau@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/5] drm/sched: Warn if pending list is not empty
On Tue, 2025-04-22 at 13:07 +0100, Tvrtko Ursulin wrote:
>
> On 22/04/2025 12:13, Danilo Krummrich wrote:
> > On Tue, Apr 22, 2025 at 11:39:11AM +0100, Tvrtko Ursulin wrote:
> > > The question I raised is whether there are other drivers which
> > > manage to clean up everything correctly (like the mock scheduler
> > > does), but trigger that warning. Maybe there are not, and maybe
> > > the mock scheduler is the only false positive.
> >
> > So far the scheduler simply does not give any guideline on how to
> > address the problem, hence every driver simply does something (or
> > nothing, effectively ignoring the problem). This is what we want to
> > fix.
> >
> > The mock scheduler keeps its own list of pending jobs and on
> > teardown stops the scheduler's workqueues, traverses its own list
> > and eventually frees the pending jobs without updating the
> > scheduler's internal pending list.
> >
> > So yes, it does avoid memory leaks, but it also leaves the
> > scheduler's internal structures in an invalid state, i.e. the
> > scheduler's pending list contains pointers to already freed memory.
> >
> > What if drm_sched_fini() starts touching the pending list? Then
> > you'd end up with UAF bugs with this implementation. We cannot
> > invalidate the scheduler's internal structures and yet subsequently
> > call scheduler functions - e.g. drm_sched_fini().
> >
> > Hence, the current implementation of the mock scheduler is
> > fundamentally flawed. And so would be *every* driver that still has
> > entries within the scheduler's pending list.
> >
> > This is not a false positive; it already caught a real bug -- in
> > the mock scheduler.
>
> To avoid further splitting hairs on whether real bugs need to be
> able to manifest or not, let's move past this with the conclusion
> that there are two potential things to do here:
>
> The first one is to either send separately, or include in this
> series, something like:
>
> diff --git a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> index f999c8859cf7..7c4df0e890ac 100644
> --- a/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> +++ b/drivers/gpu/drm/scheduler/tests/mock_scheduler.c
> @@ -300,6 +300,8 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  		drm_mock_sched_job_complete(job);
>  	spin_unlock_irqrestore(&sched->lock, flags);
>  
> +	drm_sched_fini(&sched->base);
> +
>  	/*
>  	 * Free completed jobs and jobs not yet processed by the DRM scheduler
>  	 * free worker.
> @@ -311,8 +313,6 @@ void drm_mock_sched_fini(struct drm_mock_scheduler *sched)
>  
>  	list_for_each_entry_safe(job, next, &list, link)
>  		mock_sched_free_job(&job->base);
> -
> -	drm_sched_fini(&sched->base);
>  }
>  
>  /**
> 
> That should satisfy the requirement to "clear" memory about to be
> freed and be 100% compliant with the drm_sched_fini() kerneldoc
> (guideline b).
>
> But the new warning from 3/5 here will still be there AFAICT. Would
> you then agree it is a false positive?
>
> The second is that the series should modify all drivers (including
> the unit tests) which are known to trigger this false positive.
There is no false positive, as I just stated in my other answer.
That said, I agree that the unit tests should be a second user of this
code in addition to Nouveau. They should then also provide the
fence_context_kill() callback, though.
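Roughly what I have in mind for the mock scheduler (a sketch only; the
callback's exact signature and the field names here are assumptions,
not the final API of the series):

	static void mock_sched_fence_context_kill(struct drm_gpu_scheduler *gsched)
	{
		struct drm_mock_scheduler *sched =
			container_of(gsched, struct drm_mock_scheduler, base);
		struct drm_mock_sched_job *job;
		unsigned long flags;

		/* Signal all fences of jobs the mock scheduler still tracks. */
		spin_lock_irqsave(&sched->lock, flags);
		list_for_each_entry(job, &sched->job_list, link)
			drm_mock_sched_job_complete(job);
		spin_unlock_irqrestore(&sched->lock, flags);
	}

	static const struct drm_sched_backend_ops drm_mock_scheduler_ops = {
		.run_job		= mock_sched_run_job,
		.timedout_job		= mock_sched_timedout_job,
		.free_job		= mock_sched_free_job,
		.fence_context_kill	= mock_sched_fence_context_kill,
	};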
And it seems we also agree that the memory leaks must be solved
centrally in drm_sched_fini().
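I.e. something along these lines in drm_sched_fini() (a sketch of the
direction only, not the actual patch):

	/*
	 * After the workqueues are stopped and all hardware fences have
	 * been signalled (e.g. through the fence_context_kill()
	 * callback), nothing touches the pending list anymore, so the
	 * remaining jobs can be freed through the driver's free_job()
	 * callback instead of being leaked.
	 */
	list_for_each_entry_safe(job, tmp, &sched->pending_list, list) {
		list_del_init(&job->list);
		sched->ops->free_job(job);
	}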
P.
>
> Regards,
>
> Tvrtko
>