Message-ID: <ccbf75a939b90fe9a41734fab038bc35f0963878.camel@redhat.com>
Date: Thu, 05 Sep 2024 09:38:01 +0200
From: Philipp Stanner <pstanner@...hat.com>
To: Luben Tuikov <ltuikov89@...il.com>, Matthew Brost
<matthew.brost@...el.com>, Maarten Lankhorst
<maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org, Danilo
Krummrich <dakr@...hat.com>
Subject: Re: [RFC PATCH] drm/sched: Fix teardown leaks with waitqueue
On Wed, 2024-09-04 at 19:47 +0200, Simona Vetter wrote:
> On Tue, Sep 03, 2024 at 11:44:47AM +0200, Philipp Stanner wrote:
> > The GPU scheduler currently does not ensure that its pending_list is
> > empty before performing various other teardown tasks in
> > drm_sched_fini().
> >
> > If there are still jobs in the pending_list, this is problematic,
> > because after scheduler teardown no one will call
> > backend_ops.free_job() anymore. This would, consequently, result in
> > memory leaks.
> >
> > One way to solve this is to implement a waitqueue that
> > drm_sched_fini() blocks on until the pending_list has become empty.
> >
> > Add a waitqueue to struct drm_gpu_scheduler. Wake up waiters once the
> > pending_list becomes empty. Wait in drm_sched_fini() for that to
> > happen.
> >
> > Suggested-by: Danilo Krummrich <dakr@...hat.com>
> > Signed-off-by: Philipp Stanner <pstanner@...hat.com>
> > ---
> > Hi all,
> >
> > since the scheduler has many stakeholders, I want this solution
> > discussed as an RFC first.
> >
> > This version has IMO the advantage (and disadvantage...) that it
> > blocks infinitely if the driver messed up the cleanup, so problems
> > might become more visible than with the refcount solution I proposed
> > in parallel.
>
> Very quick comment because I'm heading out for the r4l conference, but
> maybe I can discuss this there with Danilo a bit.
>
> Maybe we should do step 0 first, and document the current rules? The
> kerneldoc isn't absolute zero anymore, but it's very, very bare-bones.
> Then get that acked and merged, which is a very good way to make sure
> we're actually standing on common ground.
Yes, documentation is definitely also on my TODO list. I wanted to send
out something clarifying the objects' lifetimes (based on Christian's
previous series [1]) quite soon.
>
> Then maybe step 0.5 would be to add runtime asserts to enforce the
> rules, like if you tear down the scheduler and there's stuff in flight,
> you splat on a WARN_ON.
>
> With that foundation there should be a lot clearer basis to discuss
> whether there is an issue, and what a better design could look like.
I mean, I'm quite sure that there are teardown issues. But we could
indeed make them visible first through documentation (and a FIXME tag)
and, after establishing consensus through merging that, go on as you
suggested.
> The little pondering I've done has turned up a few more ideas along
> similar lines.
>
> One comment below, kinda unrelated.
>
> >
> > Cheers,
> > P.
> > ---
> > drivers/gpu/drm/scheduler/sched_main.c | 40 ++++++++++++++++++++++++++
> > include/drm/gpu_scheduler.h | 4 +++
> > 2 files changed, 44 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> > index 7e90c9f95611..200fa932f289 100644
> > --- a/drivers/gpu/drm/scheduler/sched_main.c
> > +++ b/drivers/gpu/drm/scheduler/sched_main.c
> > @@ -564,6 +564,13 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > * is parked at which point it's safe.
> > */
> > list_del_init(&job->list);
> > +
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> > spin_unlock(&sched->job_list_lock);
> >
> > status = job->sched->ops->timedout_job(job);
> > @@ -584,6 +591,15 @@ static void drm_sched_job_timedout(struct work_struct *work)
> > drm_sched_start_timeout_unlocked(sched);
> > }
> >
> > +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> > +{
> > + /*
> > + * For list_empty() to work without a lock.
> > + */
>
> So this is pretty far from the gold standard for documenting memory
> barrier semantics in lockless code. Ideally we have a comment for
> both
> sides of the barrier (you always need two, or there's no function
> barrier), pointing at each another, and explaining exactly what's
> being
> synchronized against what and how.
>
> I did years ago add a few missing barriers with that approach, see
> b0a5303d4e14 ("drm/sched: Barriers are needed for
> entity->last_scheduled"). Reading that patch again it feels a bit on
> the
> terse side of things (plus I noticed spelling issues now too, oops).
ACK, will do that properly once we roll out the actual patch.
P.
>
> Cheers, Sima
>
> > + rmb();
> > + return list_empty(&sched->pending_list);
> > +}
> > +
> > /**
> > * drm_sched_stop - stop the scheduler
> > *
> > @@ -659,6 +675,12 @@ void drm_sched_stop(struct drm_gpu_scheduler *sched, struct drm_sched_job *bad)
> > }
> > }
> >
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (drm_sched_no_jobs_pending(sched))
> > + wake_up(&sched->job_list_waitque);
> > +
> > /*
> > * Stop pending timer in flight as we rearm it in drm_sched_start. This
> > * avoids the pending timeout work in progress to fire right away after
> > @@ -1085,6 +1107,12 @@ drm_sched_get_finished_job(struct drm_gpu_scheduler *sched)
> > /* remove job from pending_list */
> > list_del_init(&job->list);
> >
> > + /*
> > + * Inform tasks blocking in drm_sched_fini() that it's now safe to proceed.
> > + */
> > + if (list_empty(&sched->pending_list))
> > + wake_up(&sched->job_list_waitque);
> > +
> > /* cancel this job's TO timer */
> > cancel_delayed_work(&sched->work_tdr);
> > /* make the scheduled timestamp more accurate */
> > @@ -1303,6 +1331,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
> > init_waitqueue_head(&sched->job_scheduled);
> > INIT_LIST_HEAD(&sched->pending_list);
> > spin_lock_init(&sched->job_list_lock);
> > + init_waitqueue_head(&sched->job_list_waitque);
> > atomic_set(&sched->credit_count, 0);
> > INIT_DELAYED_WORK(&sched->work_tdr, drm_sched_job_timedout);
> > INIT_WORK(&sched->work_run_job, drm_sched_run_job_work);
> > @@ -1333,12 +1362,23 @@ EXPORT_SYMBOL(drm_sched_init);
> > * @sched: scheduler instance
> > *
> > * Tears down and cleans up the scheduler.
> > + *
> > + * Note that this function blocks until the fences returned by
> > + * backend_ops.run_job() have been signalled.
> > */
> > void drm_sched_fini(struct drm_gpu_scheduler *sched)
> > {
> > struct drm_sched_entity *s_entity;
> > int i;
> >
> > + /*
> > + * Jobs which have never been scheduled or which have timed out are
> > + * gone by now, but jobs that have been submitted through
> > + * backend_ops.run_job() and have not yet terminated are still pending.
> > + *
> > + * Wait for the pending_list to become empty to avoid leaking those jobs.
> > + */
> > + wait_event(sched->job_list_waitque, drm_sched_no_jobs_pending(sched));
> > drm_sched_wqueue_stop(sched);
> >
> > for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
> > diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> > index 5acc64954a88..bff092784405 100644
> > --- a/include/drm/gpu_scheduler.h
> > +++ b/include/drm/gpu_scheduler.h
> > @@ -29,6 +29,7 @@
> > #include <linux/completion.h>
> > #include <linux/xarray.h>
> > #include <linux/workqueue.h>
> > +#include <linux/wait.h>
> >
> > #define MAX_WAIT_SCHED_ENTITY_Q_EMPTY msecs_to_jiffies(1000)
> >
> > @@ -503,6 +504,8 @@ struct drm_sched_backend_ops {
> > * timeout interval is over.
> > * @pending_list: the list of jobs which are currently in the job queue.
> > * @job_list_lock: lock to protect the pending_list.
> > + * @job_list_waitque: a waitqueue for drm_sched_fini() to block on until all
> > + * pending jobs have been finished.
> > * @hang_limit: once the hangs by a job crosses this limit then it is marked
> > * guilty and it will no longer be considered for scheduling.
> > * @score: score to help loadbalancer pick a idle sched
> > @@ -532,6 +535,7 @@ struct drm_gpu_scheduler {
> > struct delayed_work work_tdr;
> > struct list_head pending_list;
> > spinlock_t job_list_lock;
> > + wait_queue_head_t job_list_waitque;
> > int hang_limit;
> > atomic_t *score;
> > atomic_t _score;
> > --
> > 2.46.0
> >
>