Message-ID: <aC8cGPx_m8g2ApcV@pollux>
Date: Thu, 22 May 2025 14:44:08 +0200
From: Danilo Krummrich <dakr@...nel.org>
To: Philipp Stanner <phasta@...nel.org>
Cc: Lyude Paul <lyude@...hat.com>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>,
Matthew Brost <matthew.brost@...el.com>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>,
Thomas Zimmermann <tzimmermann@...e.de>,
Tvrtko Ursulin <tvrtko.ursulin@...lia.com>,
dri-devel@...ts.freedesktop.org, nouveau@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, Philipp Stanner <pstanner@...hat.com>
Subject: Re: [PATCH v3 1/5] drm/sched: Fix teardown leaks with waitqueue
On Thu, May 22, 2025 at 10:27:39AM +0200, Philipp Stanner wrote:
> +/**
> + * drm_sched_submission_and_timeout_stop - stop everything except for free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void
> +drm_sched_submission_and_timeout_stop(struct drm_gpu_scheduler *sched)
> +{
> + WRITE_ONCE(sched->pause_submit, true);
> + cancel_work_sync(&sched->work_run_job);
> + cancel_delayed_work_sync(&sched->work_tdr);
> +}
> +
> +/**
> + * drm_sched_free_stop - stop free_job
> + * @sched: scheduler instance
> + *
> + * Helper for tearing down the scheduler in drm_sched_fini().
> + */
> +static void drm_sched_free_stop(struct drm_gpu_scheduler *sched)
> +{
> + WRITE_ONCE(sched->pause_free, true);
> + cancel_work_sync(&sched->work_free_job);
> +}
> +
> +/**
> + * drm_sched_no_jobs_pending - check whether jobs are pending
> + * @sched: scheduler instance
> + *
> + * Checks if jobs are pending for @sched.
> + *
> + * Return: true if no jobs are pending, false otherwise.
> + */
> +static bool drm_sched_no_jobs_pending(struct drm_gpu_scheduler *sched)
> +{
> + bool empty;
> +
> + spin_lock(&sched->job_list_lock);
> + empty = list_empty(&sched->pending_list);
> + spin_unlock(&sched->job_list_lock);
> +
> + return empty;
> +}
I understand that the way you use this function is correct, since you only call
it *after* drm_sched_submission_and_timeout_stop(), which means that no new
items can end up on the pending_list.
But if we look at this function without context, it's broken:
The documentation says "Return: true if no jobs are pending, false otherwise.",
but you can't guarantee that, since a new job could be added to the pending_list
right after spin_unlock().
Hence, providing this function is a footgun.
Instead, you should put this teardown sequence in a single function, where you
can control the external conditions, i.e. that
drm_sched_submission_and_timeout_stop() has been called.
Please also add a comment explaining why we can release the lock and still rely
on the value returned by list_empty() in this case, i.e. because we guarantee
that the number of entries on the list only ever decreases and converges to zero.
The other two helpers above, drm_sched_submission_and_timeout_stop() and
drm_sched_free_stop(), should be fine to have.
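For instance, something along those lines (untested sketch; drm_sched_teardown()
is a made-up name, and the statement expression is just one way of keeping the
check local to the teardown sequence):

static void drm_sched_teardown(struct drm_gpu_scheduler *sched)
{
	/* From here on, no new jobs can end up on the pending_list. */
	drm_sched_submission_and_timeout_stop(sched);

	/*
	 * Signal all pending fences; this triggers free_job() for every job
	 * still on the pending_list.
	 */
	sched->ops->cancel_pending_fences(sched);

	/*
	 * Submission and timeout work are stopped, hence the number of
	 * entries on the pending_list can only decrease. Therefore it is
	 * fine to drop job_list_lock before evaluating the result of
	 * list_empty(): once the list has been observed empty, it stays
	 * empty.
	 */
	wait_event(sched->pending_list_waitque, ({
		bool empty;

		spin_lock(&sched->job_list_lock);
		empty = list_empty(&sched->pending_list);
		spin_unlock(&sched->job_list_lock);
		empty;
	}));

	drm_sched_free_stop(sched);
}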
> +/**
> + * drm_sched_cancel_jobs_and_wait - trigger freeing of all pending jobs
> + * @sched: scheduler instance
> + *
> + * Must only be called if &struct drm_sched_backend_ops.cancel_pending_fences is
> + * implemented.
> + *
> + * Instructs the driver to kill the fence context associated with this scheduler,
> + * thereby signaling all pending fences. This, in turn, will trigger
> + * &struct drm_sched_backend_ops.free_job to be called for all pending jobs.
> + * The function then blocks until all pending jobs have been freed.
> + */
> +static void drm_sched_cancel_jobs_and_wait(struct drm_gpu_scheduler *sched)
> +{
> + sched->ops->cancel_pending_fences(sched);
> + wait_event(sched->pending_list_waitque, drm_sched_no_jobs_pending(sched));
> +}
Same here, you can't have this as an isolated helper.
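I.e. fold the cancel_pending_fences() call and the wait into the combined
teardown function (as in the sketch above), such that drm_sched_fini() ends up
with a single call for which all preconditions are guaranteed internally,
roughly:

void drm_sched_fini(struct drm_gpu_scheduler *sched)
{
	/*
	 * Stops submission and timeout handling, signals all pending fences
	 * and waits until the pending_list has drained.
	 */
	drm_sched_teardown(sched);

	/* Remaining cleanup as before. */
	...
}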