linux-kernel - Re: [PATCH v4] drm/panthor: Make the timeout per-queue instead of per-job

Message-ID: <CAPj87rOw2UrabPVHBw0ymJEV3LZ29vzL5KK9T2K0znoEyDYeaw@mail.gmail.com>
Date: Sat, 24 May 2025 16:03:37 +0100
From: Daniel Stone <daniel@...ishbar.org>
To: Ashley Smith <ashley.smith@...labora.com>
Cc: Boris Brezillon <boris.brezillon@...labora.com>, Steven Price <steven.price@....com>, 
	Liviu Dudau <liviu.dudau@....com>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>, 
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, 
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Heiko Stuebner <heiko@...ech.de>, 
	kernel@...labora.com, Daniel Stone <daniels@...labora.com>, 
	dri-devel@...ts.freedesktop.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] drm/panthor: Make the timeout per-queue instead of per-job

Hi Ashley,

On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.smith@...labora.com> wrote:
> The timeout logic provided by drm_sched leads to races when we try
> to suspend it while the drm_sched workqueue queues more jobs. Let's
> overhaul the timeout handling in panthor to have our own delayed work
> that's resumed/suspended when a group is resumed/suspended. When an
> actual timeout occurs, we call drm_sched_fault() to report it
> through drm_sched, still. But otherwise, the drm_sched timeout is
> disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> how we protect modifications on the timer.
>
> One issue seems to be when we call drm_sched_suspend_timeout() from
> both queue_run_job() and tick_work() which could lead to races due to
> drm_sched_suspend_timeout() not having a lock. Another issue seems to
> be in queue_run_job() if the group is not scheduled, we suspend the
> timeout again which undoes what drm_sched_job_begin() did when calling
> drm_sched_start_timeout(). So the timeout does not reset when a job
> is finished.
>
> Co-developed-by: Boris Brezillon <boris.brezillon@...labora.com>
> Signed-off-by: Boris Brezillon <boris.brezillon@...labora.com>
> Tested-by: Daniel Stone <daniels@...labora.com>
> Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")

Unfortunately I have to revoke my T-b as we're seeing a pile of
failures in a CI stress test with this, e.g.
https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Cheers,
Daniel