linux-kernel - Re: [PATCH v4] drm/panthor: Make the timeout per-queue instead of per-job

Message-ID: <20250526091646.7020bcff@collabora.com>
Date: Mon, 26 May 2025 09:16:46 +0200
From: Boris Brezillon <boris.brezillon@...labora.com>
To: Daniel Stone <daniel@...ishbar.org>
Cc: Ashley Smith <ashley.smith@...labora.com>, Steven Price
 <steven.price@....com>, Liviu Dudau <liviu.dudau@....com>, Maarten
 Lankhorst <maarten.lankhorst@...ux.intel.com>, Maxime Ripard
 <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>, David Airlie
 <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, Heiko Stuebner
 <heiko@...ech.de>, kernel@...labora.com, Daniel Stone
 <daniels@...labora.com>, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v4] drm/panthor: Make the timeout per-queue instead of
 per-job

On Sat, 24 May 2025 16:03:37 +0100
Daniel Stone <daniel@...ishbar.org> wrote:

> Hi Ashley,
> 
> On Fri, 23 May 2025 at 16:10, Ashley Smith <ashley.smith@...labora.com> wrote:
> > The timeout logic provided by drm_sched leads to races when we try
> > to suspend it while the drm_sched workqueue queues more jobs. Let's
> > overhaul the timeout handling in panthor to have our own delayed work
> > that's resumed/suspended when a group is resumed/suspended. When an
> > actual timeout occurs, we call drm_sched_fault() to report it
> > through drm_sched, still. But otherwise, the drm_sched timeout is
> > disabled (set to MAX_SCHEDULE_TIMEOUT), which leaves us in control of
> > how we protect modifications on the timer.
> >
> > One issue seems to be when we call drm_sched_suspend_timeout() from
> > both queue_run_job() and tick_work() which could lead to races due to
> > drm_sched_suspend_timeout() not having a lock. Another issue seems to
> > be in queue_run_job() if the group is not scheduled, we suspend the
> > timeout again which undoes what drm_sched_job_begin() did when calling
> > drm_sched_start_timeout(). So the timeout does not reset when a job
> > is finished.
> >
> > Co-developed-by: Boris Brezillon <boris.brezillon@...labora.com>
> > Signed-off-by: Boris Brezillon <boris.brezillon@...labora.com>
> > Tested-by: Daniel Stone <daniels@...labora.com>
> > Fixes: de8548813824 ("drm/panthor: Add the scheduler logical block")  
> 
> Unfortunately I have to revoke my T-b as we're seeing a pile of
> failures in a CI stress test with this, e.g.
> https://gitlab.freedesktop.org/daniels/mesa/-/jobs/77004047

Note that you need [1] too, which I don't see in your tree. Ashley, a
note for next time: when you have dependencies between patches, as is
the case here, it's usually better to post them in the same patchset,
so that:

1. They are applied in the right order.
2. Cherry-pickers/reviewers know that they need to consider both to
   have a working branch.
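
As a rough sketch of what posting dependent patches as one series can
look like with git: the script below builds a throwaway repository with
two dependent commits and exports them as a single numbered series with
a cover letter. The repository, file name, identity, and commit subjects
are all hypothetical placeholders, not the actual panthor patches.

```shell
# Sketch only: throwaway repo, hypothetical names throughout.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example Dev"
git config user.email "dev@example.com"

# Base commit that the series will be generated against.
echo "base" > file.txt
git add file.txt
git commit -q -m "initial state"

# First patch: the prerequisite the second one depends on.
echo "prerequisite" >> file.txt
git commit -q -am "drm/example: prerequisite fix"

# Second patch: only works on top of the prerequisite.
echo "dependent change" >> file.txt
git commit -q -am "drm/example: change that needs the prerequisite"

# Export both commits as one ordered series: 0000 is the cover letter,
# 0001 and 0002 are the patches in the order they must be applied.
git format-patch -q --cover-letter -o outgoing HEAD~2
ls outgoing
```

Because the files are numbered, anyone applying or cherry-picking the
series gets the prerequisite first, and the cover letter is the natural
place to say that both patches are needed.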

Regards,

Boris

[1] https://lkml.org/lkml/2025/5/15/742