Message-ID: <20250604132418.6685ff7c@collabora.com>
Date: Wed, 4 Jun 2025 13:24:18 +0200
From: Boris Brezillon <boris.brezillon@...labora.com>
To: Ashley Smith <ashley.smith@...labora.com>
Cc: "Liviu Dudau" <liviu.dudau@....com>, "Steven Price"
<steven.price@....com>, "Maarten Lankhorst"
<maarten.lankhorst@...ux.intel.com>, "Maxime Ripard" <mripard@...nel.org>,
"Thomas Zimmermann" <tzimmermann@...e.de>, "David Airlie"
<airlied@...il.com>, "Simona Vetter" <simona@...ll.ch>, "kernel"
<kernel@...labora.com>, "open list:ARM MALI PANTHOR DRM DRIVER"
<dri-devel@...ts.freedesktop.org>, "open list"
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v5 1/2] drm/panthor: Reset queue slots if termination
fails
On Wed, 04 Jun 2025 11:01:27 +0100
Ashley Smith <ashley.smith@...labora.com> wrote:
> On Tue, 03 Jun 2025 12:09:44 +0100 Liviu Dudau <liviu.dudau@....com>
> wrote:
> > On Tue, Jun 03, 2025 at 10:49:31AM +0100, Ashley Smith wrote:
> > > This fixes a bug where, if we time out after a suspend and the
> > > termination fails (due to waiting on a fence that will never be
> > > signalled, for example), we do not resume the group correctly.
> > > The fix forces a reset for groups that are not terminated
> > > correctly.
> > >
> >
> > I have a question on the commit message: you're describing a
> > situation where a fence will *never* be signalled. Is that a real
> > example? I thought this was never supposed to happen! Or are you
> > trying to say that the fence signalling happens after the timeout?
> >
>
> This covers cases where a fence is never signalled.
A fence not being signaled in time is covered by the job timeout. What
you're fixing here is the FW not responding to a CSG SUSPEND/TERMINATE
request, which is a different problem.
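To make the distinction concrete, here is a compile-and-run sketch of
the failure mode being fixed. It's plain C, and csg_wait_terminated()
and sched_force_reset() are hypothetical stand-ins, not real panthor
symbols; only the overall flow matches what's described above:

/* Illustrative sketch only: every symbol below is a hypothetical
 * stand-in, not a real panthor function. */
#include <stdbool.h>
#include <stdio.h>

/* Models polling the FW for a CSG TERMINATE ack; returns false to
 * simulate a hung FW that never responds. */
static bool csg_wait_terminated(int csg_id, unsigned int timeout_ms)
{
	(void)csg_id;
	(void)timeout_ms;
	return false;
}

static void sched_force_reset(void)
{
	puts("group did not terminate cleanly: forcing a GPU reset");
}

static void terminate_group(int csg_id)
{
	/*
	 * A job fence that never signals is caught earlier, by the
	 * drm_sched job timeout. What matters here is the FW failing
	 * to ack the CSG TERMINATE request itself, in which case the
	 * group must not be resumed as if nothing happened.
	 */
	if (!csg_wait_terminated(csg_id, 100))
		sched_force_reset();
}

int main(void)
{
	terminate_group(0);
	return 0;
}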
> It shouldn't
> happen, but we have seen it in some situations involving a FW hang.
This ^.
> Since queue_suspend_timeout() is only called on state update, if a
> suspend/terminate fails (due to a FW hang, for example), this will
> leave delayed work pending, possibly leading to a spurious
> queue_timeout_work() run.
Nah, that's not true until the second patch in this series is applied,
because drm_sched_stop(), which is called in the
panthor_sched_pre_reset() path, takes care of suspending the drm_sched
timer. What's still problematic if we don't call cs_slot_reset_locked()
here is that the FW CS slots are never re-initialized, and since
cs_slot_prog_locked() is only called on active queues, we might end up
with an unused CS slot that still contains values from its previous
user. Not sure how harmful that is in practice, but it's certainly not
great.
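Here's a self-contained sketch of that stale-slot hazard. The struct,
array, and helpers are hypothetical illustrations, not the actual
panthor_sched.c code; only the function names quoted above come from
the driver:

/* Illustrative sketch only: the struct, array and helpers here are
 * hypothetical stand-ins, not the actual panthor_sched.c code. */
#include <stdbool.h>
#include <string.h>

#define NUM_CS_SLOTS 8

struct cs_slot {
	bool active;
	unsigned long ringbuf_va; /* state left behind by the last user */
};

static struct cs_slot slots[NUM_CS_SLOTS];

static void resume_groups(bool reset_unused_slots)
{
	for (int i = 0; i < NUM_CS_SLOTS; i++) {
		if (slots[i].active) {
			/* Only active queues get (re)programmed,
			 * which is the role cs_slot_prog_locked()
			 * plays in the driver. */
			continue;
		}
		/*
		 * Without an explicit reset, an inactive slot keeps
		 * whatever its previous user left in it; this is what
		 * calling cs_slot_reset_locked() guards against.
		 */
		if (reset_unused_slots)
			memset(&slots[i], 0, sizeof(slots[i]));
	}
}

int main(void)
{
	slots[0].active = true;           /* one live queue */
	slots[1].ringbuf_va = 0xdeadbeef; /* leftover from a dead group */
	resume_groups(true);              /* true == apply the fix */
	return 0;
}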
The real reason this patch carries a Fixes tag is that the second patch
has one too, and it relies on the new driver timer being stopped even
if the FW hangs on a TERMINATE request (queue_suspend_timeout() is
called from cs_slot_reset_locked()). So either we squash the two
patches back together, as was done in v2, or we have to flag both as
Fixes.