Message-ID: <18bd737f07773edbf56ee011cd76f953290d1188.camel@mailbox.org>
Date: Wed, 11 Feb 2026 09:16:37 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Christian König <christian.koenig@....com>, Alice
Ryhl <aliceryhl@...gle.com>
Cc: Boris Brezillon <boris.brezillon@...labora.com>, phasta@...nel.org,
Danilo Krummrich <dakr@...nel.org>, David Airlie <airlied@...il.com>,
Simona Vetter <simona@...ll.ch>, Gary Guo <gary@...yguo.net>, Benno Lossin
<lossin@...nel.org>, Daniel Almeida <daniel.almeida@...labora.com>, Joel
Fernandes <joelagnelf@...dia.com>, linux-kernel@...r.kernel.org,
dri-devel@...ts.freedesktop.org, rust-for-linux@...r.kernel.org
Subject: Re: [RFC PATCH 2/4] rust: sync: Add dma_fence abstractions
On Tue, 2026-02-10 at 16:45 +0100, Christian König wrote:
> On 2/10/26 16:07, Alice Ryhl wrote:
> > >
[…]
> > > That doesn't happen in practice.
> > >
> > > For each fence you only have one signaling path you need to guarantee
> > > forward progress for.
> > >
> > > All other signaling paths are just opportunistic optimizations
> > > which *can* signal the fence, but there is no guarantee that they
> > > will.
> > >
> > > We used to have some exceptions to that, especially around aborting
> > > submissions, but those turned out to be a really bad idea as well.
> > >
> > > Thinking more about it, you should probably enforce that there is
> > > only one signaling path per fence.
> >
> > I'm not really convinced by this.
> >
> > First, the timeout path must be a fence signalling path: the reason
> > you have a timeout in the first place is that the hw might never
> > signal the fence. So if the timeout path deadlocks on a
> > kmalloc(GFP_KERNEL) and the hw never comes around to wake you up, boom.
>
> Mhm, good point. On the other hand, the timeout handling should probably be considered part of the normal signaling path.
>
> In other words, the timeout handler either disables the normal signaling path (e.g. by disabling the interrupt) and then resets the HW, or it tells the HW to force-signal some work and observes the result.
>
> So it can be that the timeout handler finishes only after the fence is signaled from the normal signaling paths.
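
(In C-driver terms, I read that pattern as roughly the following sketch;
the my_ring_* helpers are made up, only the pattern matters:)

#include <linux/interrupt.h>

/* Hypothetical timeout handler following the scheme described above. */
static void my_timeout_handler(struct my_ring *ring)
{
        /* Cut off the normal signaling path first ... */
        disable_irq(ring->irq);

        /* ... a job may have raced to completion while the timeout fired. */
        if (my_ring_seqno_passed(ring, ring->pending_seqno))
                my_ring_signal_completed(ring); /* normal path wins */
        else
                my_ring_reset(ring);            /* timeout path takes over */

        enable_irq(ring->irq);
}
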
I would say that, since we are designing all this (for now) for modern
and future hardware, timeout handling for GPUs can be considered
trivial?
A timeout event, as far as JobQueue is concerned, is a mere instruction
to drop the entire queue and close the ring. Further signaling should
either not occur at all anymore (because the ring is blocked by a
broken shader), or, if a racy job still finishes while the timeout is
firing, the ring shall still be terminated. That would then result in
the last blocking job being completed for userspace, and the subsequent
ones being signalled with -ECANCELED.
In a timeout handler, a driver would just drop its jobqueue, resulting
in all access being revoked and the JQ deregistering its events from
all fences. Deadlocks are accounted for by RCU.
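
Sketched in C (the my_* names are hypothetical; dma_fence_set_error()
and dma_fence_signal() are the real APIs for the -ECANCELED part):

#include <linux/dma-fence.h>
#include <linux/list.h>

/* Hypothetical timeout handler that simply drops the whole queue. */
static void my_queue_drop(struct my_queue *q)
{
        struct my_job *job;

        my_ring_stop(q->ring);  /* close the ring, revoke all access */

        /*
         * A job that already completed got signalled via the normal
         * path; everything still pending goes out as -ECANCELED.
         */
        list_for_each_entry(job, &q->pending, node) {
                dma_fence_set_error(job->done_fence, -ECANCELED);
                dma_fence_signal(job->done_fence);
        }
}
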
So no problem here, or am I missing something?
>
> > Second, for the reasons I mentioned you also want the signal-from-irq
> > path to be a fence signalling critical path, because if we allow you to
> > kmalloc(GFP_KERNEL) on the path from getting notification from hardware
> > to signalling the fence, then you may deadlock until the timeout
> > triggers ... even if the deadlock is only temporary, we should still
> > avoid such cases IMO. Thus, the hw signal path should also be a fence
> > signalling critical path.
>
> As far as I remember we didn't have any such cases.
>
> You can't call kmalloc(GFP_KERNEL) from an interrupt handler, so you would need something like irq->work item->kmalloc(GFP_KERNEL)->signaling and I think that's unlikely to be implemented this way.
>
> But yeah, it is still something which should be prevented somehow.
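
On the C side, lockdep can already catch exactly that pattern once the
path is annotated as a fence signalling critical section. A minimal
sketch (the work item body is made up; the annotation API is the real
one from dma-fence.c):

#include <linux/dma-fence.h>

/* Hypothetical work item running between the irq and fence signaling. */
static void my_signal_work(struct my_job *job)
{
        bool cookie = dma_fence_begin_signalling();

        /*
         * Fence signalling critical section: a kmalloc(..., GFP_KERNEL)
         * in here triggers a lockdep splat (with CONFIG_PROVE_LOCKING);
         * GFP_ATOMIC or preallocation is the way out.
         */
        dma_fence_signal(job->done_fence);

        dma_fence_end_signalling(cookie);
}
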
Just as a side note, we want to ask ourselves what kinds of potential
problems we want to make impossible. Covering 100% might get really
work-intensive. I'm in general a fan of the 80/20 rule, so I'd like to
know what the most severe and most common misuses of dma_fences are.
P.