[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <2067093.PIDvDuAF1L@jkrzyszt-mobl2.ger.corp.intel.com>
Date: Tue, 19 Aug 2025 11:04:11 +0200
From: Janusz Krzysztofik <janusz.krzysztofik@...ux.intel.com>
To: Christian König <christian.koenig@....com>
Cc: Sumit Semwal <sumit.semwal@...aro.org>,
Gustavo Padovan <gustavo@...ovan.org>,
Chris Wilson <chris.p.wilson@...ux.intel.com>, linux-media@...r.kernel.org,
dri-devel@...ts.freedesktop.org, linaro-mm-sig@...ts.linaro.org,
linux-kernel@...r.kernel.org, intel-gfx@...ts.freedesktop.org,
intel-xe@...ts.freedesktop.org,
Janusz Krzysztofik <janusz.krzysztofik@...ux.intel.com>
Subject:
Re: [PATCH 4/4] dma-buf/fence-chain: Speed up processing of rearmed callbacks
On Monday, 18 August 2025 16:42:56 CEST Christian König wrote:
> On 18.08.25 16:30, Janusz Krzysztofik wrote:
> > Hi Christian,
> >
> > On Thursday, 14 August 2025 14:24:29 CEST Christian König wrote:
> >>
> >> On 14.08.25 10:16, Janusz Krzysztofik wrote:
> >>> When first user starts waiting on a not yet signaled fence of a chain
> >>> link, a dma_fence_chain callback is added to a user fence of that link.
> >>> When the user fence of that chain link is then signaled, the chain is
> >>> traversed in search for a first not signaled link and the callback is
> >>> rearmed on a user fence of that link.
> >>>
> >>> Since chain fences may be exposed to user space, e.g. over drm_syncobj
> >>> IOCTLs, users may start waiting on any link of the chain, then many links
> >>> of a chain may have signaling enabled and their callbacks added to their
> >>> user fences. Once an arbitrary user fence is signaled, all
> >>> dma_fence_chain callbacks added to it so far must be rearmed to another
> >>> user fence of the chain. In extreme scenarios, when all N links of a
> >>> chain are awaited and then signaled in reverse order, the dma_fence_chain
> >>> callback may be called up to N * (N + 1) / 2 times (an arithmetic series).
> >>>
> >>> To avoid that potential excessive accumulation of dma_fence_chain
> >>> callbacks, rearm a trimmed-down, signal only callback version to the base
> >>> fence of a previous link, if not yet signaled, otherwise just signal the
> >>> base fence of the current link instead of traversing the chain in search
> >>> for a first not signaled link and moving all callbacks collected so far to
> >>> a user fence of that link.
> >>
> >> Well clear NAK to that! You can easily overflow the kernel stack with that!
> >
> > I'll be happy to propose a better solution, but for that I need to understand
> > better your message. Could you please point out an exact piece of the
> > proposed code and/or describe a scenario where you can see the risk of stack
> > overflow?
>
> The sentence "rearm .. to the base fence of a previous link" sounds like you are trying to install a callback on the signaling to the previous chain element.
>
> That is exactly what I pointed out previously where you need to be super careful because when this chain signals the callbacks will execute recursively which means that you can trivially overflow the kernel stack if you have more than a handful of chain elements.
>
> In other words A waits for B, B waits for C, C waits for D etc.... when D finally signals it will call C which in turn calls B which in turn calls A.
OK, maybe my commit description was not precise enough, however, I didn't
describe implementation details (how) intentionally.
When D signals then it doesn't call C directly, only it submits an irq work
that calls C. Then C doesn't just call B, only it submits another irq work
that calls B, and so on.
Doesn't that code pattern effectively break the recursion loop into separate
work items, each with its own separate stack?
>
> Even if the chain is a recursive data structure you absolutely can't use recursion for the handling of it.
>
> Maybe I misunderstood your textual description but reading a sentence like this rings all alarm bells here. Otherwise I can't see what the patch is supposed to be optimizing.
OK, maybe I should start my commit description of this patch with a copy of
the first sentence from cover letter and also from patch 1/4 description that
informs about the problem as reported by CI. Maybe I should also provide a
comparison of measured signaling times from trybot executions [1][2][3].
Here are example numbers from CI machine fi-bsw-n3050:
With signaling time reports only added to selftests (patch 1 of 4):
<6> [777.914451] dma-buf: Running dma_fence_chain/wait_forward
<6> [778.123516] wait_forward: 4096 signals in 21373487 ns
<6> [778.335709] dma-buf: Running dma_fence_chain/wait_backward
<6> [795.791546] wait_backward: 4096 signals in 17249051192 ns
<6> [795.859699] dma-buf: Running dma_fence_chain/wait_random
<6> [796.161375] wait_random: 4096 signals in 97386256 ns
With dma_fence_enable_signaling() replaced in selftests with dma_fence_wait()
(patches 1-3 of 4):
<6> [782.505692] dma-buf: Running dma_fence_chain/wait_forward
<6> [784.609213] wait_forward: 4096 signals in 36513103 ns
<3> [784.837226] Reported -4 for kthread_stop_put(0)!
<6> [785.147643] dma-buf: Running dma_fence_chain/wait_backward
<6> [806.367763] wait_backward: 4096 signals in 18428009499 ns
<6> [807.175325] dma-buf: Running dma_fence_chain/wait_random
<6> [809.453942] wait_random: 4096 signals in 119761950 ns
With the fix (patches 1-4 of 4):
<6> [731.519020] dma-buf: Running dma_fence_chain/wait_forward
<6> [733.623375] wait_forward: 4096 signals in 31890220 ns
<6> [734.258972] dma-buf: Running dma_fence_chain/wait_backward
<6> [736.267325] wait_backward: 4096 signals in 39007955 ns
<6> [736.700221] dma-buf: Running dma_fence_chain/wait_random
<6> [739.346706] wait_random: 4096 signals in 48384865 ns
Signaling time in wait_backward selftest has been reduced from 17s to 39ms.
[1] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152785v1/index.html?
[2] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152828v2/index.html?
[3] https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_152830v2/index.html?
>
> >>
> >> Additional to this messing with the fence ops outside of the dma_fence code is an absolute no-go.
> >
> > Could you please explain what piece of code you are referring to when you say
> > "messing with the fence ops outside the dma_fence code"? If not this patch
> > then which particular one of this series did you mean? I'm assuming you
> > didn't mean drm_syncobj code that I mentioned in my commit descriptions.
>
> See below.
>
> >
> > Thanks,
> > Janusz
> >
> >>
> >> Regards,
> >> Christian.
> >>
> >>>
> >>> Closes: https://gitlab.freedesktop.org/drm/i915/kernel/-/issues/12904
> >>> Suggested-by: Chris Wilson <chris.p.wilson@...ux.intel.com>
> >>> Signed-off-by: Janusz Krzysztofik <janusz.krzysztofik@...ux.intel.com>
> >>> ---
> >>> drivers/dma-buf/dma-fence-chain.c | 101 +++++++++++++++++++++++++-----
> >>> 1 file changed, 84 insertions(+), 17 deletions(-)
> >>>
> >>> diff --git a/drivers/dma-buf/dma-fence-chain.c b/drivers/dma-buf/dma-fence-chain.c
> >>> index a8a90acf4f34d..90eff264ee05c 100644
> >>> --- a/drivers/dma-buf/dma-fence-chain.c
> >>> +++ b/drivers/dma-buf/dma-fence-chain.c
> >>> @@ -119,46 +119,113 @@ static const char *dma_fence_chain_get_timeline_name(struct dma_fence *fence)
> >>> return "unbound";
> >>> }
> >>>
> >>> -static void dma_fence_chain_irq_work(struct irq_work *work)
> >>> +static void signal_irq_work(struct irq_work *work)
> >>> {
> >>> struct dma_fence_chain *chain;
> >>>
> >>> chain = container_of(work, typeof(*chain), work);
> >>>
> >>> - /* Try to rearm the callback */
> >>> - if (!dma_fence_chain_enable_signaling(&chain->base))
> >>> - /* Ok, we are done. No more unsignaled fences left */
> >>> - dma_fence_signal(&chain->base);
> >>> + dma_fence_signal(&chain->base);
> >>> dma_fence_put(&chain->base);
> >>> }
> >>>
> >>> -static void dma_fence_chain_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> >>> +static void signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> >>> +{
> >>> + struct dma_fence_chain *chain;
> >>> +
> >>> + chain = container_of(cb, typeof(*chain), cb);
> >>> + init_irq_work(&chain->work, signal_irq_work);
> >>> + irq_work_queue(&chain->work);
> >>> +}
> >>> +
> >>> +static void rearm_irq_work(struct irq_work *work)
> >>> +{
> >>> + struct dma_fence_chain *chain;
> >>> + struct dma_fence *prev;
> >>> +
> >>> + chain = container_of(work, typeof(*chain), work);
> >>> +
> >>> + rcu_read_lock();
> >>> + prev = rcu_dereference(chain->prev);
> >>> + if (prev && dma_fence_add_callback(prev, &chain->cb, signal_cb))
> >>> + prev = NULL;
> >>> + rcu_read_unlock();
> >>> + if (prev)
> >>> + return;
> >>> +
> >>> + /* Ok, we are done. No more unsignaled fences left */
> >>> + signal_irq_work(work);
> >>> +}
> >>> +
> >>> +static inline bool fence_is_signaled__nested(struct dma_fence *fence)
> >>> +{
> >>> + if (test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags))
> >>> + return true;
> >>> +
>
> >>> + if (fence->ops->signaled && fence->ops->signaled(fence)) {
>
> Calling this outside of dma-fence.[ch] is a clear no-go.
But this patch applies only to drivers/dma-buf/dma-fence-chain.c, not
outside of it.
Thanks,
Janusz
>
> Regards,
> Christian.
>
> >>> + unsigned long flags;
> >>> +
> >>> + spin_lock_irqsave_nested(fence->lock, flags, SINGLE_DEPTH_NESTING);
> >>> + dma_fence_signal_locked(fence);
> >>> + spin_unlock_irqrestore(fence->lock, flags);
> >>> +
> >>> + return true;
> >>> + }
> >>> +
> >>> + return false;
> >>> +}
> >>> +
> >>> +static bool prev_is_signaled(struct dma_fence_chain *chain)
> >>> +{
> >>> + struct dma_fence *prev;
> >>> + bool result;
> >>> +
> >>> + rcu_read_lock();
> >>> + prev = rcu_dereference(chain->prev);
> >>> + result = !prev || fence_is_signaled__nested(prev);
> >>> + rcu_read_unlock();
> >>> +
> >>> + return result;
> >>> +}
> >>> +
> >>> +static void rearm_or_signal_cb(struct dma_fence *f, struct dma_fence_cb *cb)
> >>> {
> >>> struct dma_fence_chain *chain;
> >>>
> >>> chain = container_of(cb, typeof(*chain), cb);
> >>> - init_irq_work(&chain->work, dma_fence_chain_irq_work);
> >>> + if (prev_is_signaled(chain)) {
> >>> + /* Ok, we are done. No more unsignaled fences left */
> >>> + init_irq_work(&chain->work, signal_irq_work);
> >>> + } else {
> >>> + /* Try to rearm the callback */
> >>> + init_irq_work(&chain->work, rearm_irq_work);
> >>> + }
> >>> +
> >>> irq_work_queue(&chain->work);
> >>> - dma_fence_put(f);
> >>> }
> >>>
> >>> static bool dma_fence_chain_enable_signaling(struct dma_fence *fence)
> >>> {
> >>> struct dma_fence_chain *head = to_dma_fence_chain(fence);
> >>> + int err = -ENOENT;
> >>>
> >>> - dma_fence_get(&head->base);
> >>> - dma_fence_chain_for_each(fence, &head->base) {
> >>> - struct dma_fence *f = dma_fence_chain_contained(fence);
> >>> + if (WARN_ON(!head))
> >>> + return false;
> >>>
> >>> - dma_fence_get(f);
> >>> - if (!dma_fence_add_callback(f, &head->cb, dma_fence_chain_cb)) {
> >>> + dma_fence_get(fence);
> >>> + if (head->fence)
> >>> + err = dma_fence_add_callback(head->fence, &head->cb, rearm_or_signal_cb);
> >>> + if (err) {
> >>> + if (prev_is_signaled(head)) {
> >>> dma_fence_put(fence);
> >>> - return true;
> >>> + } else {
> >>> + init_irq_work(&head->work, rearm_irq_work);
> >>> + irq_work_queue(&head->work);
> >>> + err = 0;
> >>> }
> >>> - dma_fence_put(f);
> >>> }
> >>> - dma_fence_put(&head->base);
> >>> - return false;
> >>> +
> >>> + return !err;
> >>> }
> >>>
> >>> static bool dma_fence_chain_signaled(struct dma_fence *fence)
> >>
> >>
> >
> >
> >
> >
>
>
Powered by blists - more mailing lists