[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPf+ZLJ2KIsz+lZx@lstrano-desk.jf.intel.com>
Date: Tue, 21 Oct 2025 14:43:00 -0700
From: Matthew Brost <matthew.brost@...el.com>
To: Rob Herring <robh@...nel.org>
CC: Tomeu Vizoso <tomeu@...euvizoso.net>, Krzysztof Kozlowski
<krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>, Oded Gabbay
<ogabbay@...nel.org>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, "Sumit
Semwal" <sumit.semwal@...aro.org>, Christian König
<christian.koenig@....com>, Robin Murphy <robin.murphy@....com>, Steven Price
<steven.price@....com>, Daniel Stone <daniel@...ishbar.org>, Frank Li
<Frank.li@....com>, Sui Jingfeng <sui.jingfeng@...ux.dev>,
<devicetree@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<dri-devel@...ts.freedesktop.org>, <linux-media@...r.kernel.org>,
<linaro-mm-sig@...ts.linaro.org>
Subject: Re: [PATCH v5 2/2] accel: Add Arm Ethos-U NPU driver
On Sat, Oct 18, 2025 at 12:42:30AM -0700, Matthew Brost wrote:
> On Fri, Oct 17, 2025 at 11:43:51PM -0700, Matthew Brost wrote:
> > On Fri, Oct 17, 2025 at 10:37:46AM -0500, Rob Herring wrote:
> > > On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> > > > On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > > > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > > > relatively simple interface with single command stream to describe
> > > > > buffers, operation settings, and network operations. It supports up to 8
> > > > > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > > > > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > > > > for SRAM (like the downstream driver stack and compiler). Userspace
> > > > > doesn't need access to the SRAM.
> > >
> > > Thanks for the review.
> > >
> > > [...]
> > >
> > > > > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > > > > +{
> > > > > + struct ethosu_job *job = to_ethosu_job(sched_job);
> > > > > + struct ethosu_device *dev = job->dev;
> > > > > + struct dma_fence *fence = NULL;
> > > > > + int ret;
> > > > > +
> > > > > + if (unlikely(job->base.s_fence->finished.error))
> > > > > + return NULL;
> > > > > +
> > > > > + fence = ethosu_fence_create(dev);
> > > >
> > > > Another reclaim issue: ethosu_fence_create allocates memory using
> > > > GFP_KERNEL. Since we're already in the DMA fence signaling path
> > > > (reclaim), this can lead to a deadlock.
> > > >
> > > > Without too much thought, you likely want to move this allocation to
> > > > ethosu_job_do_push, but before taking dev->sched_lock or calling
> > > > drm_sched_job_arm.
> > > >
> > > > We really should fix the DRM scheduler work queue to be tainted with
> > > > reclaim. If I recall correctly, we'd need to update the work queue
> > > > layer. Let me look into that—I've seen this type of bug several times,
> > > > and lockdep should be able to catch it.
> > >
> > > Likely the rocket driver suffers from the same issues...
> > >
> >
> > I am not surprised by this statement.
> >
> > > >
> > > > > + if (IS_ERR(fence))
> > > > > + return fence;
> > > > > +
> > > > > + if (job->done_fence)
> > > > > + dma_fence_put(job->done_fence);
> > > > > + job->done_fence = dma_fence_get(fence);
> > > > > +
> > > > > + ret = pm_runtime_get_sync(dev->base.dev);
> > > >
> > > > I haven't looked at your PM design, but this generally looks quite
> > > > dangerous with respect to reclaim. For example, if your PM resume paths
> > > > allocate memory or take locks that allocate memory underneath, you're
> > > > likely to run into issues.
> > > >
> > > > A better approach would be to attach a PM reference to your job upon
> > > > creation and release it upon job destruction. That would be safer and
> > > > save you headaches in the long run.
> > >
> > > Our PM is nothing more than clock enable/disable and register init.
> > >
> > > If the runtime PM API doesn't work and needs special driver wrappers,
> > > then I'm inclined to just not use it and manage clocks directly (as
> > > that's all it is doing).
> > >
> >
> > Yes, then you’re probably fine. More complex drivers can do all sorts of
> > things during a PM wake, which is why PM wakes should generally be the
> > outermost layer. I still suggest, to future-proof your code, that you
> > move the PM reference to an outer layer.
> >
>
> Also, taking a PM reference in a function call — as opposed to tying it
> to a object's lifetime — is risky. It can quickly lead to imbalances in
> PM references if things go sideways or function calls become unbalanced.
> Depending on how your driver uses the DRM scheduler, this seems like a
> real possibility.
>
> Matt
>
> > > >
> > > > This is what we do in Xe [1] [2].
> > > >
> > > > Also, in general, this driver has been reviewed (RB’d), but it's not
> > > > great that I spotted numerous issues within just five minutes. I suggest
> > > > taking a step back and thoroughly evaluating everything this driver is
> > > > doing.
> > >
> > > Well, if it is hard to get simple drivers right, then it's a problem
> > > with the subsystem APIs IMO.
> > >
> >
> > Yes, agreed. We should have assertions and lockdep annotations in place
> > to catch driver-side misuses. This is the second driver I’ve randomly
> > looked at over the past year that has broken DMA fencing and reclaim
> > rules. I’ll take an action item to fix this in the DRM scheduler, but
> > I’m afraid I’ll likely break multiple drivers in the process as misuess
> > / lockdep will complain.
I've posted a series [1] for the DRM scheduler which will complain about the
things I've pointed out here.
Matt
[1] https://patchwork.freedesktop.org/series/156284/
> >
> > Matt
> >
> > > Rob
Powered by blists - more mailing lists