lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPM3J2jZcct7ODIp@lstrano-desk.jf.intel.com>
Date: Fri, 17 Oct 2025 23:43:51 -0700
From: Matthew Brost <matthew.brost@...el.com>
To: Rob Herring <robh@...nel.org>
CC: Tomeu Vizoso <tomeu@...euvizoso.net>, Krzysztof Kozlowski
	<krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>, Oded Gabbay
	<ogabbay@...nel.org>, Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>, "Sumit
 Semwal" <sumit.semwal@...aro.org>, Christian König
	<christian.koenig@....com>, Robin Murphy <robin.murphy@....com>, Steven Price
	<steven.price@....com>, Daniel Stone <daniel@...ishbar.org>, Frank Li
	<Frank.li@....com>, Sui Jingfeng <sui.jingfeng@...ux.dev>,
	<devicetree@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<dri-devel@...ts.freedesktop.org>, <linux-media@...r.kernel.org>,
	<linaro-mm-sig@...ts.linaro.org>
Subject: Re: [PATCH v5 2/2] accel: Add Arm Ethos-U NPU driver

On Fri, Oct 17, 2025 at 10:37:46AM -0500, Rob Herring wrote:
> On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> > On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > relatively simple interface with single command stream to describe
> > > buffers, operation settings, and network operations. It supports up to 8
> > > memory regions (though no h/w bounds on a region). The Ethos NPUs
> > > are designed to use an SRAM for scratch memory. Region 2 is reserved
> > > for SRAM (like the downstream driver stack and compiler). Userspace
> > > doesn't need access to the SRAM.
> 
> Thanks for the review.
> 
> [...]
> 
> > > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > > +{
> > > +	struct ethosu_job *job = to_ethosu_job(sched_job);
> > > +	struct ethosu_device *dev = job->dev;
> > > +	struct dma_fence *fence = NULL;
> > > +	int ret;
> > > +
> > > +	if (unlikely(job->base.s_fence->finished.error))
> > > +		return NULL;
> > > +
> > > +	fence = ethosu_fence_create(dev);
> > 
> > Another reclaim issue: ethosu_fence_create allocates memory using
> > GFP_KERNEL. Since we're already in the DMA fence signaling path
> > (reclaim), this can lead to a deadlock.
> > 
> > Without too much thought, you likely want to move this allocation to
> > ethosu_job_do_push, but before taking dev->sched_lock or calling
> > drm_sched_job_arm.
> > 
> > We really should fix the DRM scheduler work queue to be tainted with
> > reclaim. If I recall correctly, we'd need to update the work queue
> > layer. Let me look into that—I've seen this type of bug several times,
> > and lockdep should be able to catch it.
> 
> Likely the rocket driver suffers from the same issues...
> 

I am not surprised by this statement.

> > 
> > > +	if (IS_ERR(fence))
> > > +		return fence;
> > > +
> > > +	if (job->done_fence)
> > > +		dma_fence_put(job->done_fence);
> > > +	job->done_fence = dma_fence_get(fence);
> > > +
> > > +	ret = pm_runtime_get_sync(dev->base.dev);
> > 
> > I haven't looked at your PM design, but this generally looks quite
> > dangerous with respect to reclaim. For example, if your PM resume paths
> > allocate memory or take locks that allocate memory underneath, you're
> > likely to run into issues.
> > 
> > A better approach would be to attach a PM reference to your job upon
> > creation and release it upon job destruction. That would be safer and
> > save you headaches in the long run.
> 
> Our PM is nothing more than clock enable/disable and register init. 
> 
> If the runtime PM API doesn't work and needs special driver wrappers, 
> then I'm inclined to just not use it and manage clocks directly (as 
> that's all it is doing).
> 

Yes, then you’re probably fine. More complex drivers can do all sorts of
things during a PM wake, which is why PM wakes should generally be the
outermost layer. I still suggest, to future-proof your code, that you
move the PM reference to an outer layer.

> > 
> > This is what we do in Xe [1] [2].
> > 
> > Also, in general, this driver has been reviewed (RB’d), but it's not
> > great that I spotted numerous issues within just five minutes. I suggest
> > taking a step back and thoroughly evaluating everything this driver is
> > doing.
> 
> Well, if it is hard to get simple drivers right, then it's a problem 
> with the subsystem APIs IMO.
> 

Yes, agreed. We should have assertions and lockdep annotations in place
to catch driver-side misuses. This is the second driver I’ve randomly
looked at over the past year that has broken DMA fencing and reclaim
rules. I’ll take an action item to fix this in the DRM scheduler, but
I’m afraid I’ll likely break multiple drivers in the process as misuess
/ lockdep will complain. 

Matt

> Rob

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ