Message-ID: <Z2MPu9oQNdXXhJkc@slm.duckdns.org>
Date: Wed, 18 Dec 2024 08:08:59 -1000
From: Tejun Heo <tj@...nel.org>
To: Matthew Brost <matthew.brost@...el.com>
Cc: Alex Deucher <alexdeucher@...il.com>, Chris Rankin <rankincj@...il.com>,
	Christian Koenig <christian.koenig@....com>,
	Tvrtko Ursulin <tvrtko.ursulin@...lia.com>,
	LKML <linux-kernel@...r.kernel.org>, amd-gfx@...ts.freedesktop.org
Subject: Re: [WARNING][AMDGPU] WQ_MEM_RECLAIM with Radeon RX 6600

Hello, sorry about the delay.

On Mon, Dec 16, 2024 at 04:34:00PM -0800, Matthew Brost wrote:
> > However, after further discussion, I think the warning is actually a
> > false positive.  See this discussion:
> > https://lists.freedesktop.org/archives/amd-gfx/2024-November/117349.html
> > 
> > From the thread:
> > "Question is - does check_flush_dependency() need to skip the
> > !WQ_MEM_RECLAIM flushing WQ_MEM_RECLAIM warning *if* the work is already
> > running *and* it was called from cancel_delayed_work_sync()?"
> > 
> 
> See my reply just now [1] — I’m going to have to disagree with AMD's
> assessment, but I’m not certain.
> 
> Again, I believe Tejun is the authority here.

I think we can skip the warning if the flushing is coming from
cancel*_work_sync(), as the flush only takes place if the work item already
has a worker running, i.e. it can't be blocked by lack of memory. Tvrtko, can
you write up a patch to exclude that condition from check_flush_dependency()?
I think it can just skip check_flush_dependency() when @from_cancel is set.
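
To make that concrete, here is a rough, untested sketch of the shape I have
in mind in kernel/workqueue.c. Plumbing @from_cancel down into
check_flush_dependency() is an assumption about how it could be wired up,
not existing code:

/*
 * Sketch only: bail out of the reclaim-dependency check when the flush
 * comes from the cancel path, because then we only wait for a worker
 * that is already running and no new execution needs to be guaranteed.
 */
static void check_flush_dependency(struct workqueue_struct *target_wq,
				   struct work_struct *target_work,
				   bool from_cancel)
{
	work_func_t target_func = target_work ? target_work->func : NULL;
	struct worker *worker;

	if (from_cancel || target_wq->flags & WQ_MEM_RECLAIM)
		return;

	worker = current_wq_worker();

	WARN_ONCE(current->flags & PF_MEMALLOC,
		  "workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%ps",
		  current->pid, current->comm, target_wq->name, target_func);
	WARN_ONCE(worker && ((worker->current_pwq->wq->flags &
			      (WQ_MEM_RECLAIM | __WQ_LEGACY)) == WQ_MEM_RECLAIM),
		  "workqueue: WQ_MEM_RECLAIM %s:%ps is flushing !WQ_MEM_RECLAIM %s:%ps",
		  worker->current_pwq->wq->name, worker->current_func,
		  target_wq->name, target_func);
}

The caller in start_flush_work() would then just forward its existing
@from_cancel argument.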

Taking a step back: if an actual dependency develops in the future, i.e.
memory reclaim actually blocking on GPU work items, one way to handle that
would be adding subsystem-wide workqueues so that the rescuer can be shared
across GPU drivers / devices. As long as they don't depend on each other to
make forward progress, which they most likely wouldn't, sharing a rescuer
across them is completely fine.
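
Just to illustrate the idea (the queue name and the work item are made up,
this isn't an existing interface):

/*
 * Sketch only: one subsystem-wide reclaim-safe workqueue whose single
 * rescuer thread is shared by all GPU drivers / devices, instead of
 * each driver allocating its own WQ_MEM_RECLAIM queue and rescuer.
 */
static struct workqueue_struct *gpu_reclaim_wq;

static int __init gpu_reclaim_wq_init(void)
{
	/* WQ_MEM_RECLAIM guarantees a rescuer kthread for this queue. */
	gpu_reclaim_wq = alloc_workqueue("gpu_reclaim", WQ_MEM_RECLAIM, 0);
	return gpu_reclaim_wq ? 0 : -ENOMEM;
}

Individual drivers would then queue their reclaim-critical work items on
the shared queue with queue_work(gpu_reclaim_wq, ...), rather than on a
per-driver WQ_MEM_RECLAIM workqueue.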

Thanks.

-- 
tejun
