Message-ID: <70d88203-2fe1-4bde-b254-51e8107744eb@nvidia.com>
Date: Fri, 30 Jan 2026 16:14:14 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: John Hubbard <jhubbard@...dia.com>
Cc: Danilo Krummrich <dakr@...nel.org>, Zhi Wang <zhiw@...dia.com>,
linux-kernel@...r.kernel.org,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Jonathan Corbet <corbet@....net>, Alex Deucher <alexander.deucher@....com>,
Christian Koenig <christian.koenig@....com>,
Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
Rodrigo Vivi <rodrigo.vivi@...el.com>, Tvrtko Ursulin
<tursulin@...ulin.net>, Rui Huang <ray.huang@....com>,
Matthew Auld <matthew.auld@...el.com>,
Matthew Brost <matthew.brost@...el.com>,
Lucas De Marchi <lucas.demarchi@...el.com>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Helge Deller <deller@....de>, Alice Ryhl <aliceryhl@...gle.com>,
Miguel Ojeda <ojeda@...nel.org>, Alex Gaynor <alex.gaynor@...il.com>,
Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
Bjorn Roy Baron <bjorn3_gh@...tonmail.com>, Benno Lossin
<lossin@...nel.org>, Andreas Hindborg <a.hindborg@...nel.org>,
Trevor Gross <tmgross@...ch.edu>, Alistair Popple <apopple@...dia.com>,
Timur Tabi <ttabi@...dia.com>, Edwin Peer <epeer@...dia.com>,
Alexandre Courbot <acourbot@...dia.com>, Andrea Righi <arighi@...dia.com>,
Andy Ritger <aritger@...dia.com>, Alexey Ivanov <alexeyi@...dia.com>,
Balbir Singh <balbirs@...dia.com>, Philipp Stanner <phasta@...nel.org>,
Elle Rhumsaa <elle@...thered-steel.dev>,
Daniel Almeida <daniel.almeida@...labora.com>,
nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
rust-for-linux@...r.kernel.org, linux-doc@...r.kernel.org,
amd-gfx@...ts.freedesktop.org, intel-gfx@...ts.freedesktop.org,
intel-xe@...ts.freedesktop.org, linux-fbdev@...r.kernel.org
Subject: Re: [PATCH RFC v6 05/26] nova-core: mm: Add support to use PRAMIN
windows to write to VRAM
On 1/29/2026 10:38 PM, John Hubbard wrote:
> On 1/29/26 5:59 PM, Joel Fernandes wrote:
>> On 1/29/26 8:12 PM, John Hubbard wrote:
>>> On 1/29/26 4:26 PM, Joel Fernandes wrote:
>>>> Based on the below discussion and research, I came up with some deadlock
>>>> scenarios that we need to handle in the v6 series of these patches.
>>>> [...]
>>>> memory allocations under locks that we need in the dma-fence signaling
>>>> critical path (when doing the virtual memory map/unmap)
>>>
>>> unmap? Are you seeing any allocations happening during unmap? I don't
>>> immediately see any, but that sounds surprising.
>>
>> Not allocations but we are acquiring locks during unmap. My understanding
>> is (at least some) unmaps have to also be done in the dma fence signaling
>> critical path (the run stage), but Danilo/you can correct me if I am wrong
>> on that. We cannot avoid all locking, but those same locks cannot be held in
>> any other paths which do a memory allocation (as mentioned in one of the
>> deadlock scenarios); that is probably the main thing to check for unmap.
>>
>
> Right, OK we are on the same page now: no allocations happening on unmap,
> but it can still deadlock, because the driver is typically going to
> use a single lock to protect both map- and unmap-related calls
> to the buddy allocator.
Yes exactly!
>
> For the deadlock above, I think a good way to break that deadlock is
> to not allow taking that lock in a fence signaling calling path.
>
> So during an unmap, instead of "lock, unmap/free, unlock" it should
> move the item to a deferred-free list, which is processed separately.
> Of course, this is a little complex, because the allocation and reclaim
> has to be aware of such lists if they get large.
Yes, and also avoiding GFP_KERNEL allocations while holding any of these mm
locks (whichever ones we take during map). The GPU buddy actually does
GFP_KERNEL allocations internally, which is problematic.
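To spell the cycle out: the map path does "take mm lock -> GFP_KERNEL
allocation", GFP_KERNEL can enter direct reclaim, reclaim can end up waiting
on a dma-fence, and signaling that fence needs an unmap that takes the same
mm lock. For the deferred-free idea you describe, a very rough C sketch of
what I understood you to mean (every name below is made up for illustration,
nothing of this exists in the series, and the real thing would be Rust):

#include <linux/llist.h>
#include <linux/mutex.h>
#include <linux/workqueue.h>

struct vram_alloc {
	struct llist_node free_node;
	/* buddy blocks, VA range, ... */
};

static LLIST_HEAD(deferred_frees);
static DEFINE_MUTEX(vram_lock);		/* the single map/unmap lock */

static void vram_buddy_free(struct vram_alloc *va)
{
	/* hypothetical: hand the blocks back to the GPU buddy */
}

/* Worker context: taking vram_lock (and triggering reclaim) is fine here. */
static void deferred_free_worker(struct work_struct *work)
{
	struct llist_node *list = llist_del_all(&deferred_frees);
	struct vram_alloc *va, *tmp;

	mutex_lock(&vram_lock);
	llist_for_each_entry_safe(va, tmp, list, free_node)
		vram_buddy_free(va);
	mutex_unlock(&vram_lock);
}
static DECLARE_WORK(deferred_free_work, deferred_free_worker);

/* Fence-signaling (run stage) path: no locks, no allocations. */
static void vram_unmap_deferred(struct vram_alloc *va)
{
	llist_add(&va->free_node, &deferred_frees);
	schedule_work(&deferred_free_work);
}

The run stage then only does a lock-free llist_add(), and the actual buddy
frees happen from the worker, where holding vram_lock is legal. As you said,
the allocation side would also need to be aware of (or flush) this list if it
grows large.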
Some solutions / next steps:
1. Allocating (VRAM and system memory) outside the mm locks, just before
acquiring them.
2. Pre-allocating both the VRAM and the system memory needed before the DMA
fence critical paths (the open issue is figuring out how much memory to
pre-allocate for the page-table pages based on the VM_BIND request; I think we
can analyze the page tables in the submit stage to make an estimate; a rough
sketch of this approach follows the list).
3. Unfortunately, I am using gpu-buddy when allocating a VA range in the Vmm
(called virt_buddy), which itself does GFP_KERNEL memory allocations in its
allocate path. I am not sure what to do about this yet. ISTR the maple tree
also has similar issues.
4. Using memory allocations that do not enter reclaim where pre-allocation or
pre-allocated memory pools are not possible (I'd like to avoid this option so
we don't fail allocations when memory is scarce).
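For #2, a rough sketch of the pre-fill idea (again, all names are invented;
the 'need' estimate would come from walking the VM_BIND request / page tables
in the submit stage):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

/* Per-job pool of pre-allocated page-table pages, filled at submit time. */
struct pt_page_pool {
	struct list_head pages;		/* INIT_LIST_HEAD() at job creation */
	unsigned int nr;
};

/* Submit stage: sleeping/reclaiming allocations are still legal here. */
static int pt_pool_prefill(struct pt_page_pool *pool, unsigned int need)
{
	while (pool->nr < need) {
		struct page *p = alloc_page(GFP_KERNEL | __GFP_ZERO);

		if (!p)
			return -ENOMEM;
		list_add(&p->lru, &pool->pages);
		pool->nr++;
	}
	return 0;
}

/* Run stage (fence-signaling path): only pops, never allocates or sleeps. */
static struct page *pt_pool_get(struct pt_page_pool *pool)
{
	struct page *p;

	if (WARN_ON(list_empty(&pool->pages)))
		return NULL;	/* submit-stage estimate was too small */
	p = list_first_entry(&pool->pages, struct page, lru);
	list_del(&p->lru);
	pool->nr--;
	return p;
}

With something like this, the run stage never passes a GFP flag at all, and an
underestimate shows up as a WARN rather than a reclaim deadlock.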
Will work on these issues for the v7. Thanks,
--
Joel Fernandes