Message-ID: <1769737351.3442.2180@nvidia.com>
Date: Thu, 29 Jan 2026 20:45:51 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: Gary Guo <gary@...yguo.net>
Cc: Danilo Krummrich <dakr@...nel.org>, Zhi Wang <zhiw@...dia.com>,
	linux-kernel@...r.kernel.org,
	Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
	Maxime Ripard <mripard@...nel.org>,
	Thomas Zimmermann <tzimmermann@...e.de>,
	David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
	Jonathan Corbet <corbet@....net>,
	Alex Deucher <alexander.deucher@....com>,
	Christian Koenig <christian.koenig@....com>,
	Jani Nikula <jani.nikula@...ux.intel.com>,
	Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
	Vivi Rodrigo <rodrigo.vivi@...el.com>,
	Tvrtko Ursulin <tursulin@...ulin.net>,
	Rui Huang <ray.huang@....com>,
	Matthew Auld <matthew.auld@...el.com>,
	Matthew Brost <matthew.brost@...el.com>,
	Lucas De Marchi <lucas.demarchi@...el.com>,
	Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
	Helge Deller <deller@....de>, Alice Ryhl <aliceryhl@...gle.com>,
	Miguel Ojeda <ojeda@...nel.org>,
	Alex Gaynor <alex.gaynor@...il.com>,
	Boqun Feng <boqun.feng@...il.com>,
	Bjorn Roy Baron <bjorn3_gh@...tonmail.com>,
	Benno Lossin <lossin@...nel.org>,
	Andreas Hindborg <a.hindborg@...nel.org>,
	Trevor Gross <tmgross@...ch.edu>,
	John Hubbard <jhubbard@...dia.com>,
	Alistair Popple <apopple@...dia.com>, Timur Tabi <ttabi@...dia.com>,
	Edwin Peer <epeer@...dia.com>,
	Alexandre Courbot <acourbot@...dia.com>,
	Andrea Righi <arighi@...dia.com>, Andy Ritger <aritger@...dia.com>,
	Alexey Ivanov <alexeyi@...dia.com>,
	Balbir Singh <balbirs@...dia.com>,
	Philipp Stanner <phasta@...nel.org>,
	Elle Rhumsaa <elle@...thered-steel.dev>,
	Daniel Almeida <daniel.almeida@...labora.com>,
	nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
	rust-for-linux@...r.kernel.org, linux-doc@...r.kernel.org,
	amd-gfx@...ts.freedesktop.org, intel-gfx@...ts.freedesktop.org,
	intel-xe@...ts.freedesktop.org, linux-fbdev@...r.kernel.org,
	Gary Guo <gary@...yguo.net>
Subject: Re: [PATCH RFC v6 05/26] nova-core: mm: Add support to use PRAMIN
 windows to write to VRAM

> On Jan 29, 2026, at 8:16 PM, Gary Guo <gary@...yguo.net> wrote:
>
> On Fri Jan 30, 2026 at 12:26 AM GMT, Joel Fernandes wrote:
>> Hi, Danilo, all,
>>
>> Based on the below discussion and research, I came up with some deadlock
>> scenarios that we need to handle in the v6 series of these patches. Please let
>> me know if I missed something below. So far, off the top of my head, I identified
>> that we are doing GFP_KERNEL memory allocations inside the GPU buddy allocator
>> during map/unmap. I will work on solutions for that. Thanks.
>>
>> All deadlock scenarios
>> ----------------------
>> The gist is, in the DMA fence signaling critical path we cannot acquire
>> resources (locks, memory allocations, etc.) that are already held while a
>> fence is being waited on to be signaled. So we have to be careful which
>> resources we acquire, and also which paths in the driver do memory allocations
>> under locks that we need in the DMA fence signaling critical path (when doing
>> the virtual memory map/unmap).
>
> When thinking about deadlocks it usually helps not to think in terms of detailed
> scenarios (which would be hard to enumerate and easy to miss), but rather in
> terms of the relative order of resource acquisition. All resources that you wait
> on would need to form a partial order. Any violation could result in deadlocks.
> This is also how lockdep checks.
>
> So to me all cases you listed are all the same...

Hmm, I am quite familiar with lockdep internals, but I don't see how all cases
are the same when there are different resources being acquired (locks versus
memory allocation, for instance). I think it helps to visualize different cases
based on different scenarios for a complete understanding of the issues, and
mild repetition is a good thing IMO - the goal is to not miss anything. But
agreed, that is how lockdep works: lockdep just needs those relationships in its
graph to know the ordering well enough to flag issues. Speaking of lockdep, I
have not checked, but we should probably add support for fence signal/wait and
resource dependencies, to catch any potential issues as well.

Thanks for taking a look,

--
Joel Fernandes



>
> Best,
> Gary
>
>>
>> 1. deadlock scenario 1: allocator deadlock (no locking needed to trigger it)
>>
>> Fence Signal start (A) -> Alloc -> MMU notifier/Shrinker (B) -> Fence Wait (A)
>>
>> ABA deadlock.
>>
>> 2. deadlock scenario 2: Same as 1, but ABBA scenario (2 CPUs).
>>
>> CPU 0: Fence Signal start (A) -> Alloc (B)
>>
>> CPU 1: Alloc -> MMU notifier or Shrinker (B) -> Fence Wait (A)
>>
>> 3. deadlock scenario 3: Same ABBA pattern as scenario 2, but with locking.
>>
>> CPU 0: Fence Signal start (A) -> Lock (B)
>>
>> CPU 1: Lock (B) -> Fence Wait (A)
>>
>> 4. deadlock scenario 4: Same as scenario 3, but the fence wait comes from the
>> allocation path.
>>
>> rule: We cannot try to acquire locks in the DMA fence signaling critical path if
>> those locks were already acquired in paths that do reclaimable memory allocations.
>>
>> CPU 0: Fence Signal (A) -> Lock (B)
>>
>> CPU 1: Lock (B) -> Alloc -> Fence Wait (A)
>>
>> 5. deadlock scenario 5: Transitive locking:
>>
>> rule: We cannot try to acquire locks in the DMA fence signaling critical path
>> that are transitively waiting on the same DMA fence.
>>
>> Fence Signal (A) -> Lock (B)
>>
>> Lock (B) -> Lock(C)
>>
>> Lock (C) -> Alloc -> Fence Wait (A)
>>
>> ABBCCA deadlock.
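One way to catch the allocation cases (scenarios 1, 2 and 4) at runtime would be
a debug guard that rejects reclaimable allocations while the current thread is
inside a fence-signalling section. The following is a userspace sketch of the
idea only; the names are illustrative and this is not an existing kernel API
(the kernel's actual annotation would be dma-fence/lockdep based):

```rust
use std::cell::Cell;

thread_local! {
    // Set while the current thread is inside a DMA fence-signalling
    // critical section (the "run" stage of a job).
    static IN_FENCE_SIGNALLING: Cell<bool> = Cell::new(false);
}

/// RAII guard marking the fence-signalling critical section.
/// Not nestable -- good enough for a sketch.
struct FenceSignallingSection;

impl FenceSignallingSection {
    fn enter() -> Self {
        IN_FENCE_SIGNALLING.with(|f| f.set(true));
        FenceSignallingSection
    }
}

impl Drop for FenceSignallingSection {
    fn drop(&mut self) {
        IN_FENCE_SIGNALLING.with(|f| f.set(false));
    }
}

/// Stand-in for a GFP_KERNEL (reclaimable) allocation: it may recurse
/// into reclaim, which may wait on a fence -- so it must be rejected
/// inside the signalling section.
fn alloc_reclaimable(bytes: usize) -> Result<Vec<u8>, &'static str> {
    if IN_FENCE_SIGNALLING.with(|f| f.get()) {
        return Err("reclaimable alloc inside fence-signalling section");
    }
    Ok(vec![0u8; bytes])
}

fn main() {
    assert!(alloc_reclaimable(4096).is_ok());
    let _guard = FenceSignallingSection::enter();
    assert!(alloc_reclaimable(4096).is_err());
}
```

The locking scenarios (3 and 5) are not caught by a flag like this; those need
the ordering graph that lockdep already maintains.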
>>
>>
>> --
>> Joel Fernandes
>>
>>> On 1/28/2026 7:04 AM, Danilo Krummrich wrote:
>>> On Fri Jan 23, 2026 at 12:16 AM CET, Joel Fernandes wrote:
>>>> My plan is to make TLB and PRAMIN use immutable references in their function
>>>> calls and then implement internal locking. I've already done this for the GPU
>>>> buddy functions, so it should be doable, and we'll keep it consistent. As a
>>>> result, we will have finer-grained locking on the memory management objects
>>>> instead of requiring a global lock on a common GpuMm object. I'll plan on
>>>> doing this for v7.
>>>>
>>>> Also, the PTE allocation race you mentioned is already handled by PRAMIN
>>>> serialization. Since threads must hold the PRAMIN lock to write page table
>>>> entries, concurrent writers are not possible:
>>>>
>>>>  Thread A: acquire PRAMIN lock
>>>>  Thread A: read PDE (via PRAMIN) -> NULL
>>>>  Thread A: alloc PT page, write PDE
>>>>  Thread A: release PRAMIN lock
>>>>
>>>>  Thread B: acquire PRAMIN lock
>>>>  Thread B: read PDE (via PRAMIN) -> sees A's pointer
>>>>  Thread B: uses existing PT page, no allocation needed
>>>
>>> This won't work unfortunately.
>>>
>>> We have to separate allocations and modifications of the page table. Or in other
>>> words, we must not allocate new PDEs or PTEs while holding the lock protecting
>>> the page table from modifications.
>>>
>>> Once we have VM_BIND in nova-drm, we will have the situation that userspace
>>> passes jobs to modify the GPU's virtual address space and hence the page tables.
>>>
>>> Such a job has mainly three stages.
>>>
>>>  (1) The submit stage.
>>>
>>>      This is where the job is initialized, dependencies are set up and the
>>>      driver has to pre-allocate all kinds of structures that are required
>>>      throughout the subsequent stages of the job.
>>>
>>>  (2) The run stage.
>>>
>>>      This is the stage where the job is staged for execution and its DMA fence
>>>      has been made public (i.e. it is accessible by userspace).
>>>
>>>      This is the stage where we are in the DMA fence signalling critical
>>>      section, hence we can't do any non-atomic allocations, since otherwise we
>>>      could deadlock in MMU notifier callbacks for instance.
>>>
>>>      This is the stage where the page table is actually modified. Hence, we
>>>      can't acquire any locks that might be held elsewhere while doing
>>>      non-atomic allocations. Also note that this is transitive, e.g. if you
>>>      take lock A and somewhere else a lock B is taken while A is already held
>>>      and we do non-atomic allocations while holding B, then A can't be held in
>>>      the DMA fence signalling critical path either.
>>>
>>>      It is also worth noting that this is the stage where we know the exact
>>>      operations we have to execute based on the VM_BIND request from userspace.
>>>
>>>      For instance, in the submit stage we may only know that userspace wants
>>>      us to map a BO with a certain offset into the GPU's virtual address space
>>>      at [0x0, 0x1000000]. What we don't know is which exact operations this
>>>      requires, i.e. "What do we have to unmap first?", "Are there any
>>>      overlapping mappings that we have to truncate?", etc.
>>>
>>>      So, we have to consider this when we pre-allocate in the submit stage.
>>>
>>>  (3) The cleanup stage.
>>>
>>>      This is where the job has been signaled and hence left the DMA fence
>>>      signalling critical section.
>>>
>>>      In this stage the job is cleaned up, which includes freeing data that is
>>>      not required anymore, such as PTEs and PDEs.
>
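The pre-allocate/consume/free split described above could look roughly like the
following. This is a userspace sketch with made-up names (`VmBindJob`, `PtPage`),
only to illustrate the staging; nova's actual structures will differ:

```rust
const PAGE_SIZE: usize = 4096;
type PtPage = Box<[u8; PAGE_SIZE]>;

/// A VM_BIND-style job, split into the three stages described above.
struct VmBindJob {
    /// Page-table pages reserved at submit time, consumed at run time.
    prealloc: Vec<PtPage>,
}

impl VmBindJob {
    /// Stage 1 (submit): allocations are still allowed here, so reserve
    /// the worst-case number of page-table pages the request could need.
    fn submit(worst_case_pages: usize) -> Self {
        VmBindJob {
            prealloc: (0..worst_case_pages)
                .map(|_| Box::new([0u8; PAGE_SIZE]))
                .collect(),
        }
    }

    /// Stage 2 (run): inside the DMA fence-signalling critical section.
    /// Only consume what submit reserved; never allocate here.
    fn run(&mut self, pages_needed: usize) -> Result<(), &'static str> {
        for _ in 0..pages_needed {
            let _pt_page = self.prealloc.pop().ok_or("under-reserved at submit")?;
            // ... write the PDE/PTE entries through the PRAMIN window ...
            // (the sketch just drops the page; a real driver would link
            // it into the page table here)
        }
        Ok(())
    }

    /// Stage 3 (cleanup): the fence has signalled, so we are outside the
    /// critical section again. Free the unused reserves (and, in a real
    /// driver, any PDEs/PTEs unlinked during run) here.
    fn cleanup(self) {
        drop(self.prealloc);
    }
}

fn main() {
    let mut job = VmBindJob::submit(4);
    job.run(2).expect("enough pages reserved at submit");
    job.cleanup();
}
```

The interesting failure mode is the `Err` in `run()`: if submit under-reserves,
the run stage has no safe way to allocate more, which is why the worst case has
to be computed up front.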
