Message-ID: <c0a3ac65-e2e5-4b62-bc75-49b1599e160f@nvidia.com>
Date: Wed, 28 Jan 2026 10:27:07 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: Danilo Krummrich <dakr@...nel.org>
Cc: Zhi Wang <zhiw@...dia.com>, linux-kernel@...r.kernel.org,
Maarten Lankhorst <maarten.lankhorst@...ux.intel.com>,
Maxime Ripard <mripard@...nel.org>, Thomas Zimmermann <tzimmermann@...e.de>,
David Airlie <airlied@...il.com>, Simona Vetter <simona@...ll.ch>,
Jonathan Corbet <corbet@....net>, Alex Deucher <alexander.deucher@....com>,
Christian Koenig <christian.koenig@....com>,
Jani Nikula <jani.nikula@...ux.intel.com>,
Joonas Lahtinen <joonas.lahtinen@...ux.intel.com>,
Rodrigo Vivi <rodrigo.vivi@...el.com>, Tvrtko Ursulin
<tursulin@...ulin.net>, Huang Rui <ray.huang@....com>,
Matthew Auld <matthew.auld@...el.com>,
Matthew Brost <matthew.brost@...el.com>,
Lucas De Marchi <lucas.demarchi@...el.com>,
Thomas Hellstrom <thomas.hellstrom@...ux.intel.com>,
Helge Deller <deller@....de>, Alice Ryhl <aliceryhl@...gle.com>,
Miguel Ojeda <ojeda@...nel.org>, Alex Gaynor <alex.gaynor@...il.com>,
Boqun Feng <boqun.feng@...il.com>, Gary Guo <gary@...yguo.net>,
Bjorn Roy Baron <bjorn3_gh@...tonmail.com>, Benno Lossin
<lossin@...nel.org>, Andreas Hindborg <a.hindborg@...nel.org>,
Trevor Gross <tmgross@...ch.edu>, John Hubbard <jhubbard@...dia.com>,
Alistair Popple <apopple@...dia.com>, Timur Tabi <ttabi@...dia.com>,
Edwin Peer <epeer@...dia.com>, Alexandre Courbot <acourbot@...dia.com>,
Andrea Righi <arighi@...dia.com>, Andy Ritger <aritger@...dia.com>,
Alexey Ivanov <alexeyi@...dia.com>, Balbir Singh <balbirs@...dia.com>,
Philipp Stanner <phasta@...nel.org>, Elle Rhumsaa
<elle@...thered-steel.dev>, Daniel Almeida <daniel.almeida@...labora.com>,
nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
rust-for-linux@...r.kernel.org, linux-doc@...r.kernel.org,
amd-gfx@...ts.freedesktop.org, intel-gfx@...ts.freedesktop.org,
intel-xe@...ts.freedesktop.org, linux-fbdev@...r.kernel.org
Subject: Re: [PATCH RFC v6 05/26] nova-core: mm: Add support to use PRAMIN
windows to write to VRAM
On 1/28/2026 7:04 AM, Danilo Krummrich wrote:
> On Fri Jan 23, 2026 at 12:16 AM CET, Joel Fernandes wrote:
>> My plan is to make TLB and PRAMIN use immutable references in their function
>> calls and then implement internal locking. I've already done this for the GPU
>> buddy functions, so it should be doable, and we'll keep it consistent. As a
>> result, we will have finer-grained locking on the memory management objects
>> instead of requiring a global lock on a common GpuMm object. I plan to do
>> this for v7.
>>
>> Also, the PTE allocation race you mentioned is already handled by PRAMIN
>> serialization. Since threads must hold the PRAMIN lock to write page table
>> entries, concurrent writers are not possible:
>>
>> Thread A: acquire PRAMIN lock
>> Thread A: read PDE (via PRAMIN) -> NULL
>> Thread A: alloc PT page, write PDE
>> Thread A: release PRAMIN lock
>>
>> Thread B: acquire PRAMIN lock
>> Thread B: read PDE (via PRAMIN) -> sees A's pointer
>> Thread B: uses existing PT page, no allocation needed
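
For reference, the serialization I described above would look roughly like
this (a rough sketch only; all names are hypothetical and std's Mutex
merely stands in for the kernel lock type):

use std::sync::Mutex;

struct PraminWindow {
    pdes: Vec<u64>, // stand-in for the VRAM-backed page directory
}

impl PraminWindow {
    fn read_pde(&self, index: usize) -> u64 {
        self.pdes[index]
    }
    fn write_pde(&mut self, index: usize, val: u64) {
        self.pdes[index] = val;
    }
}

fn alloc_pt_page() -> u64 {
    0x1000 // stand-in for a real PT page allocation in VRAM
}

struct Pramin {
    // Interior lock so that methods can take &self, as planned for v7.
    window: Mutex<PraminWindow>,
}

impl Pramin {
    // Read the PDE; if empty, allocate a PT page and publish it. Both
    // steps happen under the same lock, so a second thread entering
    // later observes the pointer written by the first.
    fn get_or_alloc_pde(&self, index: usize) -> u64 {
        let mut win = self.window.lock().unwrap();
        match win.read_pde(index) {
            0 => {
                let pt = alloc_pt_page(); // Thread A: alloc PT page
                win.write_pde(index, pt); // Thread A: write PDE
                pt
            }
            pde => pde, // Thread B: sees A's pointer, no allocation
        }
    }
}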
>
> This won't work unfortunately.
>
> We have to separate allocations and modifications of the page table. Or in other
> words, we must not allocate new PDEs or PTEs while holding the lock protecting
> the page table from modifications.
I will go over these concerns. Just to clarify - do you mean forbidding
*any* lock, or only forbidding non-atomic locks? I believe we can avoid
non-atomic locks completely - in fact, I wrote a patch to do just that
before I read this email. If we have to forbid any locking at all, that
might require some careful redesign to handle the above race, AFAICS.
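
To make the question concrete, what I have in mind looks roughly like this
(a sketch only, with hypothetical names; std's Mutex stands in here for an
atomic-context-safe kernel lock such as a spinlock). The non-atomic
allocation happens with no locks held, and only the publish step runs
under the lock:

use std::sync::Mutex;

fn alloc_pt_page() -> u64 { 0x2000 } // stand-in for a VRAM allocation
fn free_pt_page(_page: u64) {}       // stand-in for the matching free

struct PageDir {
    // Kernel code would use an atomic-context-safe lock here.
    pdes: Mutex<Vec<u64>>,
}

impl PageDir {
    fn ensure_pde(&self, index: usize) -> u64 {
        // Speculatively allocate up front, with no locks held.
        let new_pt = alloc_pt_page();

        let (published, lost) = {
            let mut pdes = self.pdes.lock().unwrap();
            if pdes[index] == 0 {
                pdes[index] = new_pt; // publish our allocation
                (new_pt, None)
            } else {
                (pdes[index], Some(new_pt)) // lost the race
            }
        }; // lock dropped here

        if let Some(page) = lost {
            free_pt_page(page); // freeing also happens outside the lock
        }
        published
    }
}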
>
> Once we have VM_BIND in nova-drm, we will have the situation that userspace
> passes jobs to modify the GPU's virtual address space and hence the page tables.
Thanks for listing all the concerns below, this is very valuable. I will
go over all of these cases before posting v7, now that I have this list.
--
Joel Fernandes
> Such a job mainly has three stages.
>
> (1) The submit stage.
>
> This is where the job is initialized, dependencies are set up and the
> driver has to pre-allocate all kinds of structures that are required
> throughout the subsequent stages of the job.
>
> (2) The run stage.
>
> This is the stage where the job is staged for execution and its DMA fence
> has been made public (i.e. it is accessible by userspace).
>
> This is the stage where we are in the DMA fence signalling critical
> section, hence we can't do any non-atomic allocations, since otherwise we
> could deadlock in MMU notifier callbacks for instance.
>
> This is the stage where the page table is actually modified. Hence, we
> can't acquire any locks that might be held elsewhere while doing
> non-atomic allocations. Also note that this is transitive, e.g. if you
> take lock A and somewhere else a lock B is taken while A is already held
> and we do non-atomic allocations while holding B, then A can't be held in
> the DMA fence signalling critical path either.
>
> It is also worth noting that this is the stage where we know the exact
> operations we have to execute based on the VM_BIND request from userspace.
>
> For instance, in the submit stage we may only know that userspace wants
> us to map a BO with a certain offset into the GPU's virtual address
> space at [0x0, 0x1000000]. What we don't know is which exact operations
> this requires, i.e. "What do we have to unmap first?", "Are there any
> overlapping mappings that we have to truncate?", etc.
>
> So, we have to consider this when we pre-allocate in the submit stage.
>
> (3) The cleanup stage.
>
> This is where the job has been signalled and hence has left the DMA
> fence signalling critical section.
>
> In this stage the job is cleaned up, which includes freeing data that is
> not required anymore, such as PTEs and PDEs.
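
To check my understanding of the three stages before reworking this for
v7, here is a condensed sketch of the pattern as I read it (all types and
names are hypothetical, not actual nova or GPUVM API, and plain Rust
stands in for kernel types):

struct PtPage(u64);

fn alloc_vram_page() -> u64 { 0x3000 } // stand-in; may sleep
fn free_vram_page(_addr: u64) {}       // stand-in

struct BindJob {
    // Filled during submit: a worst-case set of page table pages, since
    // the exact map/unmap/truncate operations are unknown until run.
    prealloc: Vec<PtPage>,
    // Pages unlinked during run; they must not be freed until cleanup.
    retired: Vec<PtPage>,
}

impl BindJob {
    // (1) Submit: fallible, may sleep; the DMA fence is not public yet.
    fn submit(worst_case_pages: usize) -> Self {
        let prealloc = (0..worst_case_pages)
            .map(|_| PtPage(alloc_vram_page()))
            .collect();
        BindJob { prealloc, retired: Vec::new() }
    }

    // (2) Run: inside the DMA fence signalling critical section. Only
    // consumes pre-allocated pages; no allocations, and no locks that
    // are held elsewhere around non-atomic allocations.
    fn run(&mut self) {
        // Consume a pre-allocated page for a new PT level; stale pages
        // that get unlinked are parked in self.retired, not freed here.
        let _new_pt = self.prealloc.pop().expect("pre-allocation too small");
    }

    // (3) Cleanup: the fence has signalled; freeing is safe again.
    fn cleanup(self) {
        for page in self.prealloc.into_iter().chain(self.retired) {
            free_vram_page(page.0);
        }
    }
}

Does that match what you have in mind?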