linux-kernel - Re: [PATCH v4 1/7] drm/panthor: Add support for atomic page table updates

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250821135127.2827abfb@fedora>
Date: Thu, 21 Aug 2025 13:51:27 +0200
From: Boris Brezillon <boris.brezillon@...labora.com>
To: Steven Price <steven.price@....com>
Cc: Caterina Shablia <caterina.shablia@...labora.com>, Maarten Lankhorst
 <maarten.lankhorst@...ux.intel.com>, Maxime Ripard <mripard@...nel.org>,
 Thomas Zimmermann <tzimmermann@...e.de>, David Airlie <airlied@...il.com>,
 Simona Vetter <simona@...ll.ch>, Frank Binns <frank.binns@...tec.com>, Matt
 Coster <matt.coster@...tec.com>, Karol Herbst <kherbst@...hat.com>, Lyude
 Paul <lyude@...hat.com>, Danilo Krummrich <dakr@...nel.org>, Liviu Dudau
 <liviu.dudau@....com>, Lucas De Marchi <lucas.demarchi@...el.com>, Thomas
 Hellström <thomas.hellstrom@...ux.intel.com>, Rodrigo
 Vivi <rodrigo.vivi@...el.com>, dri-devel@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org, nouveau@...ts.freedesktop.org,
 intel-xe@...ts.freedesktop.org, asahi@...ts.linux.dev, Asahi Lina
 <lina@...hilina.net>
Subject: Re: [PATCH v4 1/7] drm/panthor: Add support for atomic page table
 updates

On Wed, 16 Jul 2025 16:43:24 +0100
Steven Price <steven.price@....com> wrote:

> >>> I think we need to briefly take vm->op_lock to ensure synchronisation
> >>> but that doesn't seem a big issue. Or perhaps there's a good reason that
> >>> I'm missing?  
> >>
> >> I think you're right, all other accesses to locked_region are guarded by
> >> op_lock. GPU job submit poke vm_active concurrently with vm_bind jobs doing
> >> region {,un}locks.  
> > Actually no, that's not necessary. Access to locked_region is protected by 
> > slots_lock, which is held here. Trying to lock vm->op_lock would also be 
> > detrimental here, because these locks are often taken together and slots_lock 
> > is taken after op_lock is taken, so taking op_lock here would be extremely 
> > deadlockful.  
> 
> It would obviously be necessary to acquire vm->op_lock before
> as.slots_lock as you say to avoid deadlocks. Note that as soon as
> as.slots_lock is held vm->op_lock can be dropped.

Yeah, lock ordering is not an issue, because we take slots_lock in this
function, so we're in full control of the ordering. And I wouldn't even
consider releasing op_lock as soon as we acquire slots_lock because

- that make things harder to reason about
- the locked section is not blocking on any sort of external event
- the locked section is pretty straightforward (so no excessive delays
expected here)

> 
> I just find the current approach a little odd, and unless there's a good
> reason for it would prefer that we don't enable a VM on a new address
> space while there's an outstanding vm_bind still running. Obviously if
> there's a good reason (e.g. we really do expect long running vm_bind
> operations) then that just need documenting in the commit message. But
> I'm not aware that's the case here.

I fully agree here. If there's no obvious reason to not serialize
vm_active() on VM bind ops, I'd opt for taking the VM op_lock and
calling it a day. And I honestly can't think of any:

- the VM op logic is all synchronous/non-blocking
- it's expected to be fast
- AS rotation is something I hope is not happening too often, otherwise
  we'll have other things to worry about (the whole CSG slot scheduling
  logic is quite involved, and I'd expect the
  BIND-while-making-AS-active to be rare enough that it becomes noise
  in the overall overhead of kernel-side GPU scheduling happening in
  Panthor)

> 
> Although in general I'm a bit wary of relying on the whole lock region
> feature - previous GPUs have an errata. But maybe I'm being over
> cautious there.

We're heavily relying on it already to allow updates of the VM while
the GPU is executing stuff. If that's problematic on v10+, I'd rather
know early :D.

Regards,

Boris