linux-kernel - Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z73nYKsq14Pf6ucp@cassiopeiae>
Date: Tue, 25 Feb 2025 16:53:04 +0100
From: Danilo Krummrich <dakr@...nel.org>
To: Alexandre Courbot <acourbot@...dia.com>
Cc: Dave Airlie <airlied@...il.com>, Gary Guo <gary@...yguo.net>,
	Joel Fernandes <joel@...lfernandes.org>,
	Boqun Feng <boqun.feng@...il.com>,
	John Hubbard <jhubbard@...dia.com>, Ben Skeggs <bskeggs@...dia.com>,
	linux-kernel@...r.kernel.org, rust-for-linux@...r.kernel.org,
	nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
	Nouveau <nouveau-bounces@...ts.freedesktop.org>
Subject: Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice
 implementation

On Wed, Feb 26, 2025 at 12:23:40AM +0900, Alexandre Courbot wrote:
> On Wed Feb 26, 2025 at 12:06 AM JST, Danilo Krummrich wrote:
> > On Tue, Feb 25, 2025 at 11:11:07PM +0900, Alexandre Courbot wrote:
> >> On Mon Feb 24, 2025 at 9:07 PM JST, Danilo Krummrich wrote:
> >> > CC: Gary
> >> >
> >> > On Mon, Feb 24, 2025 at 10:40:00AM +0900, Alexandre Courbot wrote:
> >> >> This inability to sleep while we are accessing registers seems very
> >> >> constraining to me, if not dangerous. It is pretty common to have
> >> >> functions intermingle hardware accesses with other operations that might
> >> >> sleep, and this constraint means that in such cases the caller would
> >> >> need to perform guard lifetime management manually:
> >> >> 
> >> >>   let bar_guard = bar.try_access()?;
> >> >>   /* do something non-sleeping with bar_guard */
> >> >>   drop(bar_guard);
> >> >> 
> >> >>   /* do something that might sleep */
> >> >> 
> >> >>   let bar_guard = bar.try_access()?;
> >> >>   /* do something non-sleeping with bar_guard */
> >> >>   drop(bar_guard);
> >> >> 
> >> >>   ...
> >> >> 
> >> >> Failure to drop the guard potentially introduces a race condition, which
> >> >> will receive no compile-time warning and potentialy not even a runtime
> >> >> one unless lockdep is enabled. This problem does not exist with the
> >> >> equivalent C code AFAICT, which makes the Rust version actually more
> >> >> error-prone and dangerous, the opposite of what we are trying to achieve
> >> >> with Rust. Or am I missing something?
> >> >
> >> > Generally you are right, but you have to see it from a different perspective.
> >> >
> >> > What you describe is not an issue that comes from the design of the API, but is
> >> > a limitation of Rust in the kernel. People are aware of the issue and with klint
> >> > [1] there are solutions for that in the pipeline, see also [2] and [3].
> >> >
> >> > [1] https://rust-for-linux.com/klint
> >> > [2] https://github.com/Rust-for-Linux/klint/blob/trunk/doc/atomic_context.md
> >> > [3] https://www.memorysafety.org/blog/gary-guo-klint-rust-tools/
> >> 
> >> Thanks, I wasn't aware of klint and it looks indeed cool, even if not perfect by
> >> its own admission. But even if the ignore the safety issue, the other one
> >> (ergonomics) is still there.
> >> 
> >> Basically this way of accessing registers imposes quite a mental burden on its
> >> users. It requires a very different (and harsher) discipline than when writing
> >> the same code in C
> >
> > We need similar solutions in C too, see drm_dev_enter() / drm_dev_exit() and
> > drm_dev_unplug().
> 
> Granted, but the use of these is much more coarsed-grained than what is
> expected of IO resources, right?

Potentially, yes. But exactly this characteristic has been criticised [1].

[1] https://lore.kernel.org/nouveau/Z7XVfnnrRKrtQbB6@phenom.ffwll.local/

> 
> >
> >> and the correct granularity to use is unclear to me.
> >> 
> >> For instance, if I want to do the equivalent of Nouveau's nvkm_usec() to poll a
> >> particular register in a busy loop, should I call try_access() once before the
> >> loop? Or every time before accessing the register?
> >
> > I think we should re-acquire the guard in each iteration and drop it before the
> > delay. I think a simple closure would work very well for this pattern?
> >
> >> I'm afraid having to check
> >> that the resource is still alive before accessing any register is going to
> >> become tedious very quickly.
> >> 
> >> I understand that we want to protect against accessing the IO region of an
> >> unplugged device ; but still there is no guarantee that the device won't be
> >> unplugged in the middle of a critical section, however short. Thus the driver
> >> code should be able to recognize that the device has fallen off the bus when it
> >> e.g. gets a bunch of 0xff instead of a valid value. So do we really need to
> >> extra protection that AFAICT isn't used in C?
> >
> > As mentioned above, we already do similar things in C.
> >
> > Also, think about what's the alternative. If we remove the Devres wrapper of
> > pci::Bar, we lose the control over the lifetime of the bar mapping and it can
> > easily out-live the device / driver binding. This makes the API unsound.
> 
> Oh my issue is not with the Devres wrapper, I think it makes sense -
> it's more the use of RCU to control access to the resource that I find
> too constraining. And I'm pretty sure there will be more users of the
> same opinion as more drivers using it get written.

What do you suggest?

> 
> >
> > With this drivers would be able to keep resources acquired. What if after a
> > hotplug the physical address region is re-used and to be mapped by another
> > driver?
> 
> Actually - wouldn't that issue also be addressed by a PCI equivalent to
> drm_dev_enter() and friends that ensures the device (and thus its
> devres resources) stay in place?

I'm not sure I get the idea, but we can *not* have the device resources stay in
place once the device is unbound (e.g. keep the resource region acquired by the
driver).

Consequently, we have to have a way to revoke access to the corresponding
pci::Bar.

> 
> Using Rust, I can imagine (but not picture precisely yet) some method of
> the device that returns a reference to an inner structure containing its
> resources, available with immediate access. Since it would be
> coarser-grained, it could rely on something less constraining than RCU
> without a noticeable performance penalty.

We had similar attempts when we designed this API, i.e. a common Revocable in
the driver private data of a device. But it had some chicken-egg issues on
initialization in probe(). Besides that, it wouldn't get you rid of the
Revocable, since the corresponding resource are only valid while the driver is
bound to a device, not for the entire lifetime of the device.