linux-kernel - Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Z8bOjiQ3KXw-CkPo@pollux>
Date: Tue, 4 Mar 2025 10:57:34 +0100
From: Danilo Krummrich <dakr@...nel.org>
To: Jason Gunthorpe <jgg@...dia.com>
Cc: Boqun Feng <boqun.feng@...il.com>,
	Joel Fernandes <joelagnelf@...dia.com>,
	Alexandre Courbot <acourbot@...dia.com>,
	Dave Airlie <airlied@...il.com>, Gary Guo <gary@...yguo.net>,
	Joel Fernandes <joel@...lfernandes.org>,
	John Hubbard <jhubbard@...dia.com>, Ben Skeggs <bskeggs@...dia.com>,
	linux-kernel@...r.kernel.org, rust-for-linux@...r.kernel.org,
	nouveau@...ts.freedesktop.org, dri-devel@...ts.freedesktop.org,
	paulmck@...nel.org
Subject: Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice
 implementation

On Mon, Mar 03, 2025 at 05:50:02PM -0400, Jason Gunthorpe wrote:
> On Mon, Mar 03, 2025 at 08:36:34PM +0100, Danilo Krummrich wrote:
> > > > And yes, for *device resources* it is unsound if we do not ensure that the
> > > > device resource is actually dropped at device unbind.
> > > 
> > > Why not do a runtime validation then?
> > > 
> > > It would be easy to have an atomic counting how many devres objects
> > > are still alive.
> > 
> > (1) It would not be easy at all, if not impossible.
> > 
> > A Devres object doesn't know whether it's embedded in an Arc<Devres>, nor does
> > it know whether it is embedded in subsequent Arc containers, e.g.
> > Arc<Arc<Devres>>.
> 
> You aren't tracking that. If Rust has a problem, implement it in C:

Rust does not have a problem here. Arc is implemented as:

	pub struct Arc<T: ?Sized> {
	    ptr: NonNull<ArcInner<T>>,
	    _p: PhantomData<ArcInner<T>>,
	}

	#[repr(C)]
	struct ArcInner<T: ?Sized> {
	    refcount: Opaque<bindings::refcount_t>,
	    data: T,
	}

Which simply means that 'data' can't access its reference count.

You did identify this as a problem, since you think the correct solution is to
WARN() if an object is held alive across a certain boundary by previously taken
references. Where the actual solution (in Rust) would be to design things in a
way that this isn't possible in the first place.

Let me also point out that the current solution in Rust is perfectly aligned
with what you typically do in C conceptually.

In C you would call

	devm_request_region()
	devm_ioremap()

and you would store the resulting pointer in some driver private structure.

This structure may be reference counted and may outlive remove(). However, if
that's the case it doesn't prevent that the memory region is freed and the
memory mapping is unmapped by the corresponding devres callbacks.

We're following the exact same semantics in Rust, except that we don't leave the
driver private structure with an invalid pointer.

In contrast you keep proposing to turn that into some kind of busy loop in
remove(). If you think that's better, why don't you try to introduce this design
in C first?

> 
>   devm_rsgc_start(struct device *)
>   devm_rsgc_pci_iomap(struct device *,..)
>   devm_rsgc_pci_iounmap(struct device *,..)
>   devm_rsgc_fence(struct device *)
> 
> Surely that isn't a problem for bindings?

Wrapping those functions? Yes, that's not a problem in general.

But obviously I don't know what they're doing, i.e. how is devm_rsgc_pci_iomap()
different from pcim_iomap()?

I also don't know how you would encode them into an abstraction and how this
abstraction fits into the device / driver lifecycle.

Neither can I guess that from four function names, nor can I tell you whether
that results in a safe, sound and reasonable API.

And given that you don't seem to know about at least some Rust fundamentals
(i.e. how Arc works), I highly doubt that you can predict that either without
coding it up entirely, can you?

I explained the reasons for going with the current upstream design in detail,
and I'm happy to explain further if there are more questions.

But, if you just want to change the existing design, this isn't the place.
Instead, please follow the usual process, i.e. write up some patches, explain
how they improve the existing code and then we can discuss them.

> > The practical problem: Buggy drivers could (as you propose) stall the
> > corresponding task forever, never releasing the device resource. 
> 
> Should't be never. Rust never leaks memory? It will eventually be
> unstuck when Rust frees the memory in the concurrent context that is
> holding it.

We're passing Arc objects as ForeignOwnable to C code. If the C code leaks
memory, the corresponding memory allocated in Rust leaks too.

For the definition of "safe" in Rust we assume that C code used underneath is
never buggy, but the reality is different.

> 
> > Not releasing the device resource may stall subsequent drivers
> > trying to probe the device, or, if the physical memory region has
> > been reassigned to another device, prevent another device from
> > probing. This is *not* what I would call "function safely".
> 
> It didn't create a security problem.
> 
> The driver could also write 
> 
>  while (true)
>     sleep(1)

Seriously? Are you really arguing "My potentially infinite loop is not a problem
because a driver could also write the above"?

> In the remove function and achieve all of the above bad things, should
> you try to statically prevent that too?

Absolutely.