Message-ID: <CAH5fLgharxhAHrP6OFZxXrWKSTsMp=vY5sGvUKzca3yhRJEW7A@mail.gmail.com>
Date: Thu, 12 Jun 2025 10:15:38 +0200
From: Alice Ryhl <aliceryhl@...gle.com>
To: Benno Lossin <lossin@...nel.org>
Cc: Danilo Krummrich <dakr@...nel.org>, gregkh@...uxfoundation.org, rafael@...nel.org, 
	ojeda@...nel.org, alex.gaynor@...il.com, boqun.feng@...il.com, 
	gary@...yguo.net, bjorn3_gh@...tonmail.com, benno.lossin@...ton.me, 
	a.hindborg@...nel.org, tmgross@...ch.edu, chrisi.schrefl@...il.com, 
	rust-for-linux@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/3] rust: devres: fix race in Devres::drop()

On Thu, Jun 12, 2025 at 10:13 AM Benno Lossin <lossin@...nel.org> wrote:
>
> On Tue Jun 3, 2025 at 10:48 PM CEST, Danilo Krummrich wrote:
> > In Devres::drop() we first remove the devres action and then drop the
> > wrapped device resource.
> >
> > The design goal is to give the owner of a Devres object control over when
> > the device resource is dropped, but limit the overall scope to the
> > corresponding device being bound to a driver.
> >
> > However, there's a race that was introduced with commit 8ff656643d30
> > ("rust: devres: remove action in `Devres::drop`"), but also has been
> > (partially) present from the initial version on.
> >
> > In Devres::drop(), the devres action is removed successfully and
> > subsequently the destructor of the wrapped device resource runs.
> > However, there is no guarantee that the destructor of the wrapped device
> > resource completes before the driver core is done unbinding the
> > corresponding device.
> >
> > If in Devres::drop(), the devres action can't be removed, it means that
> > the devres callback has been executed already, or is still running
> > concurrently. In case of the latter, either Devres::drop() wins revoking
> > the Revocable or the devres callback wins revoking the Revocable. If
> > Devres::drop() wins, we (again) have no guarantee that the destructor of
> > the wrapped device resource completes before the driver core is done
> > unbinding the corresponding device.
>
> I don't understand the exact sequence of events here. Here is what I got
> from your explanation:
>
> * the driver created a `Devres<T>` associated to their device.
> * their physical device gets disconnected and thus the driver core
>   starts unbinding the device.
> * simultaneously, the driver drops the `Devres<T>` (e.g. because the
>   driver initiated the physical removal)
> * now `devres_callback` is being called from both `Devres::drop` (which
>   calls `Devres::remove_action`) and from the driver core.
> * they both call `inner.data.revoke()`, but only one wins, in our
>   example `Devres::drop`.
> * but now the driver core has finished running `devres_callback` and
>   finalizes unbinding the device, even though the `Devres` still
>   exists, though it is almost done being dropped.
>
> I don't see a race here. Also the `dev: ARef<Device>` should keep the
> device alive until the `Devres` is dropped, no?

The race is that Devres is used when the contents *must* be dropped
before the device is unbound. This example violates that by having
device unbind finish before the contents are dropped.
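To make the bad interleaving concrete, here is a minimal single-threaded
replay in plain userspace Rust; the function name and event log are
hypothetical, it only mirrors the ordering described above:

```rust
// Hypothetical replay of the problematic interleaving: both
// Devres::drop() and the devres callback reach revoke(), drop() wins,
// and (before this fix) device unbind completes while the resource
// destructor started by drop() may still be running.
fn replay() -> Vec<&'static str> {
    let mut log = Vec::new();
    let mut revoked = false;

    // Devres::drop(): removing the devres action failed because the
    // callback is already running, so it races on revoke() and wins.
    if !revoked {
        revoked = true;
        log.push("drop: won revoke, resource destructor starts");
    }

    // devres callback: loses the revoke race and, pre-fix, returns
    // immediately instead of waiting for the destructor to finish.
    if revoked {
        log.push("callback: lost revoke, returns without waiting");
    }
    log.push("core: device unbind completes");

    // Only now does the destructor started in drop() finish.
    log.push("drop: resource destructor finishes (too late)");
    log
}

fn main() {
    for event in replay() {
        println!("{event}");
    }
}
```

The broken invariant is visible in the last two log entries: unbind
completes before the resource destructor does.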

> > Depending on the specific device resource, this can potentially lead to
> > use-after-free bugs.
> >
> > In order to fix this, implement the following logic.
> >
> > In the devres callback, we're always good when we get to revoke the
> > device resource ourselves, i.e. Revocable::revoke() returns true.
> >
> > If Revocable::revoke() returns false, it means that Devres::drop(),
> > concurrently, already drops the device resource and we have to wait for
> > Devres::drop() to signal that it finished dropping the device resource.
> >
> > Note that if we hit the case where we need to wait for the completion of
> > Devres::drop() in the devres callback, it means that we're actually
> > racing with a concurrent Devres::drop() call, which already started
> > revoking the device resource for us. This is rather unlikely and means
> > that the concurrent Devres::drop() already started doing our work and we
> > just need to wait for it to complete it for us. Hence, there should not
> > be any additional overhead from that.
> >
> > (Actually, for now it's even better if Devres::drop() does the work for
> > us, since it can bypass the synchronize_rcu() call implied by
> > Revocable::revoke(), but this goes away anyways once I get to implement
> > the split devres callback approach, which allows us to first flip the
> > atomics of all registered Devres objects of a certain device, execute a
> > single synchronize_rcu() and then drop all revocable objects.)
> >
> > In Devres::drop() we try to revoke the device resource. If that is *not*
> > successful, it means that the devres callback already revoked it and
> > we're good.
> >
> > Otherwise, we try to remove the devres action, which, if successful,
> > means that we're good, since the device resource has just been revoked
> > by us *before* we removed the devres action successfully.
> >
> > If the devres action could not be removed, it means that the devres
> > callback must be running concurrently, hence we signal that the device
> > resource has been revoked by us, using the completion.
> >
> > This makes it safe to drop a Devres object from any task and at any point
> > of time, which is one of the design goals.
> >
> > Fixes: 8ff656643d30 ("rust: devres: remove action in `Devres::drop`") [1]
> > Reported-by: Alice Ryhl <aliceryhl@...gle.com>
> > Closes: https://lore.kernel.org/lkml/aD64YNuqbPPZHAa5@google.com/
> > Signed-off-by: Danilo Krummrich <dakr@...nel.org>
> > ---
> >  rust/kernel/devres.rs | 33 ++++++++++++++++++++++++++-------
> >  1 file changed, 26 insertions(+), 7 deletions(-)
>
> > @@ -161,7 +166,12 @@ fn remove_action(this: &Arc<Self>) {
> >          //         `DevresInner::new`.
> >          let inner = unsafe { Arc::from_raw(ptr) };
> >
> > -        inner.data.revoke();
> > +        if !inner.data.revoke() {
> > +            // If `revoke()` returns false, it means that `Devres::drop` already started revoking
> > +            // `inner.data` for us. Hence we have to wait until `Devres::drop()` signals that it
> > +            // completed revoking `inner.data`.
> > +            inner.revoke.wait_for_completion();
> > +        }
> >      }
> >  }
> >
> > @@ -232,6 +242,15 @@ fn deref(&self) -> &Self::Target {
> >
> >  impl<T> Drop for Devres<T> {
> >      fn drop(&mut self) {
> > -        DevresInner::remove_action(&self.0);
> > +        // SAFETY: When `drop` runs, it is guaranteed that nobody is accessing the revocable data
> > +        // anymore, hence it is safe not to wait for the grace period to finish.
> > +        if unsafe { self.revoke_nosync() } {
> > +            // We revoked `self.0.data` before the devres action did, hence try to remove it.
> > +            if !DevresInner::remove_action(&self.0) {
>
> Shouldn't the `!` be removed here? (i.e. 's/!//')
>
> Otherwise `remove_action` returns `true`, which gets negated, so we
> skip the code below; then `inner.data.revoke()` in `devres_callback`
> returns `false`, which also gets negated, so it waits on a completion
> that is never signaled and thus never returns.
>
> ---
> Cheers,
> Benno
>
> > +                // We could not remove the devres action, which means that it now runs concurrently,
> > +                // hence signal that `self.0.data` has been revoked successfully.
> > +                self.0.revoke.complete_all();
> > +            }
> > +        }
> >      }
> >  }
>
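For reference, the drop/callback handshake the patch describes can be
sketched in userspace Rust, with an AtomicBool standing in for the
Revocable's revoked state and a Mutex/Condvar pair standing in for the
kernel's struct completion; all names here are hypothetical:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

// Hypothetical model: `revoked` plays the role of the Revocable's
// atomic state, (Mutex<bool>, Condvar) plays the role of the kernel's
// struct completion.
struct Inner {
    revoked: AtomicBool,
    done: Mutex<bool>,
    cv: Condvar,
}

impl Inner {
    // Like Revocable::revoke(): returns true only for the single winner.
    fn try_revoke(&self) -> bool {
        self.revoked
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    // Like complete_all() on a struct completion.
    fn complete_all(&self) {
        *self.done.lock().unwrap() = true;
        self.cv.notify_all();
    }

    // Like wait_for_completion().
    fn wait_for_completion(&self) {
        let mut finished = self.done.lock().unwrap();
        while !*finished {
            finished = self.cv.wait(finished).unwrap();
        }
    }
}

// Devres callback path: if we lose the revoke race, Devres::drop() is
// dropping the resource concurrently; wait until it signals completion.
fn devres_callback(inner: &Inner) {
    if !inner.try_revoke() {
        inner.wait_for_completion();
    }
}

// Devres::drop() path: `action_removed` models whether removing the
// devres action succeeded. If it did not, the callback is running
// concurrently and must be told the resource is gone.
fn devres_drop(inner: &Inner, action_removed: bool) {
    if inner.try_revoke() {
        // ...the resource destructor runs here, before we signal...
        if !action_removed {
            inner.complete_all();
        }
    }
}

fn run_race() -> bool {
    let inner = Arc::new(Inner {
        revoked: AtomicBool::new(false),
        done: Mutex::new(false),
        cv: Condvar::new(),
    });
    let cb = {
        let inner = Arc::clone(&inner);
        thread::spawn(move || devres_callback(&inner))
    };
    // Simulate the case where removing the action failed, i.e. the
    // callback is concurrently in flight.
    devres_drop(&inner, false);
    cb.join().unwrap();
    inner.revoked.load(Ordering::Acquire)
}

fn main() {
    assert!(run_race());
    println!("callback returned only after the resource drop was signaled");
}
```

Whichever side wins try_revoke() drops the resource; the loser either
returns immediately (callback won) or waits on the completion (drop
won), so the callback can never return while the destructor is still
running.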
