Message-ID: <30d48cd600c1aab81d5495c13930af926ecc2380.camel@mailbox.org>
Date: Tue, 25 Nov 2025 14:20:52 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Daniel Almeida <daniel.almeida@...labora.com>, Philipp Stanner
	 <phasta@...nel.org>
Cc: Alice Ryhl <aliceryhl@...gle.com>, Danilo Krummrich <dakr@...nel.org>, 
 Christian König <ckoenig.leichtzumerken@...il.com>,
 Tvrtko Ursulin <tursulin@...ulin.net>,  Alexandre Courbot
 <acourbot@...dia.com>, Boris Brezillon <boris.brezillon@...labora.com>,
 Dave Airlie <airlied@...hat.com>, Lyude Paul <lyude@...hat.com>, Peter
 Colberg <pcolberg@...hat.com>, dri-devel@...ts.freedesktop.org, 
 linux-kernel@...r.kernel.org, rust-for-linux@...r.kernel.org
Subject: Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue sceleton

On Mon, 2025-11-24 at 10:58 -0300, Daniel Almeida wrote:
> Hi Phillip,
> 
> > On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@...nel.org> wrote:
> > 
> > 

[…]

> > +use crate::{
> > +    prelude::*,
> > +    types::ARef,
> > +};
> > +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
> > +use kernel::list::*;
> > +use kernel::revocable::Revocable;
> > +
> > +
> > +#[pin_data]
> > +pub struct Job<T: ?Sized> {
> > +    credits: u32,
> > +//    dependencies: List, // TODO implement dependency list
> 
> I am assuming that this will be a list of callbacks?

That's supposed to become the list of DmaFences which are to be
treated as dependencies of this job.

Only once all fences in this list are signaled will the JQ push that
job.
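To illustrate the idea with a plain-Rust sketch (hypothetical names, std only, not the kernel API): each job tracks how many dependency fences are still unsignaled, each fence-signal callback decrements that count, and the job only becomes pushable once the count reaches zero.

```
// Sketch: dependency counting for a job. Each dependency fence's signal
// callback calls dependency_signaled(); the job may be pushed once the
// last one fires.
use std::sync::atomic::{AtomicU32, Ordering};

struct SketchJob {
    pending_deps: AtomicU32, // number of dependency fences not yet signaled
}

impl SketchJob {
    fn new(nr_deps: u32) -> Self {
        Self { pending_deps: AtomicU32::new(nr_deps) }
    }

    // Called from each dependency fence's signal callback. Returns true
    // exactly when the last dependency just signaled.
    fn dependency_signaled(&self) -> bool {
        self.pending_deps.fetch_sub(1, Ordering::AcqRel) == 1
    }

    fn is_ready(&self) -> bool {
        self.pending_deps.load(Ordering::Acquire) == 0
    }
}

fn main() {
    let job = SketchJob::new(2);
    assert!(!job.dependency_signaled()); // first fence: not ready yet
    assert!(job.dependency_signaled());  // second fence: now pushable
    assert!(job.is_ready());
}
```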

> 
> > +    #[pin]
> > +    data: T,
> > +}
> > +
> > +impl<T> Job<T> {
> > +    /// Create a new job that can be submitted to [`Jobqueue`].
> > +    ///
> > +    /// Jobs contain driver data that will later be made available to the driver's
> > +    /// run_job() callback in which the job gets pushed to the GPU.
> > +    pub fn new(credits: u32, data: impl PinInit<T>) -> Result<Pin<KBox<Self>>> {
> > +        let job = pin_init!(Self {
> > +            credits,
> > +            data <- data,
> > +        });
> > +
> > +        KBox::pin_init(job, GFP_KERNEL)
> > +    }
> > +
> > +    /// Add a callback to the job. When the job gets submitted, all added callbacks will be
> > +    /// registered on the [`DmaFence`] the jobqueue returns for that job.
> > +    pub fn add_callback() -> Result {
> 
> Can’t we take all the callbacks at submission time?

To clarify the terminology: a "callback" here means a callback which
the JQ shall register on the done_fence returned by
Jobqueue::submit_job().

> > +        Ok(())
> > +    }
> > +
> > +    /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
> > +    /// will only be executed after that dependency has been finished.
> > +    pub fn add_dependency() -> Result {
> 
> Which would let us remove this ^

It would allow for removing this function, but you'd then just have an
optional (some jobs have no dependencies) function parameter in
Jobqueue::submit_job().

The current idea looks like this:

```
let jobq = JobQueue::new(…);
let job = Job::new(driver_data);

job.add_dependency(done_fence_of_shader_in_another_context); // optional
job.add_callback(cb_that_will_wake_userspace_or_sth); // optional

let done_fence = jobq.submit_job(job)?;
```

The JQ eats the job (ownership transfer), so by design you have to set
all dependencies and specify everything that shall be done when the job
finishes _before_ submitting the job.

I think an API in this form makes the order of events very obvious to
the user?


What happens behind the scenes is that the JQ registers all the
callbacks on the done_fence returned above. I'm not super sure about
this design idea; it's certainly optional. However, it has the
advantage of freeing the JQ user from dealing with races on the
done_fence.

Otherwise one would have to do something like

```
let done_fence = jobq.submit_job(job)?;

let err = done_fence.register_callback(my_drivers_cb);
if err.was_race_and_is_already_signaled() {
    execute_cb_code_myself_now();
}
```


> 
> > +        // TODO: Enqueue passed DmaFence into the job's dependency list.
> > +        Ok(())
> > +    }
> > +
> > +    /// Check if there are dependencies for this job. Register the jobqueue
> > +    /// waker if yes.
> > +    fn arm_deps() -> Result {
> 
> I wonder if “check_dependencies” would be a better name? Or something
> along these lines.

ACK.

> 
> > +        // TODO: Register DependencyWaker here if applicable.
> > +        Ok(())
> > +    }
> > +}
> > +
> > +// Dummy trait for the linked list.
> > +trait JobData {
> 
> > +    fn access_data(&self) -> i32;
> 
> Can’t we dereference to the data?

That's dummy code that only exists because so far I've been failing to
get even the basic List to work.

> 
> > +}
> > +
> > +#[pin_data]
> > +struct EnqueuedJob<T: ?Sized> {
> > +    inner: Pin<KBox<Job<T>>>,
> > +    #[pin]
> > +    links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
> 
> Why not a KVec? A queue type can hold a KVec of enqueued jobs, and this can
> hold an Arc of the queue type.

My understanding is that KVec is not intended to be the data structure
for this?

KVec is basically like a realloc()'d array in C: an array of same-sized
elements.

The JQ, hypothetically, can hold an unbounded number of members in its
waiting_list; only the running_list is limited by the credit count.
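A plain-Rust sketch of those two lists (hypothetical names, std collections instead of the kernel's intrusive List): an unbounded FIFO waiting_list, and a running_list that only admits jobs while the remaining credit budget covers their cost.

```
// Sketch: credit-limited job admission. Jobs wait in FIFO order; they move
// to running only while enough credits remain, and completions free credits
// and pull the next waiters in.
use std::collections::VecDeque;

struct SketchQueue {
    capacity: u32,          // credits still available on the ring
    waiting: VecDeque<u32>, // job costs, unbounded, FIFO
    running: Vec<u32>,      // jobs on the hardware, bounded by credits
}

impl SketchQueue {
    fn new(capacity: u32) -> Self {
        Self { capacity, waiting: VecDeque::new(), running: Vec::new() }
    }

    fn submit(&mut self, cost: u32) {
        self.waiting.push_back(cost);
        self.pump();
    }

    // Move jobs from waiting to running while credits allow, in FIFO order.
    fn pump(&mut self) {
        while let Some(&cost) = self.waiting.front() {
            if cost > self.capacity {
                break;
            }
            self.waiting.pop_front();
            self.capacity -= cost;
            self.running.push(cost);
        }
    }

    // The hardware signaled completion of the oldest running job.
    fn complete_oldest(&mut self) {
        let cost = self.running.remove(0);
        self.capacity += cost;
        self.pump();
    }
}

fn main() {
    let mut q = SketchQueue::new(4);
    q.submit(3); // fits: goes straight to running
    q.submit(2); // exceeds remaining credits: stays waiting
    assert_eq!((q.running.len(), q.waiting.len()), (1, 1));
    q.complete_oldest(); // frees 3 credits, pulls the waiter in
    assert_eq!((q.running.len(), q.waiting.len()), (1, 0));
}
```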


>  By extension, ensures that the queue does not
> die while we have enqueued jobs.

See below.

> 
> 
> > +    done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
> > +    // The hardware_fence can by definition only be set at an unknown point in
> > +    // time.
> > +    // TODO: Think about replacing this with a `struct RunningJob` which consumes
> > +    // an `EnqueuedJob`.
> > +    hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
> > +                                                 // without data.
> > +    nr_of_deps: u32,
> > +}
> > +
> > +impl<T> EnqueuedJob<T> {
> > +    fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
> > +        let pseudo_data: i32 = 42;
> > +        let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
> > +
> > +        ListArc::pin_init(try_pin_init!(Self {
> > +            inner,
> > +            links <- ListLinksSelfPtr::new(),
> > +            done_fence,
> > +            hardware_fence: None,
> > +            nr_of_deps: 0, // TODO implement
> > +        }), GFP_KERNEL)
> > +    }
> > +}
> > +
> > +impl_list_arc_safe! {
> > +    impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
> > +}
> > +
> > +impl_list_item! {
> > +    impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
> > +}
> > +
> > +// Callback item for the hardware fences to wake / progress the jobqueue.
> > +struct HwFenceWaker<T> {
> > +    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> 
> Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
> perhaps be called JobQueueInner)?
> 
> This way, the user can have this:
> 
> struct JobQueue(Arc<JobqueueInner>);
> 
> When the user drops the JobQueue, it will schedule whatever teardown
> operations,
> 

What kind of operation would that be? Completing all running_jobs?
Completing all waiting_jobs? Completing all running_jobs and canceling
all waiting_jobs? etc.


>  but the inner queue will not go out of scope, guaranteeing that
> there is no UAF at least at this level.
> 
> You can create circular references to keep the JobQueueInner alive for as long
> as the teardown operation is taking place:
> 
> struct SomeStructUsedForCleanup {
>   Arc<JobQueueInner> queue;
>   // ... more stuff
> }
> 
> struct JobQueueInner {
>  KVec<Arc<SomeStructUsedForCleanup>> cleanups;
> }
> 
> Given this cycle, both the queue and whatever structs you need for cleanup will
> remain alive indefinitely. At some point, once whatever cleanup completes, you
> can break the cycle:
> 
> impl Drop for SomeStructUsedForCleanup {
>   fn drop(...) {
>     self.queue.cleanups.remove(self)
>   }
> }
> 
> Once all the cleanups complete, the JobQueueInner will drop.

Whether your design approach has advantages depends on the above
question of what "cleanup" means to you?

> 
> Note that I'd expect this struct I “invented" to be a DmaFenceCb representing a
> pending dependency or a job that is already on the ring.
> 
> > +    job: ListArc<EnqueuedJob<T>>,
> > +}
> > 

[…]

> > +    fn update_capacity(&mut self, cost: u32) {
> > +        self.capacity -= cost;
> > +    }
> > +
> > +
> > +    // Called by the hw_fence callbacks, dependency callbacks, and submit_job().
> > +    // TODO: does submit_job() ever have to call it?
> 
> Hm, yeah, I’d say so.

Yup. That comment is a relic.

> 
> > +    fn start_submit_worker(&mut self) {
> > +        if self.submit_worker_active {
> > +            return;
> > +        }
> > +
> > +        // TODO run submit work item
> > +
> > +        self.submit_worker_active = true;
> > +    }
> > 

[…]

> > +    /// Submit a job to the jobqueue.
> > +    ///
> > +    /// The jobqueue takes ownership over the job and later passes it back to the
> > +    /// driver by reference through the driver's run_job callback. Jobs are
> > +    passed back by reference instead of by value partially to allow for a
> > +    job resubmission mechanism to be added later to [`Jobqueue`].
> > +    ///
> > +    /// Jobs get run and their done_fences get signalled in submission order.
> > +    ///
> > +    /// Returns the "done_fence" on success, which gets signalled once the
> > +    /// hardware has completed the job and once the jobqueue is done with a job.
> > +    pub fn submit_job<U>(&self, job: Pin<KBox<Job<U>>>) -> Result<ARef<DmaFence<i32>>> {
> > +        let job_cost = job.credits;
> > +        // TODO: It would be nice if the done_fence's seqno actually matches the
> > +        // submission order. To do that, however, we'd need to protect job
> > +        // creation with InnerJobqueue's spinlock. Is that worth it?
> 
> Can you guarantee that the seqno will not go backwards?

As pointed out in the other thread, that could currently happen if a
driver calls submit_job() with >1 thread.

IOW, *done_fence* seqnos could end up being enqueued like this

42 43 45 44 46

By taking the lock that could be prevented. However, that's only a
virtual or tiny win, because jobs could then still end up being
submitted in an order not desired by the driver, just with correct
done_fence seqno order.

The JQ executes jobs in the order they were submitted. The fundamental
question is really: should the JQ care, and what should it do, if a
driver spams submit_job() asynchronously?

I tend to think that there is not really much we can do about that.
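For what the lock would buy: a plain-Rust sketch (hypothetical names, std Mutex standing in for the spinlock) of allocating the seqno and enqueuing under the same lock, which makes queue order and seqno order agree regardless of how many threads call submit concurrently.

```
// Sketch: seqno allocation and enqueue under one lock. Because both happen
// in the same critical section, the queue can never hold seqnos out of
// order, whatever the thread interleaving.
use std::sync::{Arc, Mutex};
use std::thread;

struct SketchCtx {
    next_seqno: u64,
    queue: Vec<u64>, // seqnos in enqueue order
}

fn submit(ctx: &Arc<Mutex<SketchCtx>>) -> u64 {
    // Atomic w.r.t. other submitters: no other thread can allocate a seqno
    // between our allocation and our enqueue.
    let mut c = ctx.lock().unwrap();
    let seqno = c.next_seqno;
    c.next_seqno += 1;
    c.queue.push(seqno);
    seqno
}

fn main() {
    let ctx = Arc::new(Mutex::new(SketchCtx { next_seqno: 42, queue: Vec::new() }));
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let ctx = Arc::clone(&ctx);
            thread::spawn(move || { submit(&ctx); })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let guard = ctx.lock().unwrap();
    // Enqueue order matches seqno order: strictly increasing.
    assert!(guard.queue.windows(2).all(|w| w[0] < w[1]));
}
```

Without the lock spanning both steps, two threads can allocate seqnos 44 and 45 and then enqueue them in the opposite order, which is exactly the 42 43 45 44 46 situation above.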


> > +impl Drop for Jobqueue {
> > +    fn drop(&mut self) {
> > +        // The hardware fences might outlive the jobqueue. So hw_fence callbacks
> > +        // could very well still call into job queue code, resulting in
> > +        // data UAF or, should the jobqueue code be unloaded, even code UAF.
> 
> Not if they reference JobQueueInner as I proposed above.
> 
> > +        //
> > +        // Thus, the jobqueue needs to be cleanly decoupled from the hardware
> > +        // fences when it drops, in other words, it needs to deregister all its
> > +        // hw_fence callbacks.
> > +        //
> > +        // This, however, could easily deadlock when a hw_fence signals:
> > +        //
> > +        // Step     |   Jobqueue step               |   hw_fence step
> > +        // ------------------------------------------------------------------
> > +        // 1        |   JQ starts drop              |   fence signals
> > +        // 2        |   JQ lock taken               |   fence lock taken
> > +        // 3        |   Tries to take fence lock    |   Tries to take JQ lock
> > +        // 4        |   ***DEADLOCK***              |   ***DEADLOCK***
> > +        //
> > +        // In order to prevent deadlock, we first have to revoke access to the
> > +        // JQ so that all fence callbacks can't try to take the lock anymore,
> > +        // and then deregister all JQ callbacks.
> > +        self.inner.revoke();
> > +
> > +        /*
> > +        let guard = self.inner.lock();
> > +        for job in self.inner.waiting_jobs {
> > +            job.deregister_dep_fences();
> > +        }
> > +        for job in self.inner.running_jobs {
> > +            job.deregister_hw_fence();
> > +        }
> > +        */
> 
> Under my proposal above, you can also wait on dependencies if you want: the
> drop() thread will not be blocked.

Maybe (I'd have to look deeper into the idea).

But what for? When someone drops their jobqueue, one would like to
think that they don't care about the pending jobs anymore anyway. So
the main thing you need to guarantee is that userspace gets unblocked
by signaling all fences.


Note that we had very similar discussions when solving the memory leaks
in drm_sched_fini(). The TL;DR of those discussions was:

 * Refcounting drm_sched so that it can outlive drm_sched_fini() means
   that it will continue calling into the driver with the driver
   callbacks -> UAF
 * Waiting could cause you to block SIGKILL
 * The sanest way to go was deemed to be to signal everything in the
   pending_list synchronously. Once you've done this, you know for sure
   that everything is done and clean.


AFAICS, your proposal might still have the problem of JQ continuously
calling into driver code?

I think the proposed solution is very clean: when you drop, decouple JQ
and driver by 100%, stop everything, tear everything down. At least
that's what drm_sched_fini() should have been from the beginning.
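A minimal model of that drop sequence (plain Rust, std Mutex standing in for the kernel's Revocable + SpinLock, names hypothetical): revoke access first, so any hw_fence callback arriving afterwards fails to get at the JQ and simply bails out; only then tear the rest down.

```
// Sketch: revoke-then-teardown. A late fence callback that fires after
// revoke() sees "gone" and returns without touching the queue, so teardown
// can proceed without racing against it.
use std::sync::Mutex;

struct Inner {
    callbacks_fired: u32,
}

struct Revocable(Mutex<Option<Inner>>);

impl Revocable {
    // A fence callback first tries to access the queue; after revoke()
    // this fails and the callback simply returns.
    fn try_access(&self) -> bool {
        match self.0.lock().unwrap().as_mut() {
            Some(inner) => {
                inner.callbacks_fired += 1;
                true
            }
            None => false,
        }
    }

    fn revoke(&self) {
        *self.0.lock().unwrap() = None;
    }
}

fn main() {
    let jq = Revocable(Mutex::new(Some(Inner { callbacks_fired: 0 })));
    assert!(jq.try_access());  // callback before drop: runs normally
    jq.revoke();               // drop path: revoke access first...
    assert!(!jq.try_access()); // ...so a late hw_fence callback bails out
    // ...now deregister the remaining callbacks and signal all fences.
}
```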


P.
