Message-ID: <94278c46a48207c3f71cd127daa82d69d7d869c2.camel@mailbox.org>
Date: Fri, 28 Nov 2025 11:07:26 +0100
From: Philipp Stanner <phasta@...lbox.org>
To: Daniel Almeida <daniel.almeida@...labora.com>, phasta@...nel.org
Cc: Alice Ryhl <aliceryhl@...gle.com>, Danilo Krummrich <dakr@...nel.org>,
Christian König <ckoenig.leichtzumerken@...il.com>,
Tvrtko Ursulin <tursulin@...ulin.net>, Alexandre Courbot
<acourbot@...dia.com>, Boris Brezillon <boris.brezillon@...labora.com>,
Dave Airlie <airlied@...hat.com>, Lyude Paul <lyude@...hat.com>, Peter
Colberg <pcolberg@...hat.com>, dri-devel@...ts.freedesktop.org,
linux-kernel@...r.kernel.org, rust-for-linux@...r.kernel.org
Subject: Re: [RFC WIP 3/3] rust/drm: Add initial jobqueue sceleton
On Thu, 2025-11-27 at 11:13 -0300, Daniel Almeida wrote:
>
>
> > On 25 Nov 2025, at 10:20, Philipp Stanner <phasta@...lbox.org> wrote:
> >
> > On Mon, 2025-11-24 at 10:58 -0300, Daniel Almeida wrote:
> > > Hi Phillip,
> > >
> > > > On 18 Nov 2025, at 10:25, Philipp Stanner <phasta@...nel.org> wrote:
> > > >
> > > >
> >
> > […]
> >
> > > > +use crate::{
> > > > + prelude::*,
> > > > + types::ARef,
> > > > +};
> > > > +use kernel::sync::{Arc, SpinLock, new_spinlock, DmaFence, DmaFenceCtx, DmaFenceCb, DmaFenceCbFunc};
> > > > +use kernel::list::*;
> > > > +use kernel::revocable::Revocable;
> > > > +
> > > > +
> > > > +#[pin_data]
> > > > +pub struct Job<T: ?Sized> {
> > > > + credits: u32,
> > > > +// dependencies: List, // TODO implement dependency list
> > >
> > > I am assuming that this will be a list of callbacks?
> >
> > That's supposed to become the list of DmaFence's which are to be
> > treated as dependencies of this job.
> >
> > Only once all fences in this list are signaled the JQ will push that
> > job.
>
> Ok, I was approaching this from the current DRM scheduler design, which (IIRC)
> uses callbacks to represent dependencies.
>
Depending on what "represent" means, it's the same here. A dependency
is some external fence, by definition one that doesn't come from the
same jobqueue / entity. To know when the dependency has been fulfilled,
you have to register one of your own callbacks on the fence to wake
yourself up.
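Roughly like this, to make it concrete (just a sketch: only
Revocable::try_access() is a real API here, the DmaFenceCbFunc shape
and wake_job_if_ready() are made up):

```
// Sketch of a dependency waker; all names are hypothetical.
struct DependencyWaker {
    jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
}

impl DmaFenceCbFunc for DependencyWaker {
    fn callback(self: Pin<KBox<Self>>) {
        // The dependency fence has signaled. If the jobqueue still
        // exists, take its lock and check whether the waiting job has
        // now become runnable (i.e., all of its dependencies signaled).
        if let Some(inner) = self.jobq.try_access() {
            inner.lock().wake_job_if_ready();
        }
    }
}
```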
> IOW: if you managed to register a
> callback on a dependency, it means that it hasn’t signaled yet.
Yep.
>
> In any case, that was just me trying to understand this better. IMHO, feel free
> to use anything you think it’s best here, like the whole DmaFence struct.
I think what might be confusing a bit is that we're used to drm/sched,
where we have entities and the scheduler instance itself (and
runqueues). Jobqueue is entity and "scheduler" in one piece, so some
functionality that we were used to seeing in sched_entity is now in
Jobqueue.
>
[…]
> > .
> >
> > > > + Ok(())
> > > > + }
> > > > +
> > > > + /// Add a [`DmaFence`] or a [`DoneFence`] as this job's dependency. The job
> > > > + /// will only be executed after that dependency has been finished.
> > > > + pub fn add_dependency() -> Result {
> > >
> > > Which would let us remove this ^
> >
> > It would allow for removing this function, but you'd then just have an
> > optional (some jobs have no dependencies) function parameter in
> > DmaFence::submit_job().
>
> >
> > The current idea looks like this:
> >
> > ```
> > let jobq = JobQueue::new(…);
> > let job = Job::new(driver_data);
> >
> > job.add_dependency(done_fence_of_shader_in_another_context); // optional
> > job.add_callback(cb_that_will_wake_userspace_or_sth); // optional
> >
> > let done_fence = jobq.submit_job(job)?;
> > ```
> >
> > The JQ eats the job (ownership transfer), so by design you have to set
> > all dependencies and specify everything that shall be done when the job
> > finishes _before_ submitting the job.
> >
> > I think an API in this form makes the order of events very obvious to
> > the user?
> >
>
>
> You’d pass a
>
> fn submit(…, dependencies: &[DmaFence], callbacks: &[Callback])
>
> This way a user cannot submit a job without being explicit about dependencies
> and callbacks, i.e.: it cannot be forgotten, while still being optional.
Hm. Yoah.
IDK, IMO semantically it's cleaner to have a job and methods that work
on the job, defining its characteristics and its consequences, and a
function that pushes jobs.
Your idea obviously works, and what you state might be an advantage. As
long as we're in the RFC phase I think we can keep my draft for now and
gather more feedback from the others. But it's probably no big deal
either way.
>
> > What happens then behind the scenes is that the JQ registers all the
> > callbacks on the done_fence returned above. I'm not super sure about
> > this design idea; it's certainly optional. However, it has the
> > advantage of freeing the JQ user from dealing with races of done_fence.
> >
> > Otherwise one would have to do something like
> >
> > ```
> > let done_fence = jobq.submit_job(job)?;
> >
> > let err = done_fence.register_callback(my_drivers_cb);
> > if err.was_race_and_is_already_signaled() {
> > execute_cb_code_myself_now();
> > }
> > ```
> >
> >
> > >
> > > > + // TODO: Enqueue passed DmaFence into the job's dependency list.
> > > > + Ok(())
> > > > + }
> > > > +
> > > > + /// Check if there are dependencies for this job. Register the jobqueue
> > > > + /// waker if yes.
> > > > + fn arm_deps() -> Result {
> > >
> > > I wonder if “check_dependencies” would be a better name? Or something
> > > along these lines.
> >
> > ACK.
> >
> > >
> > > > + // TODO: Register DependencyWaker here if applicable.
> > > > + Ok(())
> > > > + }
> > > > +}
> > > > +
> > > > +// Dummy trait for the linked list.
> > > > +trait JobData {
> > >
> > > > + fn access_data(&self) -> i32;
> > >
> > > Can’t we dereference to the data?
> >
> > That's dummy code that only exists because I so far am failing with
> > even getting the basic List to work.
> >
> > >
> > > > +}
> > > > +
> > > > +#[pin_data]
> > > > +struct EnqueuedJob<T: ?Sized> {
> > > > + inner: Pin<KBox<Job<T>>>,
> > > > + #[pin]
> > > > + links: ListLinksSelfPtr<EnqueuedJob<dyn JobData>>,
> > >
> > > Why not a KVec? A queue type can hold a KVec of enqueued jobs, and this can
> > > hold an Arc of the queue type.
> >
> > My understanding is that KVec is not intended to be the data structure
> > for this?
> >
> > KVec is basically like a realloc() in C, an array of same sized
> > elements.
> >
> > The JQ, hypothetically, can hold an infinite amount of members in its
> > waiting_list, only the running_list is limited by the credit count.
>
> Then I'd pre-allocate or realloc() as needed. You can reuse the empty slots, so
> there won't be a unbounded growth. realloc() also looks fine, because it will
> happen outside of the signaling path.
>
> My point is that writing your own data structure adds complexity. Your call,
> though.
My impression of the kernel is that it's uncommon to use arrays like
that.
I think your idea is charming because LinkedList is so astonishingly
complex (from my POV). I'm iterating on DmaFence right now. Once that
is done I want to pick up the List topic and investigate your
proposal.
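If I got your idea right, it would be roughly this (a sketch with
made-up names; only KVec::push() and GFP_KERNEL are real):

```
// Hypothetical KVec-backed waiting list with slot reuse. `QueuedJob`
// stands in for whatever owned job representation we end up with.
struct QueuedJob;

struct InnerJobqueue {
    // Option slots, so a finished job's slot can be reused without
    // shifting the remaining entries around.
    waiting_jobs: KVec<Option<QueuedJob>>,
}

impl InnerJobqueue {
    fn enqueue(&mut self, job: QueuedJob) -> Result {
        // Reuse an empty slot if there is one ...
        if let Some(slot) = self.waiting_jobs.iter_mut().find(|s| s.is_none()) {
            *slot = Some(job);
            return Ok(());
        }
        // ... otherwise grow the vector. This may reallocate, but only
        // on the submission path, never in the fence signaling path.
        self.waiting_jobs.push(Some(job), GFP_KERNEL)?;
        Ok(())
    }
}
```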
>
> >
> >
> > > By extension, ensures that the queue does not
> > > die while we have enqueued jobs.
> >
> > See below.
> >
> > >
> > >
> > > > + done_fence: ARef<DmaFence<i32>>, // i32 is just dummy data. TODO: allow for replacing with `()`
> > > > + // The hardware_fence can by definition only be set at an unknown point in
> > > > + // time.
> > > > + // TODO: Think about replacing this with a `struct RunningJob` which consumes
> > > > + // an `EnqueuedJob`.
> > > > + hardware_fence: Option<ARef<DmaFence<i32>>>, // i32 is dummy data until there's DmaFence
> > > > + // without data.
> > > > + nr_of_deps: u32,
> > > > +}
> > > > +
> > > > +impl<T> EnqueuedJob<T> {
> > > > + fn new(inner: Pin<KBox<Job<T>>>, fctx: &Arc<DmaFenceCtx>) -> Result<ListArc<Self>> {
> > > > + let pseudo_data: i32 = 42;
> > > > + let done_fence = fctx.as_arc_borrow().new_fence(pseudo_data)?;
> > > > +
> > > > + ListArc::pin_init(try_pin_init!(Self {
> > > > + inner,
> > > > + links <- ListLinksSelfPtr::new(),
> > > > + done_fence,
> > > > + hardware_fence: None,
> > > > + nr_of_deps: 0, // TODO implement
> > > > + }), GFP_KERNEL)
> > > > + }
> > > > +}
> > > > +
> > > > +impl_list_arc_safe! {
> > > > + impl{T: ?Sized} ListArcSafe<0> for EnqueuedJob<T> { untracked; }
> > > > +}
> > > > +
> > > > +impl_list_item! {
> > > > + impl ListItem<0> for EnqueuedJob<dyn JobData> { using ListLinksSelfPtr { self.links }; }
> > > > +}
> > > > +
> > > > +// Callback item for the hardware fences to wake / progress the jobqueue.
> > > > +struct HwFenceWaker<T> {
> > > > + jobq: Arc<Revocable<SpinLock<InnerJobqueue>>>,
> > >
> > > Instead of a Revocable, why not keep an Arc of InnerJobQueue (which should
> > > perhaps be called JobQueueInner)?
> > >
> > > This way, the user can have this:
> > >
> > > struct JobQueue(Arc<JobqueueInner>);
> > >
> > > When the user drops the JobQueue, it will schedule whatever teardown
> > > operations,
> > >
> >
> > What kind of operation would that be? Completing all running_jobs?
> > Completing all waiting_jobs? Completing all running_jobs and canceling
> > all waiting_jobs? etc.
> >
>
> The same as the current DRM scheduler, i.e.:
>
> static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> struct dma_fence_cb *cb)
>
> My understanding is that the JobQueue will follow a similar pattern?
We have to be careful with taking inspiration from drm/sched. We can
learn from it and it probably contains a few good ideas, but there's a
reason why we're writing something new instead of putting Rust
abstractions on top :)
._.
The fundamental design problem spawning most of the other problems is
that drm/sched has two job lists in two data structures, and these two
have different lifetimes: entities and pending_list (in sched itself).
Add GPU resets and teardown order on top and you've got an explosive
cocktail.
Jobqueue now, as stated above, can contain all job lists in itself,
partly thanks to the 1:1 relationship between hardware rings and
jobqueues.
I'm not even entirely sure, but I think that
drm_sched_entity_kill_jobs_work() exists mostly because of another
drastic design mistake: the existence of the free_job() callback. You
cannot call a driver callback from the driver's execution context (same
thread); mostly because of locking issues I think.
It also signals the s_fence, which we don't have in that form; we just
have the done_fence.
I think this shows well why Jobqueue is such a great simplification
compared to drm/sched. We don't have separate entities with separate
lifetimes; we don't have jobs with hybrid ownership (created by the
driver, cleaned up by drm/sched) which make that entity kill mechanism
necessary. All we need to do when JQ gets killed (drop) is:
1. Signal all still waiting or running jobs' done_fences with an error.
2. Decouple JQ from the hardware_fences and dependency_fences.
3. Stop / clean up the JQ's resources (so far only a work_item).
And other, quite complicated cleanup work from drm_sched, such as
getting the refcounts right or free_job()ing the jobs, is done
automatically for us by Rust.
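In (pseudo-)Rust, roughly this (only inner.revoke() is in the RFC;
everything else is a placeholder name):

```
impl Drop for Jobqueue {
    fn drop(&mut self) {
        // Revoke access first so that no fence callback can take the JQ
        // lock anymore; revoke() also waits for an RCU grace period.
        self.inner.revoke();

        // Placeholder methods below; nothing of this exists yet.
        for job in self.take_all_pending_jobs() {
            // Decouple from the hardware_fences and dependency_fences.
            job.deregister_fence_callbacks();
            // Unblock userspace and other queues depending on this job.
            job.done_fence.signal_with_error(ECANCELED);
        }

        // Stop / clean up the JQ's resources (so far only a work_item).
        // Refcounts and job memory are then released automatically.
    }
}
```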
>
> >
> > >
[…]
>
> >
> >
> > > > +impl Drop for Jobqueue {
> > > > + fn drop(&mut self) {
> > > > + // The hardware fences might outlive the jobqueue. So hw_fence callbacks
> > > > + // could very well still call into job queue code, resulting in
> > > > + // data UAF or, should the jobqueue code be unloaded, even code UAF.
> > >
> > > Not if they reference JobQueueInner as I proposed above.
> > >
> > > > + //
> > > > + // Thus, the jobqueue needs to be cleanly decoupled from the hardware
> > > > + // fences when it drops, in other words, it needs to deregister all its
> > > > + // hw_fence callbacks.
> > > > + //
> > > > + // This, however, could easily deadlock when a hw_fence signals:
> > > > + //
> > > > + // Step | Jobqueue step | hw_fence step
> > > > + // ------------------------------------------------------------------
> > > > + // 1 | JQ starts drop | fence signals
> > > > + // 2 | JQ lock taken | fence lock taken
> > > > + // 3 | Tries to take fence lock | Tries to take JQ lock
> > > > + // 4 | ***DEADLOCK*** | ***DEADLOCK***
> > > > + //
> > > > + // In order to prevent deadlock, we first have to revoke access to the
> > > > + // JQ so that all fence callbacks can't try to take the lock anymore,
> > > > + // and then deregister all JQ callbacks.
> > > > + self.inner.revoke();
> > > > +
> > > > + /*
> > > > + let guard = self.inner.lock();
> > > > + for job in self.inner.waiting_jobs {
> > > > + job.deregister_dep_fences();
> > > > + }
> > > > + for job in self.inner.running_jobs {
> > > > + job.deregister_hw_fence();
> > > > + }
> > > > + */
> > >
> > > Under my proposal above, you can also wait on dependencies if you want: the
> > > drop() thread will not be blocked.
> >
> > Maybe (I'd have to look deeper into the idea)
> >
> > But what for? When someone drops his jobqueue, one would like to think
> > that he doesn't care about all pending jobs anymore anyways. So the
> > main thing you need to guarantee is that userspace gets unblocked by
> > signaling all fences.
>
> I was basically trying to recreate this:
>
> static void drm_sched_entity_kill_jobs_cb(struct dma_fence *f,
> struct dma_fence_cb *cb)
> {
> struct drm_sched_job *job = container_of(cb, struct drm_sched_job,
> finish_cb);
> unsigned long index;
>
> dma_fence_put(f);
>
> /* Wait for all dependencies to avoid data corruptions */
> xa_for_each(&job->dependencies, index, f) {
We also decouple from the jobs' dependencies in drop(). As for other
parties which have jobs inside this JQ as their dependencies: they will
get unblocked by the done_fences getting signaled with -ECANCELED.
What we could think about is whether something like
jq.wait_to_become_idle(timeout);
could make sense. In drm/sched, such a proposal was mostly rejected
because all drivers have their own solutions for that currently AFAIK
(Nouveau has a waitqueue, but I think Nouveau actually doesn't care
about what happens to pending userspace jobs when drm_sched_fini() is
reached anyway; the waitqueue exists for page table cleanup work).
Also, Christian pointed out that it could block SIGKILL when the
hardware is really slow. But an interruptible version with a timeout
could make sense, I think?
It'll depend a bit on when a Rust driver really drop()s its JQ. In
principle, the driver should signal all hw_fences before drop()ing it.
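Usage-wise I'd imagine something like this (entirely hypothetical;
neither the method nor its signature exist yet):

```
// Hypothetical usage sketch; wait_to_become_idle() does not exist.
fn shutdown_ring(jobq: Jobqueue) {
    // Give in-flight jobs a bounded chance to finish so that drop() has
    // nothing left to cancel. Interruptible and with a timeout, so it
    // can neither block SIGKILL forever nor hang on wedged hardware.
    let _ = jobq.wait_to_become_idle(Delta::from_secs(2));

    // drop() then revokes the JQ and signals whatever is still pending
    // with an error.
    drop(jobq);
}
```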
>
> >
> >
> > Note that we had very similar discussions when solving the memory leaks
> > in drm_sched_fini(). The TL;DR of those discussions was:
> >
> > * Refcounting drm_sched so that it can outlive drm_sched_fini() means
> > that it will continue calling into the driver with the driver
> > callbacks -> UAF
> > * Waiting could cause you to block SIGKILL
> > * The sanest way to go was deemed to be to signal everything in the
> > pending_list synchronously. Once you've done this, you know for sure
> > that everything is done and clean.
> >
> >
> > AFAICS, your proposal might still have the problem of JQ continuously
> > calling into driver code?
>
> You’re basically calling wait() and signal(), but not run(). On top of
> that, I don’t think the callbacks can actually reach the driver without
> taking an extra refcount on some driver structure (iow: we should require the
> callbacks to be ’static). So, IIUC, no, this would not call into the
> driver.
Rust's safety guarantees are limited here AFAIU, because the language
doesn't understand that 'static stuff can be unloaded in the kernel.
So if the driver got unloaded but the JQ lived on, the JQ might at
least call into an unloaded text segment / unloaded code. Or am I
mistaken?
Anyway, I believe that synchronous solutions are always to be
preferred, because they are just simpler, provide a fixed
synchronization point and are easier for programmers to wrap their
heads around.
IMO we should only make that async if it's necessary. But currently,
drop() would just wait for an RCU grace period (the revoke()) and then
signal all done_fences. As far as I understand the dma_fence contract,
no one may register anything blocking on dma_fences, so that should be
fine.
But correct me if I'm wrong; dma_fences are very complicated...
Greetings,
P.