Message-Id: <DCKUEPRW2VR8.26NJRPDYG2DFK@nvidia.com>
Date: Fri, 05 Sep 2025 20:50:36 +0900
From: "Alexandre Courbot" <acourbot@...dia.com>
To: "Alistair Popple" <apopple@...dia.com>,
 <dri-devel@...ts.freedesktop.org>, <dakr@...nel.org>
Cc: "Miguel Ojeda" <ojeda@...nel.org>, "Alex Gaynor"
 <alex.gaynor@...il.com>, "Boqun Feng" <boqun.feng@...il.com>, "Gary Guo"
 <gary@...yguo.net>, Björn Roy Baron
 <bjorn3_gh@...tonmail.com>, "Benno Lossin" <lossin@...nel.org>, "Andreas
 Hindborg" <a.hindborg@...nel.org>, "Alice Ryhl" <aliceryhl@...gle.com>,
 "Trevor Gross" <tmgross@...ch.edu>, "David Airlie" <airlied@...il.com>,
 "Simona Vetter" <simona@...ll.ch>, "Maarten Lankhorst"
 <maarten.lankhorst@...ux.intel.com>, "Maxime Ripard" <mripard@...nel.org>,
 "Thomas Zimmermann" <tzimmermann@...e.de>, "John Hubbard"
 <jhubbard@...dia.com>, "Joel Fernandes" <joelagnelf@...dia.com>, "Timur
 Tabi" <ttabi@...dia.com>, <linux-kernel@...r.kernel.org>,
 <nouveau@...ts.freedesktop.org>, "Nouveau"
 <nouveau-bounces@...ts.freedesktop.org>
Subject: Re: [PATCH 05/10] gpu: nova-core: gsp: Add GSP command queue
 handling

Hi Alistair,

Here is a second pass on things not directly related to bindings.

One general comment is that we will want more documentation about how
the command queue operates; without it, it is a bit difficult to
understand how things run and who can read or write what. I hope we can
improve the last point through the introduction of a few more
types/abstractions.

Another thing we need to do is reduce the number of unsafe statements
and justify them more strongly. I've seen 22 new unsafe statements in
this patch that are not related to the implementation of FromBytes or
AsBytes. I'm pretty sure that with the right abstractions in place we
can remove most of them, and feel more confident about those that
remain.

On Wed Aug 27, 2025 at 5:20 PM JST, Alistair Popple wrote:
<snip>
> +/// Number of GSP pages making the Msgq.
> +const MSGQ_NUM_PAGES: usize = 0x3f;
> +
> +#[repr(C, align(0x1000))]
> +#[derive(Debug)]
> +struct MsgqData {
> +    data: [[u8; GSP_PAGE_SIZE]; MSGQ_NUM_PAGES],
> +}
> +
> +// Annoyingly there is no real equivalent of #define so we're forced to use a
> +// literal to specify the alignment above. So check that against the actual GSP
> +// page size here.
> +static_assert!(align_of::<MsgqData>() == GSP_PAGE_SIZE);
> +
> +// There is no struct defined for this in the open-gpu-kernel-source headers.
> +// Instead it is defined by code in GspMsgQueuesInit().
> +#[repr(C)]
> +#[derive(Debug)]
> +struct Msgq {
> +    tx: MsgqTxHeader,
> +    rx: MsgqRxHeader,
> +    msgq: MsgqData,
> +}
> +
> +#[repr(C)]
> +#[derive(Debug)]
> +struct GspMem {
> +    ptes: [u8; GSP_PAGE_SIZE],

This member appeared unused at first, but then I understood that it is
filled by `create_pte_array`. I'd suggest changing that function so it
can operate directly on `ptes`, and also introducing a similar member in
the buffer types that use it, so we know what data it operates on.

> +    cpuq: Msgq,
> +    gspq: Msgq,
> +}

This is one of the structures that would benefit from more
documentation. :) For instance, the usage of `cpuq` and `gspq` was a bit
different from what I expected.

> +
> +// SAFETY: These structs don't meet the no-padding requirements of AsBytes but
> +// that is not a problem because they are not used outside the kernel.
> +unsafe impl AsBytes for GspMem {}
> +
> +// SAFETY: These structs don't meet the no-padding requirements of FromBytes but
> +// that is not a problem because they are not used outside the kernel.
> +unsafe impl FromBytes for GspMem {}
> +
> +pub(crate) struct GspCmdq {
> +    dev: ARef<device::Device>,
> +    msg_count: u32,
> +    seq: u32,

Here as well, some doc comments would be useful. I still don't
understand what purpose `seq` is supposed to fulfil.

> +    gsp_mem: CoherentAllocation<GspMem>,
> +    pub _nr_ptes: u32,
> +}
> +
> +// A reference to a message currently sitting in the GSP command queue. May
> +// contain two slices as the command queue is a circular buffer which may have
> +// wrapped.
> +//
> +// INVARIANT: The underlying message data cannot change because the struct holds
> +// a reference to the command queue which prevents command queue manipulation
> +// until the GspQueueMessage is dropped.
> +pub(crate) struct GspQueueMessage<'a> {
> +    cmdq: &'a mut GspCmdq,
> +    rpc_header: &'a GspRpcHeader,
> +    slice_1: &'a [u8],
> +    slice_2: Option<&'a [u8]>,
> +}
> +
> +type GspQueueMessageData<'a, M> = (&'a M, Option<SBuffer<core::array::IntoIter<&'a [u8], 2>>>);
> +
> +impl<'a> GspQueueMessage<'a> {
> +    #[expect(unused)]
> +    pub(crate) fn try_as<M: GspMessageFromGsp>(&'a self) -> Result<GspQueueMessageData<'a, M>> {
> +        if self.rpc_header.function != M::FUNCTION {
> +            return Err(ERANGE);
> +        }
> +
> +        // SAFETY: The slice references the cmdq message memory which is
> +        // guaranteed to outlive the returned GspQueueMessageData by the
> +        // invariants of GspQueueMessage and the lifetime 'a.
> +        let msg = unsafe { &*(self.slice_1.as_ptr().cast::<M>()) };
> +        let data = &self.slice_1[size_of::<M>()..];
> +        let data_size =
> +            self.rpc_header.length as usize - size_of::<GspRpcHeader>() - size_of::<M>();
> +        let sbuf = if data_size > 0 {
> +            Some(SBuffer::new_reader([data, self.slice_2.unwrap_or(&[])]))
> +        } else {
> +            None
> +        };
> +
> +        Ok((msg, sbuf))
> +    }

For the message queue I think this approach mostly works (we might also
want to add a way to match against the messages enum once we have it -
this can be a future patch though).

> +
> +    #[expect(unused)]
> +    pub(crate) fn ack(self) -> Result {
> +        self.cmdq.ack_msg(self.rpc_header.length)?;
> +
> +        Ok(())
> +    }
> +}
> +
> +// The same as GspQueueMessage except the fields are mutable for constructing a
> +// message to the GSP.
> +pub(crate) struct GspQueueCommand<'a> {
> +    cmdq: &'a mut GspCmdq,
> +    msg_header: &'a mut GspMsgHeader,
> +    rpc_header: &'a mut GspRpcHeader,
> +    slice_1: &'a mut [u8],
> +    slice_2: &'a mut [u8],
> +}
> +
> +type GspQueueCommandData<'a, M> = (
> +    &'a mut M,
> +    Option<SBuffer<core::array::IntoIter<&'a mut [u8], 2>>>,
> +);
> +
> +impl<'a> GspQueueCommand<'a> {
> +    #[expect(unused)]
> +    pub(crate) fn try_as<'b, M: GspCommandToGsp>(&'b mut self) -> GspQueueCommandData<'b, M> {
> +        // SAFETY: The slice references the cmdq message memory which is
> +        // guaranteed to outlive the returned GspQueueCommandData by the
> +        // invariants of GspQueueCommand and the lifetime 'a.
> +        let msg = unsafe { &mut *(self.slice_1.as_mut_ptr().cast::<M>()) };
> +        let data = &mut self.slice_1[size_of::<M>()..];
> +        let data_size =
> +            self.rpc_header.length as usize - size_of::<GspRpcHeader>() - size_of::<M>();
> +        let sbuf = if data_size > 0 {
> +            Some(SBuffer::new_writer([data, self.slice_2]))
> +        } else {
> +            None
> +        };
> +        self.rpc_header.function = M::FUNCTION;
> +
> +        (msg, sbuf)
> +    }
> +
> +    #[expect(unused)]
> +    pub(crate) fn send_to_gsp(self, bar: &Bar0) -> Result {
> +        self.cmdq.wait_for_free_cmd_to_gsp(
> +            Delta::from_secs(GSP_COMMAND_TIMEOUT),
> +            self.rpc_header.length as usize + size_of::<GspMsgHeader>(),
> +        )?;
> +        GspCmdq::send_cmd_to_gsp(self, bar)?;
> +        Ok(())
> +    }
> +}
> +
> +impl GspCmdq {
> +    pub(crate) fn new(dev: &device::Device<device::Bound>) -> Result<GspCmdq> {
> +        let mut gsp_mem =
> +            CoherentAllocation::<GspMem>::alloc_coherent(dev, 1, GFP_KERNEL | __GFP_ZERO)?;
> +
> +        let nr_ptes = size_of::<GspMem>() >> GSP_PAGE_SHIFT;
> +        build_assert!(nr_ptes * size_of::<u64>() <= GSP_PAGE_SIZE);
> +
> +        create_pte_array(&mut gsp_mem, 0);
> +
> +        const MSGQ_SIZE: u32 = size_of::<Msgq>() as u32;
> +        const MSG_COUNT: u32 = ((MSGQ_SIZE as usize - GSP_PAGE_SIZE) / GSP_PAGE_SIZE) as u32;
> +        const RX_HDR_OFF: u32 = offset_of!(Msgq, rx) as u32;
> +        dma_write!(
> +            gsp_mem[0].cpuq.tx = MsgqTxHeader {
> +                version: 0,
> +                size: MSGQ_SIZE,
> +                entry_off: GSP_PAGE_SIZE as u32,
> +                msg_size: GSP_PAGE_SIZE as u32,
> +                msg_count: MSG_COUNT,
> +                write_ptr: 0,
> +                flags: 1,
> +                rx_hdr_off: RX_HDR_OFF,
> +            }
> +        )?;

I don't see `gspq` being initialized, is this on purpose? If it is all
zeroes, how does the GSP know its size? Or does it assume the same
parameters as the `cpuq`?

> +
> +        Ok(GspCmdq {
> +            dev: dev.into(),
> +            msg_count: MSG_COUNT,
> +            seq: 0,
> +            gsp_mem,
> +            _nr_ptes: nr_ptes as u32,
> +        })
> +    }
> +
> +    fn cpu_wptr(&self) -> u32 {
> +        // SAFETY: index `0` is valid as `gsp_mem` has been allocated accordingly, thus the access
> +        // cannot fail.
> +        unsafe { dma_read!(self.gsp_mem[0].cpuq.tx.write_ptr).unwrap_unchecked() }
> +    }
> +
> +    fn gsp_rptr(&self) -> u32 {
> +        // SAFETY: index `0` is valid as `gsp_mem` has been allocated accordingly, thus the access
> +        // cannot fail.
> +        unsafe { dma_read!(self.gsp_mem[0].gspq.rx.read_ptr).unwrap_unchecked() }
> +    }
> +
> +    fn cpu_rptr(&self) -> u32 {
> +        // SAFETY: index `0` is valid as `gsp_mem` has been allocated accordingly, thus the access
> +        // cannot fail.
> +        unsafe { dma_read!(self.gsp_mem[0].cpuq.rx.read_ptr).unwrap_unchecked() }
> +    }
> +
> +    fn gsp_wptr(&self) -> u32 {
> +        // SAFETY: index `0` is valid as `gsp_mem` has been allocated accordingly, thus the access
> +        // cannot fail.
> +        unsafe { dma_read!(self.gsp_mem[0].gspq.tx.write_ptr).unwrap_unchecked() }
> +    }

Here is an easy trick to reduce the number of unsafe statements: have a
method that returns a reference to `gsp_mem` (containing the single
unsafe part), and have these 4 methods call into it. And voilà, 3
unsafes gone. :)
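
For instance (a rough sketch - I am glossing over the volatile access
rules of `CoherentAllocation` here, so take the exact form with a grain
of salt):

    /// Returns a reference to the shared queue structure.
    fn gsp_mem(&self) -> &GspMem {
        // SAFETY: by the invariants of `CoherentAllocation`, the backing
        // memory is valid and lives at least as long as `self`.
        unsafe { &*self.gsp_mem.start_ptr() }
    }

    fn cpu_wptr(&self) -> u32 {
        self.gsp_mem().cpuq.tx.write_ptr
    }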

> +
> +    // Returns the numbers of pages free for sending an RPC to GSP.
> +    fn free_tx_pages(&self) -> u32 {
> +        let wptr = self.cpu_wptr();
> +        let rptr = self.gsp_rptr();
> +        let mut free = rptr + self.msg_count - wptr - 1;
> +
> +        if free >= self.msg_count {
> +            free -= self.msg_count;
> +        }
> +
> +        free
> +    }
> +
> +    // Returns the number of pages the GSP has written to the queue.
> +    fn used_rx_pages(&self) -> u32 {
> +        let rptr = self.cpu_rptr();
> +        let wptr = self.gsp_wptr();
> +        let mut used = wptr + self.msg_count - rptr;
> +        if used >= self.msg_count {
> +            used -= self.msg_count;
> +        }
> +
> +        used
> +    }
> +
> +    fn calculate_checksum<T: Iterator<Item = u8>>(it: T) -> u32 {
> +        let sum64 = it
> +            .enumerate()
> +            .map(|(idx, byte)| (((idx % 8) * 8) as u32, byte))
> +            .fold(0, |acc, (rol, byte)| acc ^ u64::from(byte).rotate_left(rol));
> +
> +        ((sum64 >> 32) as u32) ^ (sum64 as u32)
> +    }
> +
> +    pub(crate) fn wait_for_free_cmd_to_gsp(&self, timeout: Delta, size: usize) -> Result {
> +        wait_on(timeout, || {
> +            if self.free_tx_pages() < size.div_ceil(GSP_PAGE_SIZE) as u32 {
> +                None
> +            } else {
> +                Some(())
> +            }
> +        })
> +    }
> +
> +    #[expect(unused)]
> +    pub(crate) fn alloc_gsp_queue_command<'a>(
> +        &'a mut self,
> +        cmd_size: usize,
> +    ) -> Result<GspQueueCommand<'a>> {

The command submission appears to be modeled after the message
reception, but I don't think this works very well in this case.

First, the returned `GspQueueCommand` does not have its function set
until `try_as` is called. But meanwhile it can very well be given to
`send_cmd_to_gsp`, skipping an important step in the submission process.
So at the very least I think we should merge both functions, as it
doesn't make sense to have one step without the other anyway (and they
are always called one after the other).

But maybe we can be even more radical and do the command building and
submission in a single function, removing `GspQueueCommand` altogether.
What I have in mind is a function that reserves the required space in
the queue, creates the required mutable reference to the command and
(optionally) the writable SBuffer, and then calls the passed closure to
initialize these two before sending the command. It would be invoked
something like this:

    send_cmd::<CommandType>(|command, sbuffer| {
      // initialize `command` and write required data into `sbuffer`
    }, bar)?;

This would ensure that no step is left undone, and also match how users
will use the command queue anyway, as I don't think we want to reserve
space for a command we do not intend to send. :)
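
Concretely, the signature could look something like this (just a sketch
- I am eliding how the payload size gets communicated and the exact
error handling):

    pub(crate) fn send_cmd<M: GspCommandToGsp>(
        &mut self,
        init: impl FnOnce(&mut M, Option<SBuffer<core::array::IntoIter<&mut [u8], 2>>>) -> Result,
        bar: &Bar0,
    ) -> Result {
        // 1. Wait until the queue has enough free pages for `M` and its
        //    headers.
        // 2. Zero the reserved area and fill in the msg/rpc headers,
        //    setting the RPC function to `M::FUNCTION`.
        // 3. Hand the command struct and the payload `SBuffer` to `init`.
        // 4. Compute the checksum and publish the command by advancing the
        //    write pointer.
        todo!()
    }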

> +        const HEADER_SIZE: usize = size_of::<GspMsgHeader>() + size_of::<GspRpcHeader>();
> +        let msg_size = size_of::<GspMsgHeader>() + size_of::<GspRpcHeader>() + cmd_size;
> +        if self.free_tx_pages() < msg_size.div_ceil(GSP_PAGE_SIZE) as u32 {
> +            return Err(EAGAIN);
> +        }
> +        let wptr = self.cpu_wptr() as usize;
> +
> +        // SAFETY: By the invariants of CoherentAllocation gsp_mem.start_ptr_mut() is valid.
> +        let ptr = unsafe {
> +            core::ptr::addr_of_mut!((*self.gsp_mem.start_ptr_mut()).cpuq.msgq.data[wptr])
> +        };
> +
> +        // SAFETY: ptr points to at least one GSP_PAGE_SIZE bytes of contiguous
> +        // memory which is larger than GspMsgHeader.
> +        let msg_header_slice: &mut [u8] =
> +            unsafe { core::slice::from_raw_parts_mut(ptr.cast::<u8>(), size_of::<GspMsgHeader>()) };
> +        msg_header_slice.fill(0);
> +        let msg_header = GspMsgHeader::from_bytes_mut(msg_header_slice).ok_or(EINVAL)?;
> +        msg_header.auth_tag_buffer = [0; 16];
> +        msg_header.aad_buffer = [0; 16];
> +        msg_header.checksum = 0;
> +        msg_header.sequence = self.seq;
> +        msg_header.elem_count = (HEADER_SIZE + cmd_size).div_ceil(GSP_PAGE_SIZE) as u32;
> +        msg_header.pad = 0;
> +        self.seq += 1;
> +
> +        // SAFETY: ptr points to GSP_PAGE_SIZE bytes of memory which is larger
> +        // than both GspMsgHeader and GspRpcHeader combined.
> +        let rpc_header_slice: &mut [u8] = unsafe {
> +            core::slice::from_raw_parts_mut(
> +                ptr.cast::<u8>().add(size_of::<GspMsgHeader>()),
> +                size_of::<GspRpcHeader>(),
> +            )
> +        };
> +        rpc_header_slice.fill(0);
> +        let rpc_header = GspRpcHeader::from_bytes_mut(rpc_header_slice).ok_or(EINVAL)?;
> +        rpc_header.header_version = 0x03000000;
> +        rpc_header.signature = 0x43505256;
> +        rpc_header.length = (size_of::<GspRpcHeader>() + cmd_size) as u32;
> +        rpc_header.rpc_result = 0xffffffff;
> +        rpc_header.rpc_result_private = 0xffffffff;
> +        rpc_header.sequence = 0;
> +        rpc_header.cpu_rm_gfid = 0;
> +
> +        // Number of bytes left before we have to wrap the buffer
> +        let remaining = ((self.msg_count as usize - wptr) << GSP_PAGE_SHIFT) - HEADER_SIZE;
> +
> +        let (slice_1, slice_2) = if cmd_size <= remaining {
> +            // SAFETY: ptr points to a region of contiguous memory at least
> +            // cmd_size + HEADER_SIZE long.
> +            let slice_1: &mut [u8] = unsafe {
> +                core::slice::from_raw_parts_mut(ptr.cast::<u8>().add(HEADER_SIZE), cmd_size)
> +            };
> +            slice_1.fill(0);
> +            (slice_1, &mut [] as &mut [u8])
> +        } else {
> +            // SAFETY: ptr points to a region of contiguous memory remaining +
> +            // HEADER_SIZE bytes long.
> +            let slice_1: &mut [u8] = unsafe {
> +                core::slice::from_raw_parts_mut(ptr.cast::<u8>().add(HEADER_SIZE), remaining)
> +            };
> +            // SAFETY: By the invariants of CoherentAllocation gsp_mem.start_ptr_mut() is valid.
> +            let ptr = unsafe {
> +                core::ptr::addr_of_mut!((*self.gsp_mem.start_ptr_mut()).gspq.msgq.data[0])

I'm almost sure this should be `cpuq` instead of `gspq`.

> +            };
> +            // SAFETY: ptr points to a region of contiguous memory
> +            // self.msg_count GSP_PAGE_SIZE pages long.
> +            let slice_2: &mut [u8] =
> +                unsafe { core::slice::from_raw_parts_mut(ptr.cast::<u8>(), remaining - cmd_size) };
> +            slice_1.fill(0);

... and that you wanted to do `slice_2.fill(0)` here.

> +            (slice_1, slice_2)
> +        };

Overall the code of this method is quite difficult to follow, and its
safety statements are almost impossible to verify due to its complexity.
It manipulates many things at the same time, which makes typos like the
two I highlighted above easy to slip in.

The problems start with the creation of `ptr`:

    let ptr = unsafe {
        core::ptr::addr_of_mut!((*self.gsp_mem.start_ptr_mut()).cpuq.msgq.data[wptr])
    };

For instance, we don't know for sure that the value of `wptr` is going
to be within bounds, and we don't want to check this at runtime every
time either. I can track where `wptr` comes from and where it is
updated, and indeed if I do that it looks like it is clamped so it never
goes beyond `MSGQ_NUM_PAGES`, but verifying this is quite some labor,
and some careless future patch might change that fact, especially since
this invariant is not documented.

So we want the range of `wptr` to be defined as an invariant, and we
want that invariant to be easy to verify, and hard to break. This can be
done by wrapping the TX header inside a type that hides its members so
the write pointer cannot be accidentally changed, and providing methods
to control how it is updated:

/// TX header for setting up a command queue with the GSP.
///
/// # Invariants
///
/// [`Self::write_ptr`] is guaranteed to return a value within the range `0..MSGQ_NUM_PAGES`.
#[repr(transparent)]
#[derive(Debug)]
struct MsgqTxHeader(nvfw::MsgqTxHeader);

// SAFETY: `MsgqTxHeader` does not contain uninitialized bytes and does not
// have interior mutability.
unsafe impl AsBytes for MsgqTxHeader {}

impl MsgqTxHeader {
    ...

    /// # Invariants
    ///
    /// The returned value is contained within the range `0..MSGQ_NUM_PAGES`.
    fn write_ptr(&self) -> u32 {
        self.0.write_ptr()
    }

    /// Advances the write pointer by `elem_count` units, wrapping around the
    /// ring buffer if necessary.
    fn advance_write_ptr(&mut self, elem_count: u32) {
        // INVARIANT: the write pointer remains within the range
        // `0..MSGQ_NUM_PAGES`.
        let wptr = self.write_ptr().wrapping_add(elem_count) % MSGQ_NUM_PAGES as u32;
        self.0.set_write_ptr(wptr);

        fence(Ordering::SeqCst);
    }
}

Here all possible direct accesses to the write pointer are contained
within a page of code, and the invariant clearly documents what we can
expect for its value. So we can add a `BOUNDS:` statement to the
initialization of `ptr` invoking that invariant to justify that the
access will always be within bounds.
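
That is, something like:

    // BOUNDS: per the invariant of `MsgqTxHeader`, `wptr` is always within
    // `0..MSGQ_NUM_PAGES`, so this index cannot go out of bounds.
    let ptr = unsafe {
        core::ptr::addr_of_mut!((*self.gsp_mem.start_ptr_mut()).cpuq.msgq.data[wptr])
    };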

Another problem with `ptr` is that we create a mutable pointer over the
whole memory area shared with the GSP, and then write into it after more
casts and pointer arithmetic. How do we convince the reader that the GSP
cannot possibly be accessing the memory we are writing into at the same
time? I mean, after careful review I don't think the current code
manages to do that, but how do we demonstrate this simply?

Here again we need to rely on smaller, simple methods and precise
invariants.

The ring buffer of a queue is always split into two parts at any given
time: one owned by the driver (for read or write access, depending on
the type of the queue), and the other owned by the GSP. These areas are
delimited by the read and write pointers. In the case of the TX queue,
the driver has write-only access over the region it owns, and the GSP
can only grow that region by moving its own read pointer - so any
driver-owned region that we acquire cannot suddenly become invalid due
to GSP action.

So our first building block should be a method (of `GspMem`?) that
returns two mutable slices, covering the area of the TX ring buffer
currently owned by the driver. This method can rely on that invariant to
satisfy the safety requirement of `CoherentAllocation::as_slice_mut`
that no device access is taking place at the same time; that way, it
can confidently return clean mutable slices for the callers to use as
they want, and that should be the only unsafe statement needed to send a
message.

Callers can then check the available space on these slices, and call
`slice::split_at_mut` and `FromBytes::from_bytes_mut` to split the
slices further and cast them to the header or command type they need,
all without unsafe operations. And all of this made possible because we
confined the only unsafe operation to a small method whose safety is
easy to verify.
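
As a rough sketch of what I mean (the method name and where exactly it
should live are of course open for discussion):

    /// Returns the two slices of the TX ring buffer that are currently
    /// owned by the driver, i.e. the area between the CPU write pointer
    /// and the GSP read pointer.
    fn tx_free_slices(&mut self) -> (&mut [u8], &mut [u8]) {
        // SAFETY: by the queue ownership invariant, the GSP never accesses
        // the driver-owned region, and it cannot shrink it while the
        // mutable borrow of `self` is held, so no concurrent device access
        // can take place on the returned slices.
        ...
    }

Callers would then do something like:

    let (slice_1, _slice_2) = self.tx_free_slices();
    let (header, rest) = slice_1.split_at_mut(size_of::<GspMsgHeader>());
    let msg_header = GspMsgHeader::from_bytes_mut(header).ok_or(EINVAL)?;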

I'll stop here for now as that's already quite a bit of work. :) But
these are the essential points to fix: basically, reducing the number of
unsafe statements and building our safety context from small,
easy-to-verify methods. Most types declared in `cmdq` should have at
least an `impl` block defining their interface in simple terms, rather
than having their fields manipulated by the types that contain them.
