[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <DD66BA7N6113.3G9J4XG0RX08V@nvidia.com>
Date: Tue, 30 Sep 2025 22:36:20 +0900
From: "Alexandre Courbot" <acourbot@...dia.com>
To: "Alistair Popple" <apopple@...dia.com>, "Alexandre Courbot"
<acourbot@...dia.com>
Cc: <rust-for-linux@...r.kernel.org>, <dri-devel@...ts.freedesktop.org>,
<dakr@...nel.org>, "Miguel Ojeda" <ojeda@...nel.org>, "Alex Gaynor"
<alex.gaynor@...il.com>, "Boqun Feng" <boqun.feng@...il.com>, "Gary Guo"
<gary@...yguo.net>, Björn Roy Baron
<bjorn3_gh@...tonmail.com>, "Benno Lossin" <lossin@...nel.org>, "Andreas
Hindborg" <a.hindborg@...nel.org>, "Alice Ryhl" <aliceryhl@...gle.com>,
"Trevor Gross" <tmgross@...ch.edu>, "David Airlie" <airlied@...il.com>,
"Simona Vetter" <simona@...ll.ch>, "Maarten Lankhorst"
<maarten.lankhorst@...ux.intel.com>, "Maxime Ripard" <mripard@...nel.org>,
"Thomas Zimmermann" <tzimmermann@...e.de>, "John Hubbard"
<jhubbard@...dia.com>, "Joel Fernandes" <joelagnelf@...dia.com>, "Timur
Tabi" <ttabi@...dia.com>, <linux-kernel@...r.kernel.org>,
<nouveau@...ts.freedesktop.org>
Subject: Re: [PATCH v2 05/10] gpu: nova-core: gsp: Add GSP command queue
handling
On Tue Sep 30, 2025 at 9:36 AM JST, Alistair Popple wrote:
> On 2025-09-30 at 00:34 +1000, Alexandre Courbot <acourbot@...dia.com> wrote...
>> On Mon Sep 29, 2025 at 3:19 PM JST, Alistair Popple wrote:
>> <snip>
>> >> > +
>> >> > +/// Number of GSP pages making the Msgq.
>> >> > +pub(crate) const MSGQ_NUM_PAGES: u32 = 0x3f;
>> >> > +
>> >> > +#[repr(C, align(0x1000))]
>> >> > +#[derive(Debug)]
>> >> > +struct MsgqData {
>> >> > + data: [[u8; GSP_PAGE_SIZE]; MSGQ_NUM_PAGES as usize],
>> >> > +}
>> >> > +
>> >> > +// Annoyingly there is no real equivalent of #define so we're forced to use a
>> >> > +// literal to specify the alignment above. So check that against the actual GSP
>> >> > +// page size here.
>> >> > +static_assert!(align_of::<MsgqData>() == GSP_PAGE_SIZE);
>> >> > +
>> >> > +// There is no struct defined for this in the open-gpu-kernel-source headers.
>> >> > +// Instead it is defined by code in GspMsgQueuesInit().
>> >> > +#[repr(C)]
>> >> > +struct Msgq {
>> >> > + tx: MsgqTxHeader,
>> >> > + rx: MsgqRxHeader,
>> >> > + msgq: MsgqData,
>> >> > +}
>> >> > +
>> >> > +#[repr(C)]
>> >> > +struct GspMem {
>> >> > + ptes: [u8; GSP_PAGE_SIZE],
>> >> > + cpuq: Msgq,
>> >> > + gspq: Msgq,
>> >> > +}
>> >> > +
>> >> > +// SAFETY: These structs don't meet the no-padding requirements of AsBytes but
>> >> > +// that is not a problem because they are not used outside the kernel.
>> >> > +unsafe impl AsBytes for GspMem {}
>> >> > +
>> >> > +// SAFETY: These structs don't meet the no-padding requirements of FromBytes but
>> >> > +// that is not a problem because they are not used outside the kernel.
>> >> > +unsafe impl FromBytes for GspMem {}
>> >>
>> >> These ARE used outside the kernel, since they are shared with the GSP.
>> >> :) I'd say the reason these are safe is just because we satisfy the
>> >> requirements of AsBytes and FromBytes:
>> >
>> > Yes, and they're actually old safety comments that I missed updating to be the
>> > same as the rest.
>> >
>> >> - No initialized bytes,
>> >> - No interior mutability,
>> >> - All bytes patterns are valid (for some generous definition of
>> >> "valid" limited to not triggering UB).
>> >
>> > I was under the impression that that's all `unsafe` is really marking - code
>> > the compiler can't prove won't trigger UB. So I'd assume that's all we'd have to
>> > prove in safety comments anyway.
>>
>> A good rule of thumb for writing `SAFETY` statements is to look at the
>> `Safety` section of the unsafe code we call (here `FromBytes`), and
>> justify why our calling code satisfies each of these.
>
> Oh that's usually where I get my inspiration for them from. This was more a
> statement of my understanding of the language design, that unsafe is mostly
> about UB rather than overall struct validity. After all I think it is still
> possible to have safe code initialise fields of a struct in "invalid" ways
> (depending on the struct).
No you are correct, unsafe blocks are blocks of code where the
responsibility for preventing UB shifts from the compiler to the
developer. The statements in `FromBytes` are here to prevent UB.
For instance, if you have the following:
#[repr(u8, C)]
enum Switch {
Off = 0,
On = 1,
}
#[repr(C)]
struct Light {
switch: Switch,
intensity: u8,
}
Then the compiler can make the assumption that `switch` will only ever
be 0 or 1, and allow UB when it is not. `FromBytes` cannot be
implemented for that structure. So in that sense, struct validity must
be withheld.
What does not need to be withheld, is e.g. GSP messages that are
structurally sound from a type point of view, but don't make any sense
semantically. For these the code should produce a runtime error while
validating them - but this is not a source of UB as long as the types
they contain indeed satisfy the requirement that any bit pattern is
valid for them.
(tbh I am not sure myself why interior mutability is not allowed)
>
>> >> > + dev: ARef<device::Device>,
>> >> > + seq: u32,
>> >> > + gsp_mem: DmaGspMem,
>> >> > + pub _nr_ptes: u32,
>> >> > +}
>> >> > +
>> >> > +impl GspCmdq {
>> >> > + pub(crate) fn new(dev: &device::Device<device::Bound>) -> Result<GspCmdq> {
>> >> > + let gsp_mem = DmaGspMem::new(dev)?;
>> >> > + let nr_ptes = size_of::<GspMem>() >> GSP_PAGE_SHIFT;
>> >> > + build_assert!(nr_ptes * size_of::<u64>() <= GSP_PAGE_SIZE);
>> >> > +
>> >> > + Ok(GspCmdq {
>> >> > + dev: dev.into(),
>> >> > + seq: 0,
>> >> > + gsp_mem,
>> >> > + _nr_ptes: nr_ptes as u32,
>> >> > + })
>> >> > + }
>> >> > +
>> >> > + fn calculate_checksum<T: Iterator<Item = u8>>(it: T) -> u32 {
>> >> > + let sum64 = it
>> >> > + .enumerate()
>> >> > + .map(|(idx, byte)| (((idx % 8) * 8) as u32, byte))
>> >> > + .fold(0, |acc, (rol, byte)| acc ^ u64::from(byte).rotate_left(rol));
>> >> > +
>> >> > + ((sum64 >> 32) as u32) ^ (sum64 as u32)
>> >> > + }
>> >> > +
>> >> > + #[expect(unused)]
>> >> > + pub(crate) fn send_gsp_command<M: GspCommandToGsp>(
>> >> > + &mut self,
>> >> > + bar: &Bar0,
>> >> > + payload_size: usize,
>> >> > + init: impl FnOnce(&mut M, SBuffer<core::array::IntoIter<&mut [u8], 2>>) -> Result,
>> >>
>> >> This works pretty well for in-place initialization.
>> >>
>> >> There a couple of small drawbacks though: `M` must implement `FromBytes`
>> >> even though we only send it, and (more serious) there is no guarantee
>> >
>> > Yes, that detail annoyed me slightly too.
>> >
>> >> that `init` will fully initialize the command - a forgetful caller can
>> >> omit some of its fields, or even the whole structure, and in that case
>> >> we will send a command with what happened to be at that position of the
>> >> queue at that time.
>> >
>> > Good timing as I was just looking to see if there wasn't some way of ensuring
>> > this happened, or at least was much more explicit than what we currently do.
>> >
>> >> I think this is a good case for using the `Init` trait: it's like
>> >> `PinInit`, but without the `Pin`, and it ensures that the whole argument
>> >> is initialized. So this method would change into something like:
>> >>
>> >> pub(crate) fn send_gsp_command<M: GspCommandToGsp>(
>> >> &mut self,
>> >> bar: &Bar0,
>> >> payload_size: usize,
>> >> init: impl Init<M, Error>,
>> >> init_sbuf: impl FnOnce(SBuffer<core::array::IntoIter<&mut [u8], 2>>) -> Result,
>> >>
>> >> This also allows us to drop the `FromBytes` requirement on
>> >> `GspCommandToGsp`! But also requires us to use `unsafe` to call
>> >> `Init::__init` on the pointer to the command. I think it's worth it
>> >> though, as this removes the risk of sending partially-uninitialized
>> >> commands.
>> >
>> > Agree on it being worth the unsafe call, especially because it is unsafe if the
>> > caller doesn't do what it's supposed to regardless. But how does this contrast
>> > with `MaybeUninit`? I was wondering if something like this would work:
>> >
>> > pub(crate) fn send_gsp_command<M: GspCommandToGsp>(
>> > &mut self,
>> > bar: &Bar0,
>> > payload_size: usize,
>> > init: impl FnOnce(MaybeUninit<&mut M>, SBuffer<core::array::IntoIter<&mut [u8], 2>>) -> Result<&M>,
>> >
>> > Note I'm not sure we could actually make that work as I haven't tried it yet. I
>> > suspect I've missed something with lifetimes there :-)
>>
>> The difference with using `MaybeUninit` is that the unsafe code would
>> need to be *within* each `init` argument - IOW, within each caller. It
>> also does not provide any guarantee that the whole `M` is initialized,
>> we have to trust each caller for doing that properly.
>
> I suggested it because I was actually thinking that would be a feature :-) After
> all callers are the ones doing the initialisation and so have to be trusted to
> do that, and the unsafe call to `assume_init()` would be the indication they
> have to do something right.
>
> But I'm not particularly tied to using that, it was an idle thought/question. I
> do wonder how Init gets around callers having to guarantee the whole of `M` is
> initialised, so will have to go study it a bit more closely.
The only reason for `MaybeUninit` to be `unsafe` is because the user
might leave fields uninitialized or set to values that are invalid for
their type. The unsafe operation is here to shift the blame on the
developer who forget to set their fields properly.
With `Init`, such mistake cannot happen, as all fields must be
initialized (like a regular initialization) or you get a build error. So
we know for sure that the structure is properly initialized, and UB
cannot occur, hence no need for unsafe.
Of course the command thus initialized can still be invalid in the
semantic sense - but in this case the GSP will detect that upon
reception, and send us an error. As far as the compiler is concerned,
this is working as intended.
>> >
>> >
>> >> Then there is what to do with the `SBuffer`. I'd like to think a bit
>> >> more about this, as not all commands require it, and those that do
>> >> typically send arrays of particular types. I think we should be able to
>> >> use the type system to have more control over this, but let's keep that
>> >> for the next revision.
>> >
>> > Sure. You are correct that not all (most?) commands don't need it, and it's a
>> > shame we don't have proper bindings for them anyway. Although in some cases that
>> > still wouldn't work anyway (eg. for registry keys) where it really is just a
>> > packed variable number of variable length strings. Not sure what a Rust native
>> > representation of that would be.
>>
>> For now the best I could think of is to have not one but two traits for
>> GSP commands: one for commands with payload and another one for commands
>> without payload. And two variants of `send_gsp_command` for each case.
>> That way we can at least guarantee that we won't pass a payload where we
>> shouldn't, or forget one where we should.
>
> We can't guarantee that (at least with the current bindings) because we
> still need to manually mark up the types with the correct traits. But if the
> majority of commands don't have a payload I agree it would provide some nice
> simplifications for callers.
Yes, the idea would be that each command implements one trait or the
other depending on whether it needs an extra payload. This needs to be
done manually, but we are already manually implementing `CommandToGsp`
so this is not much different.
>> >> > +
>> >> > + // Can't fail because msg_slice will always be
>> >> > + // size_of::<GspMsgElement>() bytes long by the above split.
>> >> > + let msg_header = GspMsgElement::from_bytes(msg_header_slice).unwrap();
>> >> > + if msg_header.rpc.length < size_of::<M>() as u32 {
>> >> > + return None;
>> >> > + }
>> >>
>> >> If we have a message in the queue and this condition doesn't hold, I
>> >> don't think we can hope that it will in a further iteration - this
>> >> should be an error.
>> >
>> > I was thinking messages may get partially written and we'd see this state, but
>> > you're right - the shared pointers shouldn't be updated until the entire message
>> > is written so this should be an error.
>> >
>> > That will require the `wait_on_result()` function, because `wait_on()` doesn't
>> > allow the closure to return a failure. We could just return Option<Result<_>>
>> > from the closure but that's a bit gross, because wait_on() would then return
>> > Result<Option<Result<_>>>.
>>
>> We can now use `read_poll_timeout` (from `kernel::io::poll`) which
>> separates the acquisition of the state and the test of the condition
>> into two separate closures - the first one returns a `Result`, and its
>> error is propagated as-is, which is what we want in this case.
>
> Oh nice. That looks like it should do the job.
Note that this won't be usable in drm-rust-next until -rc1 is tagged,
but we can assume it will be usable for us.
>
>> Actually this reminded me that I should send a patch to remove `wait_on`
>> altogether now that we have a better alternative! :)
>
> I guess the implies we shouldn't add new callers of `wait_on` ... not sure
> if there are others in this series but will take a look and convert them to
> read_poll_timeout.
There are a few - if we could use read_poll_timeout proactively that
would be great.
>> >> > +pub(crate) type MsgqTxHeader = bindings::msgqTxHeader;
>> >>
>> >> This should be wrapped into a newtype that provides the exact set of
>> >> features required by the gsp module, like what is done for
>> >> `LibosMemoryRegionInitArgument`. For this type we just need two things:
>> >> return the `writePtr`, and advance it by a given value. That's all
>> >> the parent module needs to see.
>> >
>> > Except that doesn't actually work, because we need to use DMA access methods on
>> > bindings::msgqTxHeader which we can't do from an accessor.
>>
>> If the accessor uses `read_volatile` and `write_volatile` then I guess
>> it should work? These are unsafe though, so this is not ideal, but at
>> least they would be contained into the `fw` module and limited to one
>> getter and one setter.
>
> Sure, `read_volatile()` and `write_volatile` would work but then we're adding
> more unsafe calls and going around the DMA subsystem which has special
> exceptions for justifying the SAFETY comments.
Honestly I myself don't understand why we can call `dma_write!` safely
wherever we want - technically there is nothing that guarantees that
what we access is not concurrently accessed by the device.
The volatile accesses are unsafe because they operate on pointers. I'd
rather do without but we will need them if we want to keep the bindings
opaque.
>
>> >
>> >> By just aliasing this type, we make all its members visible to the `gsp`
>> >> module. This increases its dependency on a given firmware version,
>> >> carries a risk that the GSP module will mess with what it is not
>> >> supposed to, and introduces inconsistency in how we abstract the
>> >> firmware types - some are wrapped, others are not. Let's be consistent
>> >> and make all bindings completely opaque outside of `fw.rs`.
>> >
>> > I can't see how accessors in `fw.rs` for every firmware binding we use is useful
>> > though[1]? In what way does that reduce dependency on a given firmware version?
>> > I can't see how it isn't just adding boiler plate for every struct field we want
>> > to access.
>>
>> While the main point of this remains hiding stuff other modules don't
>> need to see, it also allows us to choose stable method names that are
>> independent from the firmware.
>
> So the concern is firmware will rename struct fields?
That's one of the funny things the firmware can do, amongst others. :)
And yes, I agree that if the firmware changes too much this can
invalidate the whole of the GSP module, no abstraction will ever protect
us completely against that. But `fw` can be our first line of defense,
and also a logical break similar to the role `rust/kernel` fulfills
for the larger kernel [1]: the `kernel` crate is allowed to call the
C bindings, but not the drivers. Similarly, to access our GSP bindings,
we need to go through the `fw` sanitization layer.
[1] https://docs.kernel.org/next/rust/general-information.html#abstractions-vs-bindings
So to conclude on this topic (since we also synced offline and I see you
already have a v3 in which the bindings don't leak `fw`: yes, we need a
thin shim on top of the bindings to make them opaque, and yes this
results in a bit more code overall, but it also makes the upper layer
(gsp, cmdq) more concise and easier to read. No design is set in stone,
and as we get more visibility into what we need things can perfectly
evolve, but I think this is a reasonable state for now that requires
only a very modest amount of labor.
Powered by blists - more mailing lists