linux-kernel - Re: [RFC 7/7] gpu: nova-core: load the scrubber ucode when vGPU support is enabled


Message-ID: <47c05bcf-7591-4148-8783-0c107b0c3c9d@nvidia.com>
Date: Sat, 6 Dec 2025 21:26:12 -0500
From: Joel Fernandes <joelagnelf@...dia.com>
To: Zhi Wang <zhiw@...dia.com>, rust-for-linux@...r.kernel.org,
 linux-pci@...r.kernel.org, nouveau@...ts.freedesktop.org,
 linux-kernel@...r.kernel.org
Cc: airlied@...il.com, dakr@...nel.org, aliceryhl@...gle.com,
 bhelgaas@...gle.com, kwilczynski@...nel.org, ojeda@...nel.org,
 alex.gaynor@...il.com, boqun.feng@...il.com, gary@...yguo.net,
 bjorn3_gh@...tonmail.com, lossin@...nel.org, a.hindborg@...nel.org,
 tmgross@...ch.edu, markus.probst@...teo.de, helgaas@...nel.org,
 cjia@...dia.com, alex@...zbot.org, smitra@...dia.com, ankita@...dia.com,
 aniketa@...dia.com, kwankhede@...dia.com, targupta@...dia.com,
 acourbot@...dia.com, jhubbard@...dia.com, zhiwang@...nel.org
Subject: Re: [RFC 7/7] gpu: nova-core: load the scrubber ucode when vGPU
 support is enabled

Hi Zhi,

On 12/6/2025 7:42 AM, Zhi Wang wrote:
> To support the maximum number of vGPUs on devices that support vGPU, a
> larger WPR2 heap size is required. When the WPR2 heap size is set larger
> than 256MB, the scrubber ucode image must scrub the FB memory before any
> other ucode image is executed.
> 
> Otherwise, the GSP firmware hangs during boot.
> 
> When vGPU support is enabled, execute the scrubber ucode image to scrub the
> FB memory before executing any other ucode images.
> 
[..]
>      pub(crate) const fn create(
> diff --git a/drivers/gpu/nova-core/firmware/booter.rs b/drivers/gpu/nova-core/firmware/booter.rs
> index f107f753214a..f622c9b960de 100644
> --- a/drivers/gpu/nova-core/firmware/booter.rs
> +++ b/drivers/gpu/nova-core/firmware/booter.rs
> @@ -269,6 +269,7 @@ fn new_booter(dev: &device::Device<device::Bound>, data: &[u8]) -> Result<Self>
>  
>  #[derive(Copy, Clone, Debug, PartialEq)]
>  pub(crate) enum BooterKind {
> +    Scrubber,
>      Loader,
>      #[expect(unused)]
>      Unloader,
> @@ -286,6 +287,7 @@ pub(crate) fn new(
>          bar: &Bar0,
>      ) -> Result<Self> {
>          let fw_name = match kind {
> +            BooterKind::Scrubber => "scrubber",
>              BooterKind::Loader => "booter_load",
>              BooterKind::Unloader => "booter_unload",
>          };
> diff --git a/drivers/gpu/nova-core/gsp/boot.rs b/drivers/gpu/nova-core/gsp/boot.rs
> index ec006c26f19f..8ef79433f017 100644
> --- a/drivers/gpu/nova-core/gsp/boot.rs
> +++ b/drivers/gpu/nova-core/gsp/boot.rs
> @@ -151,6 +151,33 @@ pub(crate) fn boot(
>  
>          Self::run_fwsec_frts(dev, gsp_falcon, bar, &bios, &fb_layout)?;

Could you elaborate on how the timeout below works? See my comment further down.

>  
> +        if vgpu_support {
> +            let scrubber = BooterFirmware::new(
> +                dev,
> +                BooterKind::Scrubber,
> +                chipset,
> +                FIRMWARE_VERSION,
> +                sec2_falcon,
> +                bar,
> +            )?;
> +
> +            sec2_falcon.reset(bar)?;
> +            sec2_falcon.dma_load(bar, &scrubber)?;
> +
> +            let (mbox0, mbox1) = sec2_falcon.boot(bar, None, None)?;

boot() already returns -ETIMEDOUT if SEC2 does not halt in time, via
wait_till_halted() -> read_poll_timeout().

That wait is 2 seconds, so I assume the scrubber would have completed by then.

> +
> +            dev_dbg!(
> +                pdev.as_ref(),
> +                "SEC2 MBOX0: {:#x}, MBOX1{:#x}\n",
> +                mbox0,
> +                mbox1
> +            );
> +
> +            if !regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed() {
> +                return Err(ETIMEDOUT);

Under what circumstances do you reach this point (i.e. !scrubber_completed())?
I am not sure ETIMEDOUT is the right error to return here, because boot()
already returns ETIMEDOUT if the wait for the halt times out.
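To make the two cases concrete, here is a minimal userspace Rust sketch (not driver code) of how I read the failure modes. ScrubError, Observed and check_scrubber() are hypothetical names for illustration only, and which errno the second case should map to is exactly the open question.

```rust
/// Hypothetical error type for this sketch only; the driver would use
/// kernel error codes, and picking the right one is the open question.
#[derive(Debug, PartialEq)]
enum ScrubError {
    /// SEC2 never halted: boot() already reports this as ETIMEDOUT.
    HaltTimedOut,
    /// SEC2 halted, but the scrubber did not flag completion.
    NotCompleted,
}

/// Hypothetical snapshot of what the host can observe after running the scrubber.
struct Observed {
    halted_within_timeout: bool,
    scrubber_completed: bool,
}

fn check_scrubber(obs: &Observed) -> Result<(), ScrubError> {
    // Stage 1: the bounded wait for the halt (wait_till_halted() inside boot()).
    if !obs.halted_within_timeout {
        return Err(ScrubError::HaltTimedOut);
    }
    // Stage 2: a single register read after the halt. Nothing is waited on
    // here, so failing this check means "halted without setting the bit",
    // not "ran out of time", unless the firmware can set the bit after halting.
    if !obs.scrubber_completed {
        return Err(ScrubError::NotCompleted);
    }
    Ok(())
}
```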

If you still want to return ETIMEDOUT here, then it sounds like you are waiting
for scrubbing to complete beyond the wait already done by boot(). If so,
shouldn't you use read_poll_timeout() here?

Perhaps something like:

 read_poll_timeout(
     || Ok(regs::NV_PGC6_BSI_SECURE_SCRATCH_15::read(bar).scrubber_completed()),
     |val: &bool| *val,
     Delta::from_millis(10),
     Delta::from_secs(5),
 )?;
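For reference, a rough std-only model of the read_poll_timeout() semantics this relies on (an approximation, not the kernel implementation): op is called repeatedly, sleeping sleep_delta between reads, until cond accepts the value or the timeout elapses, at which point it fails with a timeout error. PollError, FakeScratch and wait_for_scrubber() are hypothetical stand-ins; FakeScratch pretends the completion bit appears after a fixed number of reads.

```rust
use std::cell::Cell;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the kernel's ETIMEDOUT.
#[derive(Debug, PartialEq)]
enum PollError {
    TimedOut,
}

/// Approximates read_poll_timeout(): read via `op` until `cond` accepts
/// the value, sleeping `sleep_delta` between reads, or give up after `timeout`.
fn read_poll_timeout<T, Op, Cond>(
    mut op: Op,
    mut cond: Cond,
    sleep_delta: Duration,
    timeout: Duration,
) -> Result<T, PollError>
where
    Op: FnMut() -> Result<T, PollError>,
    Cond: FnMut(&T) -> bool,
{
    let start = Instant::now();
    loop {
        let val = op()?;
        if cond(&val) {
            return Ok(val);
        }
        if start.elapsed() >= timeout {
            return Err(PollError::TimedOut);
        }
        sleep(sleep_delta);
    }
}

/// Hypothetical fake of NV_PGC6_BSI_SECURE_SCRATCH_15: reports the
/// scrubber as completed once `reads_until_done` reads have happened.
struct FakeScratch {
    reads: Cell<u32>,
    reads_until_done: u32,
}

impl FakeScratch {
    fn scrubber_completed(&self) -> bool {
        let n = self.reads.get() + 1;
        self.reads.set(n);
        n >= self.reads_until_done
    }
}

/// Mirrors the suggested call: poll every 10 ms until the bit is set.
fn wait_for_scrubber(reg: &FakeScratch, timeout: Duration) -> Result<bool, PollError> {
    read_poll_timeout(
        || Ok(reg.scrubber_completed()),
        |val: &bool| *val,
        Duration::from_millis(10),
        timeout,
    )
}
```

With the 10 ms / 5 s values from the suggestion, this reads the register on the order of 500 times in the worst case before reporting a timeout.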

Thanks.