Message-ID: <20250117143928.13edc014.alex.williamson@redhat.com>
Date: Fri, 17 Jan 2025 14:39:28 -0500
From: Alex Williamson <alex.williamson@...hat.com>
To: Ankit Agrawal <ankita@...dia.com>
Cc: Jason Gunthorpe <jgg@...dia.com>, Yishai Hadas <yishaih@...dia.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "kevin.tian@...el.com"
<kevin.tian@...el.com>, Zhi Wang <zhiw@...dia.com>, Aniket Agashe
<aniketa@...dia.com>, Neo Jia <cjia@...dia.com>, Kirti Wankhede
<kwankhede@...dia.com>, "Tarun Gupta (SW-GPU)" <targupta@...dia.com>,
Vikram Sethi <vsethi@...dia.com>, Andy Currid <acurrid@...dia.com>,
Alistair Popple <apopple@...dia.com>, John Hubbard <jhubbard@...dia.com>,
Dan Williams <danw@...dia.com>, "Anuj Aggarwal (SW-GPU)"
<anuaggarwal@...dia.com>, Matt Ochs <mochs@...dia.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C
link status
On Fri, 17 Jan 2025 19:19:42 +0000
Ankit Agrawal <ankita@...dia.com> wrote:
> >> +/*
> >> + * To reduce the system bootup time, the HBM training has
> >> + * been moved out of the UEFI on the Grace-Blackwell systems.
> >> + *
> >> + * The onus of checking whether the HBM training has completed
> >> + * thus falls on the module. The HBM training status can be
> >> + * determined from a BAR0 register.
> >> + *
> >> + * Similarly, another BAR0 register exposes the status of the
> >> + * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
> >> + *
> >> + * Poll these registers for up to 30s. If the HBM training has
> >> + * not completed or the C2C link is not ready by then, fail the probe.
> >> + *
> >> + * While the wait is not required on Grace Hopper systems, the
> >> + * check is still worthwhile to ensure the device is in the
> >> + * expected state.
> >> + */
> >> +static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
> >> +{
> >> + unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
> >> + void __iomem *io;
> >> + int ret = -ETIME;
> >> +
> >> + io = pci_iomap(pdev, 0, 0);
> >> + if (!io)
> >> + return -ENOMEM;
> >> +
> >> + do {
> >> + if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> >> + (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> >> + ret = 0;
> >> + goto reg_check_exit;
> >> + }
> >> + msleep(POLL_QUANTUM_MS);
> >> + } while (!time_after(jiffies, timeout));
> >> +
> >> +reg_check_exit:
> >> + pci_iounmap(pdev, io);
> >> + return ret;
> >
> > We're accessing device memory here but afaict the memory enable bit of
> > the command register is in an indeterminate state. What happens if you
> > use setpci to clear the memory enable bit or 'echo 0 > enable' before
> > binding the driver? Thanks,
> >
> > Alex
>
> Hi Alex, sorry, I didn't understand how we are accessing device memory here
> if C2C_LINK_BAR0_OFFSET and HBM_TRAINING_BAR0_OFFSET are BAR0 regs.
> But anyway, I tried 'echo 0 > <sysfs_path>/enable' before binding the device.
> I am not observing any issue and the bind goes through.
>
> Or am I missing something?
BAR0 is what I'm referring to as device memory. We cannot access
registers in BAR0 unless the memory space enable bit of the command
register is set. The nvgrace-gpu driver makes no effort to enable this,
and I don't think the PCI core does so before probe either. Disabling
through sysfs will only disable if it was previously enabled, so
possibly that test was invalid. Please try with setpci:
# Read the command register
$ setpci -s xxxx:xx:xx.x COMMAND
# Clear the memory space enable bit (write value 0 under mask 0x2, i.e. bit 1)
$ setpci -s xxxx:xx:xx.x COMMAND=0:2
# Re-read the command register to confirm bit 1 is now clear
$ setpci -s xxxx:xx:xx.x COMMAND
Probe the driver here, now that the memory enable bit should read back
as unset. Thanks,
Alex
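
For illustration, here is a minimal sketch of the kind of change implied
above: enabling the device, and hence memory space decode, before touching
BAR0. This is an assumption of what a fix might look like, not the actual
follow-up patch; it reuses the names from the quoted code and assumes
pci_enable_device()/pci_disable_device() are safe to call at this point
in probe:

static int nvgrace_gpu_wait_device_ready(struct pci_dev *pdev)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(POLL_TIMEOUT_MS);
	void __iomem *io;
	int ret;

	/*
	 * pci_enable_device() sets the memory space enable bit in the
	 * command register, so the BAR0 reads below actually decode on
	 * the device (with decode disabled they would typically return
	 * all-ones or fault).
	 */
	ret = pci_enable_device(pdev);
	if (ret)
		return ret;

	io = pci_iomap(pdev, 0, 0);
	if (!io) {
		ret = -ENOMEM;
		goto out_disable;
	}

	ret = -ETIME;
	do {
		if (ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY &&
		    ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY) {
			ret = 0;
			break;
		}
		msleep(POLL_QUANTUM_MS);
	} while (!time_after(jiffies, timeout));

	pci_iounmap(pdev, io);
out_disable:
	pci_disable_device(pdev);
	return ret;
}

With decode enabled by the driver itself, the state of the memory enable
bit before binding no longer matters, which is exactly the failure mode
the setpci test above is meant to expose.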