Message-ID: <BN9PR11MB5276CF18ADF0B687429118918C522@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Sun, 18 Feb 2024 01:00:23 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "ankita@...dia.com" <ankita@...dia.com>, "jgg@...dia.com"
<jgg@...dia.com>, "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"yishaih@...dia.com" <yishaih@...dia.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "mst@...hat.com" <mst@...hat.com>,
"eric.auger@...hat.com" <eric.auger@...hat.com>, "jgg@...pe.ca"
<jgg@...pe.ca>, "oleksandr@...alenko.name" <oleksandr@...alenko.name>,
"clg@...hat.com" <clg@...hat.com>, "K V P, Satyanarayana"
<satyanarayana.k.v.p@...el.com>, "brett.creeley@....com"
<brett.creeley@....com>, "horms@...nel.org" <horms@...nel.org>,
"shannon.nelson@....com" <shannon.nelson@....com>
CC: "aniketa@...dia.com" <aniketa@...dia.com>, "cjia@...dia.com"
<cjia@...dia.com>, "kwankhede@...dia.com" <kwankhede@...dia.com>,
"targupta@...dia.com" <targupta@...dia.com>, "vsethi@...dia.com"
<vsethi@...dia.com>, "Currid, Andy" <acurrid@...dia.com>,
"apopple@...dia.com" <apopple@...dia.com>, "jhubbard@...dia.com"
<jhubbard@...dia.com>, "danw@...dia.com" <danw@...dia.com>,
"anuaggarwal@...dia.com" <anuaggarwal@...dia.com>, "mochs@...dia.com"
<mochs@...dia.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>
Subject: RE: [PATCH v18 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for
grace hopper
> From: ankita@...dia.com <ankita@...dia.com>
> Sent: Friday, February 16, 2024 11:01 AM
>
> From: Ankit Agrawal <ankita@...dia.com>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
>
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from
> the host ACPI table using device_property_read_u64(), according to
> the FW specification. Since the device memory is cache coherent with
> the CPU, it can be mmapped into the user VMA with a cacheable mapping
> using remap_pfn_range() and used like regular RAM. The device memory
> is not added to the host kernel, but mapped directly, as this avoids
> the memory overhead of struct pages.
>
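For readers following along, a minimal sketch of how the variant driver
might pick up the aperture and map it cacheably. The property names
"nvidia,gpu-mem-base-pa"/"nvidia,gpu-mem-size" and the helper names are
my assumptions for illustration, not the actual driver symbols:

#include <linux/pci.h>
#include <linux/property.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Read the coherent device memory aperture from the ACPI-provided
 * device properties (property names assumed for illustration).
 */
static int sketch_get_aperture(struct pci_dev *pdev, u64 *base, u64 *size)
{
        int ret;

        ret = device_property_read_u64(&pdev->dev,
                                       "nvidia,gpu-mem-base-pa", base);
        if (ret)
                return ret;

        return device_property_read_u64(&pdev->dev,
                                        "nvidia,gpu-mem-size", size);
}

/*
 * Back the user VMA with the aperture using a normal cacheable
 * mapping, since the memory is cache coherent with the CPU.
 */
static int sketch_mmap_usemem(struct vm_area_struct *vma, u64 base)
{
        return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(base),
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}
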
> There is also a requirement for a minimum 1G uncached reserved region
> (termed resmem) to support the Multi-Instance GPU (MIG) feature [1].
> This is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in the Qemu VMA with pgprot_writecombine(), which sets the Qemu VMA
> page properties (pgprot) to NORMAL_NC.
>
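And a matching sketch for the carved-out region, where the only
difference is switching the pgprot to write-combine before remapping
(again, the names are mine):

/*
 * Map the reserved (resmem) carve-out with a write-combine pgprot,
 * i.e. NORMAL_NC from the user (S1) mapping's point of view.
 */
static int sketch_mmap_resmem(struct vm_area_struct *vma, u64 resmem_base)
{
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(resmem_base),
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}
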
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
>
> The variant driver exposes these two regions, the non-cached reserved
> region (resmem) and the cached rest of the device memory (usemem), as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> A different approach was taken because the baremetal approach would
> necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring Grace Hopper support.
>
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising regions 2 and 3). The cached device memory aperture is
> assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
>
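A much-simplified sketch of what that emulation amounts to for a BAR
read (the structure and names are mine, and the real driver also has to
handle the all-ones size-probing writes):

/*
 * Return one dword of a synthetic 64-bit BAR: the value userspace last
 * wrote, masked to the power-of-two BAR size, with the 64-bit
 * prefetchable memory flag bits set on the low dword.
 */
static u32 sketch_fake_bar_read(u64 saved_val, u64 bar_size, bool low_dword)
{
        u64 val = saved_val & ~(bar_size - 1);  /* bar_size is a power of two */

        if (low_dword)
                return lower_32_bits(val) | PCI_BASE_ADDRESS_MEM_TYPE_64 |
                       PCI_BASE_ADDRESS_MEM_PREFETCH;

        return upper_32_bits(val);
}
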
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 for
> reads and drops writes on such accesses. So there is no need to
> support enabling/disabling the BARs through the PCI_COMMAND config
> space register.
>
> The memory layout on the host looks like the following:
> devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                            |
> usemem.memphys                               resmem.memphys
>
> PCI BAR sizes need to be a power-of-2, but the actual memory on the
> device may not be. A read or write access to a physical address from
> the last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [5] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
>
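In other words the advertised BAR size is just the aperture length
rounded up to a power of two, roughly (helper name is mine):

#include <linux/log2.h>

/*
 * The fake BAR must advertise a power-of-two size even though the real
 * aperture length may not be one; accesses beyond the real length read
 * ~0 and drop writes, so rounding up is safe.
 */
static u64 sketch_fake_bar_size(u64 memlength)
{
        return roundup_pow_of_two(memlength);
}
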
> Note that the usemem memory is added by the VM Nvidia device driver
> [5] to the VM kernel as memblocks. Hence make the usable memory size
> memblock size (MEMBLK_SIZE) aligned. This is a hardwired ABI value
> between the GPU FW and the VFIO driver. The VM device driver makes
> use of the same value in its calculation to determine the USEMEM
> size.
>
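A sketch of that alignment step, with a placeholder constant standing
in for the hardwired ABI value (the real MEMBLK_SIZE may differ, and
the rounding direction here is only illustrative):

#include <linux/sizes.h>
#include <linux/align.h>

#define SKETCH_MEMBLK_SIZE      SZ_512M /* placeholder, not the ABI value */

/*
 * Align the usable (cached) device memory size to the memblock
 * granularity shared with the VM driver.
 */
static u64 sketch_usemem_size(u64 memlength)
{
        return ALIGN(memlength, SKETCH_MEMBLK_SIZE);
}
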
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped with pgprot_writecombine(),
> which sets the Qemu VMA page properties (pgprot) to NORMAL_NC. Using the
> proposed changes in [3] and [4], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
>
> If the device memory properties are not present, the driver registers the
> vfio-pci-core function pointers. Since there are no ACPI memory properties
> generated for the VM, the variant driver inside the VM will only use
> the vfio-pci-core ops and hence try to map the BARs as non-cached. This
> is not a problem as the CPUs have FWB enabled, which blocks the VM
> mapping's ability to override the cacheability set by the host mapping.
>
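The fallback is essentially a probe-time check, something along these
lines (property names assumed as before, ops symbols hypothetical):

/*
 * If the ACPI device properties are absent (e.g. when probed inside a
 * VM), fall back to the plain vfio-pci-core ops instead of the variant
 * ops that synthesize the fake BARs.
 */
static bool sketch_has_mem_props(struct pci_dev *pdev)
{
        u64 base, size;

        return !device_property_read_u64(&pdev->dev,
                                         "nvidia,gpu-mem-base-pa", &base) &&
               !device_property_read_u64(&pdev->dev,
                                         "nvidia,gpu-mem-size", &size);
}

/* probe would then do something like:
 *      ops = sketch_has_mem_props(pdev) ? &nvgrace_gpu_ops : &core_ops;
 */
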
> This goes along with a qemu series [6] that provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device. Verified with the CUDA workload in the VM.
>
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
> [2] section D8.5.5 of
> https://developer.arm.com/documentation/ddi0487/latest/
> [3] https://lore.kernel.org/all/20240211174705.31992-1-ankita@nvidia.com/
> [4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
> [5] https://github.com/NVIDIA/open-gpu-kernel-modules
> [6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/
>
> Signed-off-by: Aniket Agashe <aniketa@...dia.com>
> Signed-off-by: Ankit Agrawal <ankita@...dia.com>
Reviewed-by: Kevin Tian <kevin.tian@...el.com>