Message-ID: <BN9PR11MB5276CF18ADF0B687429118918C522@BN9PR11MB5276.namprd11.prod.outlook.com>
Date: Sun, 18 Feb 2024 01:00:23 +0000
From: "Tian, Kevin" <kevin.tian@...el.com>
To: "ankita@...dia.com" <ankita@...dia.com>, "jgg@...dia.com"
<jgg@...dia.com>, "alex.williamson@...hat.com" <alex.williamson@...hat.com>,
"yishaih@...dia.com" <yishaih@...dia.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "mst@...hat.com" <mst@...hat.com>,
"eric.auger@...hat.com" <eric.auger@...hat.com>, "jgg@...pe.ca"
<jgg@...pe.ca>, "oleksandr@...alenko.name" <oleksandr@...alenko.name>,
"clg@...hat.com" <clg@...hat.com>, "K V P, Satyanarayana"
<satyanarayana.k.v.p@...el.com>, "brett.creeley@....com"
<brett.creeley@....com>, "horms@...nel.org" <horms@...nel.org>,
"shannon.nelson@....com" <shannon.nelson@....com>
CC: "aniketa@...dia.com" <aniketa@...dia.com>, "cjia@...dia.com"
<cjia@...dia.com>, "kwankhede@...dia.com" <kwankhede@...dia.com>,
"targupta@...dia.com" <targupta@...dia.com>, "vsethi@...dia.com"
<vsethi@...dia.com>, "Currid, Andy" <acurrid@...dia.com>,
"apopple@...dia.com" <apopple@...dia.com>, "jhubbard@...dia.com"
<jhubbard@...dia.com>, "danw@...dia.com" <danw@...dia.com>,
"anuaggarwal@...dia.com" <anuaggarwal@...dia.com>, "mochs@...dia.com"
<mochs@...dia.com>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>
Subject: RE: [PATCH v18 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for
grace hopper
> From: ankita@...dia.com <ankita@...dia.com>
> Sent: Friday, February 16, 2024 11:01 AM
>
> From: Ankit Agrawal <ankita@...dia.com>
>
> NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
> for the on-chip GPU that is the logical OS representation of the
> internal proprietary chip-to-chip cache coherent interconnect.
>
> The device is peculiar compared to a real PCI device in that whilst
> there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
> device, it is not used to access device memory once the faster
> chip-to-chip interconnect is initialized (occurs at the time of host
> system boot). The device memory is accessed instead using the chip-to-chip
> interconnect that is exposed as a contiguous physically addressable
> region on the host. This device memory aperture can be obtained from
> the host ACPI table using device_property_read_u64(), according to
> the FW specification. Since the device memory is cache coherent with
> the CPU, it can be mmapped into the user VMA with a cacheable mapping
> using remap_pfn_range() and used like regular RAM. The device memory
> is not added to the host kernel, but mapped directly, as this avoids
> the memory overhead of struct pages.
>
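For readers following along, a minimal sketch of how the variant driver
might pick up the aperture and map it cacheably. The property names
"nvidia,gpu-mem-base-pa"/"nvidia,gpu-mem-size" and the helper names are
my assumptions for illustration, not the actual driver symbols:

#include <linux/pci.h>
#include <linux/property.h>
#include <linux/mm.h>
#include <linux/pfn.h>

/*
 * Read the coherent device memory aperture from the ACPI-provided
 * device properties (property names assumed for illustration).
 */
static int sketch_get_aperture(struct pci_dev *pdev, u64 *base, u64 *size)
{
        int ret;

        ret = device_property_read_u64(&pdev->dev,
                                       "nvidia,gpu-mem-base-pa", base);
        if (ret)
                return ret;

        return device_property_read_u64(&pdev->dev,
                                        "nvidia,gpu-mem-size", size);
}

/*
 * Back the user VMA with the aperture using a normal cacheable
 * mapping, since the memory is cache coherent with the CPU.
 */
static int sketch_mmap_usemem(struct vm_area_struct *vma, u64 base)
{
        return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(base),
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}
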
> There is also a requirement for a minimum 1G uncached reserved region
> (termed resmem) to support the Multi-Instance GPU (MIG) feature [1].
> This is to work around a HW defect. Based on [2], the requisite properties
> (uncached, unaligned access) can be achieved through a VM mapping (S1)
> of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
> a different non-cached property to the reserved 1G region, it needs to
> be carved out from the device memory and mapped as a separate region
> in the Qemu VMA with pgprot_writecombine(), which sets the Qemu VMA
> page properties (pgprot) to NORMAL_NC.
>
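And a matching sketch for the carved-out region, where the only
difference is switching the pgprot to write-combine before remapping
(again, the names are mine):

/*
 * Map the reserved (resmem) carve-out with a write-combine pgprot,
 * i.e. NORMAL_NC from the user (S1) mapping's point of view.
 */
static int sketch_mmap_resmem(struct vm_area_struct *vma, u64 resmem_base)
{
        vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
        return remap_pfn_range(vma, vma->vm_start, PHYS_PFN(resmem_base),
                               vma->vm_end - vma->vm_start,
                               vma->vm_page_prot);
}
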
> Provide a VFIO PCI variant driver that adapts the unique device memory
> representation into a more standard PCI representation facing userspace.
>
> The variant driver exposes these two regions, the non-cached reserved
> region (resmem) and the cached rest of the device memory (usemem), as
> separate VFIO 64b BAR regions. This is divergent from the baremetal
> approach, where the device memory is exposed as a device memory region.
> A different approach was taken because the baremetal approach would
> necessitate additional code in Qemu to discover and insert those
> regions in the VM IPA, along with the additional VM ACPI DSDT changes to
> communicate the device memory region IPA to the VM workloads. Moreover,
> this behavior would have to be added to a variety of emulators (beyond
> top of tree Qemu) out there desiring Grace Hopper support.
>
> Since the device implements 64-bit BAR0, the VFIO PCI variant driver
> maps the uncached carved out region to the next available PCI BAR (i.e.
> comprising regions 2 and 3). The cached device memory aperture is
> assigned BAR regions 4 and 5. Qemu will then naturally generate a PCI
> device in the VM with the uncached aperture reported as BAR2 region,
> the cacheable as BAR4. The variant driver provides emulation for these
> fake BARs' PCI config space offset registers.
>
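A much-simplified sketch of what that emulation amounts to for a BAR
read (the structure and names are mine, and the real driver also has to
handle the all-ones size-probing writes):

/*
 * Return one dword of a synthetic 64-bit BAR: the value userspace last
 * wrote, masked to the power-of-two BAR size, with the 64-bit
 * prefetchable memory flag bits set on the low dword.
 */
static u32 sketch_fake_bar_read(u64 saved_val, u64 bar_size, bool low_dword)
{
        u64 val = saved_val & ~(bar_size - 1);  /* bar_size is a power of two */

        if (low_dword)
                return lower_32_bits(val) | PCI_BASE_ADDRESS_MEM_TYPE_64 |
                       PCI_BASE_ADDRESS_MEM_PREFETCH;

        return upper_32_bits(val);
}
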
> The hardware ensures that the system does not crash when the memory
> is accessed with the memory enable turned off. It synthesizes ~0 for
> reads and drops writes on such accesses. So there is no need to
> support enabling/disabling the BARs through the PCI_COMMAND config
> space register.
>
> The memory layout on the host looks like the following:
> devmem (memlength)
> |--------------------------------------------------|
> |-------------cached------------------------|--NC--|
> |                                            |
> usemem.memphys                               resmem.memphys
>
> PCI BAR sizes need to be a power-of-2, but the actual memory on the
> device may not be. A read or write access to a physical address from
> the last device PFN up to the next power-of-2 aligned physical address
> results in reading ~0 and dropped writes. Note that the GPU device
> driver [5] is capable of knowing the exact device memory size through
> separate means. The device memory size is primarily kept in the system
> ACPI tables for use by the VFIO PCI variant module.
>
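In other words the advertised BAR size is just the aperture length
rounded up to a power of two, roughly (helper name is mine):

#include <linux/log2.h>

/*
 * The fake BAR must advertise a power-of-two size even though the real
 * aperture length may not be one; accesses beyond the real length read
 * ~0 and drop writes, so rounding up is safe.
 */
static u64 sketch_fake_bar_size(u64 memlength)
{
        return roundup_pow_of_two(memlength);
}
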
> Note that the usemem memory is added by the VM Nvidia device driver
> [5] to the VM kernel as memblocks. Hence make the usable memory size
> memblock size (MEMBLK_SIZE) aligned. This is a hardwired ABI value
> between the GPU FW and the VFIO driver. The VM device driver makes
> use of the same value in its calculation to determine the USEMEM
> size.
>
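A sketch of that alignment step, with a placeholder constant standing
in for the hardwired ABI value (the real MEMBLK_SIZE may differ, and
the rounding direction here is only illustrative):

#include <linux/sizes.h>
#include <linux/align.h>

#define SKETCH_MEMBLK_SIZE      SZ_512M /* placeholder, not the ABI value */

/*
 * Align the usable (cached) device memory size to the memblock
 * granularity shared with the VM driver.
 */
static u64 sketch_usemem_size(u64 memlength)
{
        return ALIGN(memlength, SKETCH_MEMBLK_SIZE);
}
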
> Currently there is no provision in KVM for a S2 mapping with
> MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
> As previously mentioned, resmem is mapped with pgprot_writecombine(),
> which sets the Qemu VMA page properties (pgprot) to NORMAL_NC. Using the
> proposed changes in [3] and [4], KVM marks the region with
> MemAttr[2:0]=0b101 in S2.
>
> If the device memory properties are not present, the driver registers the
> vfio-pci-core function pointers. Since there are no ACPI memory properties
> generated for the VM, the variant driver inside the VM will only use
> the vfio-pci-core ops and hence try to map the BARs as non-cached. This
> is not a problem as the CPUs have FWB enabled, which blocks the VM
> mapping's ability to override the cacheability set by the host mapping.
>
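The fallback is essentially a probe-time check, something along these
lines (property names assumed as before, ops symbols hypothetical):

/*
 * If the ACPI device properties are absent (e.g. when probed inside a
 * VM), fall back to the plain vfio-pci-core ops instead of the variant
 * ops that synthesize the fake BARs.
 */
static bool sketch_has_mem_props(struct pci_dev *pdev)
{
        u64 base, size;

        return !device_property_read_u64(&pdev->dev,
                                         "nvidia,gpu-mem-base-pa", &base) &&
               !device_property_read_u64(&pdev->dev,
                                         "nvidia,gpu-mem-size", &size);
}

/* probe would then do something like:
 *      ops = sketch_has_mem_props(pdev) ? &nvgrace_gpu_ops : &core_ops;
 */
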
> This goes along with a qemu series [6] that provides the necessary
> implementation of the Grace Hopper Superchip firmware specification so
> that the guest operating system can see the correct ACPI modeling for
> the coherent GPU device. Verified with the CUDA workload in the VM.
>
> [1] https://www.nvidia.com/en-in/technologies/multi-instance-gpu/
> [2] section D8.5.5 of
> https://developer.arm.com/documentation/ddi0487/latest/
> [3] https://lore.kernel.org/all/20240211174705.31992-1-ankita@nvidia.com/
> [4] https://lore.kernel.org/all/20230907181459.18145-2-ankita@nvidia.com/
> [5] https://github.com/NVIDIA/open-gpu-kernel-modules
> [6] https://lore.kernel.org/all/20231203060245.31593-1-ankita@nvidia.com/
>
> Signed-off-by: Aniket Agashe <aniketa@...dia.com>
> Signed-off-by: Ankit Agrawal <ankita@...dia.com>
Reviewed-by: Kevin Tian <kevin.tian@...el.com>