Message-ID:
<SA1PR12MB71996EBCA4142458E8BEE367B04B2@SA1PR12MB7199.namprd12.prod.outlook.com>
Date: Fri, 9 Feb 2024 09:20:22 +0000
From: Ankit Agrawal <ankita@...dia.com>
To: "Tian, Kevin" <kevin.tian@...el.com>, Jason Gunthorpe <jgg@...dia.com>,
"alex.williamson@...hat.com" <alex.williamson@...hat.com>, Yishai Hadas
<yishaih@...dia.com>, "mst@...hat.com" <mst@...hat.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "clg@...hat.com" <clg@...hat.com>,
"oleksandr@...alenko.name" <oleksandr@...alenko.name>, "K V P, Satyanarayana"
<satyanarayana.k.v.p@...el.com>, "eric.auger@...hat.com"
<eric.auger@...hat.com>, "brett.creeley@....com" <brett.creeley@....com>,
"horms@...nel.org" <horms@...nel.org>, Rahul Rameshbabu
<rrameshbabu@...dia.com>
CC: Aniket Agashe <aniketa@...dia.com>, Neo Jia <cjia@...dia.com>, Kirti
Wankhede <kwankhede@...dia.com>, "Tarun Gupta (SW-GPU)"
<targupta@...dia.com>, Vikram Sethi <vsethi@...dia.com>, Andy Currid
<acurrid@...dia.com>, Alistair Popple <apopple@...dia.com>, John Hubbard
<jhubbard@...dia.com>, Dan Williams <danw@...dia.com>, "Anuj Aggarwal
(SW-GPU)" <anuaggarwal@...dia.com>, Matt Ochs <mochs@...dia.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "virtualization@...ts.linux-foundation.org"
<virtualization@...ts.linux-foundation.org>
Subject: Re: [PATCH v17 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for
grace hopper
Thanks Kevin for the review. Comments inline.
>>
>> Note that the usemem memory is added by the VM Nvidia device driver [5]
>> to the VM kernel as memblocks. Hence make the usable memory size
>> memblock aligned.
>
> Is memblock size defined in spec or purely a guest implementation choice?
The MEMBLOCK value is a hardwired constant and an ABI value between the GPU
FW and the VFIO driver.
>>
>> If the bare metal properties are not present, the driver registers the
>> vfio-pci-core function pointers.
>
> so if qemu doesn't generate such property the variant driver running
> inside guest will always go to use core functions and guest vfio userspace
> will observe both resmem and usemem bars. But then there is nothing
> in field to prohibit mapping resmem bar as cacheable.
>
> should this driver check the presence of either ACPI property or
> resmem/usemem bars to enable variant function pointers?
Maybe I am missing something here, but if the ACPI property is absent,
the real physical BARs present on the device will be exposed by the
vfio-pci-core functions to the VM. So I think if the variant driver is run
within the VM, it should not see the fake usemem and resmem BARs.
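For illustration, the probe-time check could look roughly like the sketch
below. The property names and the helper are placeholders of mine, not
necessarily what the patch uses:

#include <linux/pci.h>
#include <linux/property.h>

/*
 * Sketch only: returns true when the bare metal ACPI DSD properties
 * describing the coherent device memory are present.  The property
 * names here are illustrative placeholders.
 */
static bool nvgrace_gpu_has_bare_metal_props(struct pci_dev *pdev)
{
	u64 memphys, memlength;

	return !device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-base-pa",
					 &memphys) &&
	       !device_property_read_u64(&pdev->dev, "nvidia,gpu-mem-size",
					 &memlength);
}

/*
 * probe() would then register the variant ops (which synthesize the fake
 * usemem/resmem BARs) when this returns true, and fall back to the plain
 * vfio-pci-core ops otherwise.
 */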
>> +config NVGRACE_GPU_VFIO_PCI
>> + tristate "VFIO support for the GPU in the NVIDIA Grace Hopper Superchip"
>> + depends on ARM64 || (COMPILE_TEST && 64BIT)
>> + select VFIO_PCI_CORE
>> + help
>> + VFIO support for the GPU in the NVIDIA Grace Hopper Superchip is
>> + required to assign the GPU device using KVM/qemu/etc
>
> "assign the GPU device to userspace"
Ack, will make the change.
>> +
>> +/* Memory size expected as non cached and reserved by the VM driver */
>> +#define RESMEM_SIZE 0x40000000
>> +#define MEMBLK_SIZE 0x20000000
>
> also add a comment for MEMBLK_SIZE
Yes.
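For example, the extra comment could read something like this (wording
mine, not from the patch):

/* Memory size expected as non cached and reserved by the VM driver */
#define RESMEM_SIZE	0x40000000	/* 1 GB */

/*
 * Granularity in which the VM driver memory-hotplugs the usable device
 * memory into the guest; a fixed ABI constant between the GPU firmware
 * and this driver, so the exposed usemem size is aligned to it.
 */
#define MEMBLK_SIZE	0x20000000	/* 512 MB */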
>> +
>> +struct nvgrace_gpu_vfio_pci_core_device {
>
> will nvgrace refer to a non-gpu device? if not probably all prefixes with
> 'nvgrace_gpu' can be simplified to 'nvgrace'.
nvgrace_gpu indicates the GPU associated with the Grace CPU here. That is
one of the reasons behind keeping the nomenclature as nvgrace_gpu. Calling
it nvgrace may not be apt since Grace itself is the CPU.
> btw following other variant drivers 'vfio' can be removed too.
Ack.
>> +
>> +/*
>> + * Both the usable (usemem) and the reserved (resmem) device memory region
>> + * are 64b fake BARs in the VM. These fake BARs must respond
>
> s/VM/device/
I can reword it to something like "64b fake device BARs in the VM".
>> + int ret;
>> +
>> + ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
>> + if (ret < 0)
>> + return ret;
>
> here if core_read succeeds *ppos has been updated...
Thanks for pointing that out, will fix the code handling *ppos.
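One way the fix could look, as a sketch only: snapshot the offset before
calling vfio_pci_core_read(), since a successful core read advances *ppos.
The function name below is illustrative and the rest of the body is elided:

#include <linux/vfio_pci_core.h>

static ssize_t nvgrace_gpu_read_config_emu(struct vfio_device *core_vdev,
					   char __user *buf, size_t count,
					   loff_t *ppos)
{
	u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;	/* capture first */
	ssize_t ret;

	ret = vfio_pci_core_read(core_vdev, buf, count, ppos);
	if (ret < 0)
		return ret;

	/*
	 * Any fixups below must use the saved 'offset', not *ppos,
	 * which now points past the bytes just read.
	 */
	return ret;
}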
>> + * Read the data from the device memory (mapped either through ioremap
>> + * or memremap) into the user buffer.
>> + */
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> + char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret;
>> +
>> + /*
>> + * Handle read on the BAR regions. Map to the target device memory
>> + * physical address and copy to the request read buffer.
>> + */
>
> duplicate with the earlier comment for the function.
Ack.
>> +/*
>> + * Read count bytes from the device memory at an offset. The actual device
>> + * memory size (available) may not be a power-of-2. So the driver fakes
>> + * the size to a power-of-2 (reported) when exposing to a user space driver.
>> + *
>> + * Reads extending beyond the reported size are truncated; reads starting
>> + * beyond the reported size generate -EINVAL; reads extending beyond the
>> + * actual device size is filled with ~0.
>
> slightly clearer to order the description: read starting beyond reported
> size, then read extending beyond device size, and read extending beyond
> reported size.
Yes, will make the change.
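A sketch of the boundary handling in that order is below. 'reported' is
the power-of-2 size exposed to userspace and 'available' the real device
memory size; both names and the helper itself are placeholders, not code
from the patch:

#include <linux/errno.h>
#include <linux/minmax.h>
#include <linux/uaccess.h>

static ssize_t bounds_check_sketch(char __user *buf, size_t count,
				   loff_t offset, size_t reported,
				   size_t available)
{
	size_t fill = 0, i;

	/* read starting beyond the reported size */
	if (offset >= reported)
		return -EINVAL;

	/* read extending beyond the reported size is truncated */
	count = min_t(size_t, count, reported - offset);

	/* bytes past the real device memory read back as ~0 */
	if (offset + count > available) {
		fill = min_t(size_t, count, offset + count - available);
		for (i = 0; i < fill; i++)
			if (put_user((u8)0xff,
				     (u8 __user *)buf + count - fill + i))
				return -EFAULT;
	}

	/* the remaining (count - fill) bytes come from device memory */
	return count;
}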
>> + * exposing the rest (termed as usable memory and represented using usemem)
>> + * as cacheable 64b BAR (region 4 and 5).
>> + *
>> + * devmem (memlength)
>> + * |-------------------------------------------------|
>> + * | |
>> + * usemem.phys/memphys resmem.phys
>
> there is no usemem.phys and resmem.phys
Silly miss. Will fix that.
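For reference, the fields the diagram could point at instead (as they
appear in the quoted code) look roughly like the following; the struct
layout itself is only illustrative:

/* Sketch of the region bookkeeping referenced by the diagram. */
struct mem_region {
	phys_addr_t memphys;	/* start of the device memory region */
	size_t memlength;	/* length of the device memory region */
};

struct nvgrace_gpu_vfio_pci_core_device {
	/* ... */
	struct mem_region resmem;	/* reserved (uncached) portion */
	struct mem_region usemem;	/* usable (cacheable) portion */
};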
>> + */
>> + nvdev->usemem.memphys = memphys;
>> +
>> + /*
>> + * The device memory exposed to the VM is added to the kernel by the
>> + * VM driver module in chunks of memory block size. Only the usable
>> + * memory (usemem) is added to the kernel for usage by the VM
>> + * workloads. Make the usable memory size memblock aligned.
>> + */
>
> If memblock size is defined by hw spec then say so.
> otherwise this sounds a broken contract if it's a guest-decided value.
Answered above. Will update the comment to mention that.
>> + if (check_sub_overflow(memlength, RESMEM_SIZE,
>> + &nvdev->usemem.memlength)) {
>> + ret = -EOVERFLOW;
>> + goto done;
>> + }
>
> does resmem require 1G-aligned?
No, the requirement is only that it be at least 1G. There are no alignment
restrictions on it.
> if usemem.memlength becomes 0 then should return error too.
I suppose you mean if it becomes 0 after the subtraction above. We are
otherwise checking for a 0 size in the host ACPI table.
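A sketch of how the size derivation could look with a zero-length check
added after the memblock alignment; the comment wording and the error code
here are illustrative:

	if (check_sub_overflow(memlength, RESMEM_SIZE,
			       &nvdev->usemem.memlength)) {
		ret = -EOVERFLOW;
		goto done;
	}

	/*
	 * The usable memory is hotplugged by the VM driver in MEMBLK_SIZE
	 * chunks (a fixed GPU FW/driver ABI value), so align it down.
	 */
	nvdev->usemem.memlength = round_down(nvdev->usemem.memlength,
					     MEMBLK_SIZE);
	if (!nvdev->usemem.memlength) {
		ret = -EINVAL;
		goto done;
	}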