Message-ID:
<SA1PR12MB719904680D5961E6F1806F12B0702@SA1PR12MB7199.namprd12.prod.outlook.com>
Date: Fri, 19 Jan 2024 03:33:19 +0000
From: Ankit Agrawal <ankita@...dia.com>
To: Alex Williamson <alex.williamson@...hat.com>
CC: Jason Gunthorpe <jgg@...dia.com>, Yishai Hadas <yishaih@...dia.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "kevin.tian@...el.com"
<kevin.tian@...el.com>, "eric.auger@...hat.com" <eric.auger@...hat.com>,
"brett.creeley@....com" <brett.creeley@....com>, "horms@...nel.org"
<horms@...nel.org>, Aniket Agashe <aniketa@...dia.com>, Neo Jia
<cjia@...dia.com>, Kirti Wankhede <kwankhede@...dia.com>, "Tarun Gupta
(SW-GPU)" <targupta@...dia.com>, Vikram Sethi <vsethi@...dia.com>, Andy
Currid <acurrid@...dia.com>, Alistair Popple <apopple@...dia.com>, John
Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>, "Anuj Aggarwal
(SW-GPU)" <anuaggarwal@...dia.com>, Matt Ochs <mochs@...dia.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v16 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for
grace hopper
>> Signed-off-by: Ankit Agrawal <ankita@...dia.com>
>> Signed-off-by: Aniket Agashe <aniketa@...dia.com>
>> Tested-by: Ankit Agrawal <ankita@...dia.com>
>
> Dunno about others, but I sure hope and assume the author tests ;)
> Sometimes I'm proven wrong.
Yeah, does not hurt to keep it then I suppose. :)
>> +
>> +#include "nvgrace_gpu_vfio_pci.h"
>
> Could probably just shorten this to nvgrace_gpu.h, but with just a
> single source file, we don't need a separate header. Put it inline here.
Ack.
>> +
>> +/* Choose the structure corresponding to the fake BAR with a given index. */
>> +struct mem_region *
>
> static
Yes.
>> + *
>> + * The available GPU memory size may not be power-of-2 aligned. Map up
>> + * to the size of the device memory. If the memory access is beyond the
>> + * actual GPU memory size, it will be handled by the vfio_device_ops
>> + * read/write.
>
> The phrasing "[m]ap up to the size" suggests the behavior of previous
> versions where we'd truncate mappings. Maybe something like:
>
> * The available GPU memory size may not be power-of-2 aligned.
> * The remainder is only backed by read/write handlers.
Got it. Will fix.
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->resmem.memlength),
>> + PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> + PCI_BASE_ADDRESS_MEM_PREFETCH,
>> + nvdev->resmem.u64_reg);
>> + if (copy_to_user(buf + copy_offset,
>> + (void *)&val64 + register_offset, copy_count))
>> + return -EFAULT;
>> + }
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->usemem.memlength),
>> + PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> + PCI_BASE_ADDRESS_MEM_PREFETCH,
>> + nvdev->usemem.u64_reg);
>> + if (copy_to_user(buf + copy_offset,
>> + (void *)&val64 + register_offset, copy_count))
>> + return -EFAULT;
>> + }
>
> Both read and write could be simplified a bit:
>
> if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> if (memregion) {
> val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(memregion->memlength),
> PCI_BASE_ADDRESS_MEM_TYPE_64 |
> PCI_BASE_ADDRESS_MEM_PREFETCH,
> memregion->u64_reg);
> if (copy_to_user(buf + copy_offset,
> (void *)&val64 + register_offset, copy_count))
> return -EFAULT;
> }
Yes thanks, will make the change.
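One detail I'll take care of while reworking it: memregion needs a NULL
initializer so that an access intersecting neither fake BAR skips the
emulation and falls through to the core read handler, i.e. (sketch):

        /* NULL indicates the access overlaps neither emulated BAR */
        struct mem_region *memregion = NULL;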
>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> + const char __user *buf, size_t count, loff_t *ppos)
>> +{
>> + struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> + container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
>> + core_device.vdev);
>> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + size_t register_offset;
>> + loff_t copy_offset;
>> + size_t copy_count;
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + if (copy_from_user((void *)&nvdev->resmem.u64_reg + register_offset,
>> + buf + copy_offset, copy_count))
>> + return -EFAULT;
>> + *ppos += copy_count;
>> + return copy_count;
>> + }
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
>> + &copy_offset, &copy_count, &register_offset)) {
>> + if (copy_from_user((void *)&nvdev->usemem.u64_reg + register_offset,
>> + buf + copy_offset, copy_count))
>> + return -EFAULT;
>> + *ppos += copy_count;
>> + return copy_count;
>> + }
>
> Likewise:
>
> if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
> &copy_offset, &copy_count, &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> if (memregion) {
> if (copy_from_user((void *)&memregion->u64_reg + register_offset,
> buf + copy_offset, copy_count))
> return -EFAULT;
>
> *ppos += copy_count;
> return copy_count;
> }
Ack.
>> +
>> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
>> +}
>> +
>> +/*
>> + * Ad hoc map the device memory in the module kernel VA space. Primarily needed
>> + * to support QEMU's device x-no-mmap=on option.
>
> In general we try not to assume QEMU is the userspace driver. This
> certainly supports x-no-mmap=on in QEMU, but this is needed because
> vfio does not require the userspace driver to only perform accesses
> through mmaps of the vfio-pci BAR regions and existing userspace driver
> precedent requires read/write implementations.
Makes sense. Will rephrase it.
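Perhaps something along these lines (draft wording, based on your note):

        /*
         * Ad hoc map the device memory in the module kernel VA space.
         * Primarily needed as vfio does not require the userspace driver
         * to only perform accesses through mmaps of the vfio-pci BAR
         * regions, so such accesses must be backed by the read/write
         * vfio_device_ops implementations.
         */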
>> + *
>> + * The usemem region is cacheable memory and hence is memremapped.
>> + * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
>> + */
>> +static int
>> +nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> + int index)
>> +{
>> + int ret = 0;
>> +
>> + mutex_lock(&nvdev->remap_lock);
>> + if (index == USEMEM_REGION_INDEX &&
>> + !nvdev->usemem.bar_remap.memaddr) {
>> + nvdev->usemem.bar_remap.memaddr =
>> + memremap(nvdev->usemem.memphys,
>> + nvdev->usemem.memlength, MEMREMAP_WB);
>> + if (!nvdev->usemem.bar_remap.memaddr)
>> + ret = -ENOMEM;
>> + } else if (index == RESMEM_REGION_INDEX &&
>> + !nvdev->resmem.bar_remap.ioaddr) {
>> + nvdev->resmem.bar_remap.ioaddr =
>> + ioremap_wc(nvdev->resmem.memphys,
>> + nvdev->resmem.memlength);
>> + if (!nvdev->resmem.bar_remap.ioaddr)
>> + ret = -ENOMEM;
>> + }
>
> With an anonymous union we could reduce a lot of the redundancy here:
>
> struct mem_region *memregion;
> int ret = 0;
>
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> if (!memregion)
> return -EINVAL;
>
> mutex_lock(&nvdev->remap_lock);
>
> if (memregion->memaddr)
> goto unlock;
>
> if (index == USEMEM_REGION_INDEX)
> memregion->memaddr = memremap(memregion->memphys,
> memregion->memlength,
> MEMREMAP_WB);
> else
> memregion->ioaddr = ioremap_wc(memregion->memphys,
> memregion->memlength);
>
> if (!memregion->memaddr)
> ret = -ENOMEM;
>
> unlock:
> ...
Great suggestion, thanks. Will update.
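For reference, I'm thinking of folding the two mapping pointers into the
struct like this (sketch):

        struct mem_region {
                phys_addr_t memphys;    /* Base physical address of the region */
                size_t memlength;       /* Region size */
                __le64 u64_reg;         /* Emulated BAR offset registers */
                union {                 /* Anonymous union: members alias */
                        void *memaddr;
                        void __iomem *ioaddr;
                };
        };

That way the common NULL/-ENOMEM checks can all go through
memregion->memaddr for both mapping flavors, as in your snippet.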
> BTW, why does this function have args (nvdev, index) but
> nvgrace_gpu_vfio_pci_fake_bar_mem_region has args (index, nvdev)?
It shouldn't; I missed keeping the argument order consistent. Will fix.
> nvgrace_gpu_vfio_pci_fake_bar_mem_region could also be shortened to just
> nvgrace_gpu_memregion and I think we could use nvgrace_gpu in place of
> nvgrace_gpu_vfio_pci for function names throughout.
Ack.
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> + char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret = 0;
>> +
>> + /*
>> + * Handle read on the BAR regions. Map to the target device memory
>> + * physical address and copy to the request read buffer.
>> + */
>> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
>> + if (ret)
>> + return ret;
>> +
>
>
> This seems like a good place for a comment regarding COMMAND_MEM being
> ignored, especially since we're passing 'false' for test_mem in the
> second branch.
Good point. Will add the comment.
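Roughly along these lines (wording to be refined):

        /*
         * Accesses are serviced irrespective of the (virtual)
         * PCI_COMMAND memory enable state since the fake BARs are
         * backed by device/system memory rather than real PCI BARs;
         * hence test_mem is passed as false below.
         */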
>> + *
>> + * A read from a negative or an offset greater than reported size, a negative
>> + * count are considered error conditions and returned with an -EINVAL.
>
> This needs some phrasing help, I can't parse.
Yeah, I'll itemize the error conditions to make it more readable.
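Something like:

        /*
         * The following are considered error conditions and return
         * -EINVAL:
         * - a negative offset
         * - an offset greater than the reported memory size
         * - a negative count
         */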
>> +static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
>> + const struct pci_device_id *id)
>> +{
>> + struct nvgrace_gpu_vfio_pci_core_device *nvdev;
>> + int ret;
>> +
>> + nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
>> + &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
>> + if (IS_ERR(nvdev))
>> + return PTR_ERR(nvdev);
>> +
>> + dev_set_drvdata(&pdev->dev, nvdev);
>> +
>> + ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
>> + if (ret)
>> + goto out_put_vdev;
>
> As a consequence of exposing the device differently in the host vs
> guest, we need to consider nested assignment here. The device table
> below will force userspace to select this driver for the device, but
> binding to it would fail because these bare metal properties are not
> present. We addressed this in the virtio-vfio-pci driver and decided
> the driver needs to support the device regardless of the availability
> of support for the legacy aspects of that driver. There's no protocol
> defined for userspace to pick a second best driver for a device.
>
> Therefore, like virtio-vfio-pci, this should be able to register a
> straight vfio-pci-core ops when these bare metal properties are not
> present.
Sounds reasonable, will make the change.
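i.e. have probe pick the ops based on whether the bare metal properties
are present, along these lines (sketch; nvgrace_gpu_vfio_pci_core_ops
would be a new ops struct wiring up the plain vfio-pci-core handlers,
and the property name is assumed from the fetch helper):

        const struct vfio_device_ops *ops;

        /*
         * In a VM the bare metal device-tree properties are absent;
         * fall back to plain vfio-pci-core ops so the device can still
         * be bound and (nested) assigned.
         */
        if (device_property_present(&pdev->dev, "nvidia,gpu-mem-base-pa"))
                ops = &nvgrace_gpu_vfio_pci_ops;
        else
                ops = &nvgrace_gpu_vfio_pci_core_ops;

        nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device,
                                  core_device.vdev, &pdev->dev, ops);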
>> +struct mem_region {
>> + phys_addr_t memphys; /* Base physical address of the region */
>> + size_t memlength; /* Region size */
>> + __le64 u64_reg; /* Emulated BAR offset registers */
>
> s/u64_reg/bar_val/ ?
Fine with me.
> We could also include bar_size so we don't recalculate the power-of-2 size.
Sure.
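i.e. compute it once when the memory properties are fetched (field name
tentative):

        /* Cache the power-of-2 aligned fake BAR size at probe time */
        memregion->bar_size = roundup_pow_of_two(memregion->memlength);

so the config read/write paths can use memregion->bar_size directly
instead of recalculating it on every access.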