Message-ID:
<SA1PR12MB719904680D5961E6F1806F12B0702@SA1PR12MB7199.namprd12.prod.outlook.com>
Date: Fri, 19 Jan 2024 03:33:19 +0000
From: Ankit Agrawal <ankita@...dia.com>
To: Alex Williamson <alex.williamson@...hat.com>
CC: Jason Gunthorpe <jgg@...dia.com>, Yishai Hadas <yishaih@...dia.com>,
"shameerali.kolothum.thodi@...wei.com"
<shameerali.kolothum.thodi@...wei.com>, "kevin.tian@...el.com"
<kevin.tian@...el.com>, "eric.auger@...hat.com" <eric.auger@...hat.com>,
"brett.creeley@....com" <brett.creeley@....com>, "horms@...nel.org"
<horms@...nel.org>, Aniket Agashe <aniketa@...dia.com>, Neo Jia
<cjia@...dia.com>, Kirti Wankhede <kwankhede@...dia.com>, "Tarun Gupta
(SW-GPU)" <targupta@...dia.com>, Vikram Sethi <vsethi@...dia.com>, Andy
Currid <acurrid@...dia.com>, Alistair Popple <apopple@...dia.com>, John
Hubbard <jhubbard@...dia.com>, Dan Williams <danw@...dia.com>, "Anuj Aggarwal
(SW-GPU)" <anuaggarwal@...dia.com>, Matt Ochs <mochs@...dia.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v16 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for
grace hopper
>> Signed-off-by: Ankit Agrawal <ankita@...dia.com>
>> Signed-off-by: Aniket Agashe <aniketa@...dia.com>
>> Tested-by: Ankit Agrawal <ankita@...dia.com>
>
> Dunno about others, but I sure hope and assume the author tests ;)
> Sometimes I'm proven wrong.
Yeah, does not hurt to keep it then I suppose. :)
>> +
>> +#include "nvgrace_gpu_vfio_pci.h"
>
> Could probably just shorten this to nvgrace_gpu.h, but with just a
> single source file, we don't need a separate header. Put it inline here.
Ack.
>> +
>> +/* Choose the structure corresponding to the fake BAR with a given index. */
>> +struct mem_region *
>
> static
Yes.
>> + *
>> + * The available GPU memory size may not be power-of-2 aligned. Map up
>> + * to the size of the device memory. If the memory access is beyond the
>> + * actual GPU memory size, it will be handled by the vfio_device_ops
>> + * read/write.
>
> The phrasing "[m]ap up to the size" suggests the behavior of previous
> versions where we'd truncate mappings. Maybe something like:
>
> * The available GPU memory size may not be power-of-2 aligned.
> * The remainder is only backed by read/write handlers.
Got it. Will fix.
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->resmem.memlength),
>> + PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> + PCI_BASE_ADDRESS_MEM_PREFETCH,
>> + nvdev->resmem.u64_reg);
>> + if (copy_to_user(buf + copy_offset,
>> + (void *)&val64 + register_offset, copy_count))
>> + return -EFAULT;
>> + }
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(nvdev->usemem.memlength),
>> + PCI_BASE_ADDRESS_MEM_TYPE_64 |
>> + PCI_BASE_ADDRESS_MEM_PREFETCH,
>> + nvdev->usemem.u64_reg);
>> + if (copy_to_user(buf + copy_offset,
>> + (void *)&val64 + register_offset, copy_count))
>> + return -EFAULT;
>> + }
>
> Both read and write could be simplified a bit:
>
> if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(val64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(val64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> if (memregion) {
> val64 = nvgrace_gpu_get_read_value(roundup_pow_of_two(memregion->memlength),
> PCI_BASE_ADDRESS_MEM_TYPE_64 |
> PCI_BASE_ADDRESS_MEM_PREFETCH,
> memregion->u64_reg);
> if (copy_to_user(buf + copy_offset,
> (void *)&val64 + register_offset, copy_count))
> return -EFAULT;
> }
Yes thanks, will make the change.
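One detail I'll take care of while reworking it: memregion needs a NULL
initializer so that an access intersecting neither fake BAR skips the
emulation and falls through to the core read handler, i.e. (sketch):

        /* NULL indicates the access overlaps neither emulated BAR */
        struct mem_region *memregion = NULL;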
>> +static ssize_t
>> +nvgrace_gpu_write_config_emu(struct vfio_device *core_vdev,
>> + const char __user *buf, size_t count, loff_t *ppos)
>> +{
>> + struct nvgrace_gpu_vfio_pci_core_device *nvdev =
>> + container_of(core_vdev, struct nvgrace_gpu_vfio_pci_core_device,
>> + core_device.vdev);
>> + u64 pos = *ppos & VFIO_PCI_OFFSET_MASK;
>> + size_t register_offset;
>> + loff_t copy_offset;
>> + size_t copy_count;
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
>> + &copy_offset, &copy_count,
>> + &register_offset)) {
>> + if (copy_from_user((void *)&nvdev->resmem.u64_reg + register_offset,
>> + buf + copy_offset, copy_count))
>> + return -EFAULT;
>> + *ppos += copy_count;
>> + return copy_count;
>> + }
>> +
>> + if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
>> + &copy_offset, &copy_count, &register_offset)) {
>> + if (copy_from_user((void *)&nvdev->usemem.u64_reg + register_offset,
>> + buf + copy_offset, copy_count))
>> + return -EFAULT;
>> + *ppos += copy_count;
>> + return copy_count;
>> + }
>
> Likewise:
>
> if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_2, sizeof(u64),
> &copy_offset, &copy_count,
> &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(RESMEM_REGION_INDEX, nvdev);
> else if (range_intersect_range(pos, count, PCI_BASE_ADDRESS_4, sizeof(u64),
> &copy_offset, &copy_count, &register_offset))
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(USEMEM_REGION_INDEX, nvdev);
>
> if (memregion) {
> if (copy_from_user((void *)&memregion->u64_reg + register_offset,
> buf + copy_offset, copy_count))
> return -EFAULT;
>
> *ppos += copy_count;
> return copy_count;
> }
Ack.
>> +
>> + return vfio_pci_core_write(core_vdev, buf, count, ppos);
>> +}
>> +
>> +/*
>> + * Ad hoc map the device memory in the module kernel VA space. Primarily needed
>> + * to support QEMU's device x-no-mmap=on option.
>
> In general we try not to assume QEMU is the userspace driver. This
> certainly supports x-no-mmap=on in QEMU, but this is needed because
> vfio does not require the userspace driver to only perform accesses
> through mmaps of the vfio-pci BAR regions and existing userspace driver
> precedent requires read/write implementations.
Makes sense. Will rephrase it.
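Perhaps something along these lines (draft wording, based on your note):

        /*
         * Ad hoc map the device memory in the module kernel VA space.
         * Primarily needed as vfio does not require the userspace driver
         * to only perform accesses through mmaps of the vfio-pci BAR
         * regions, so such accesses must be backed by the read/write
         * vfio_device_ops implementations.
         */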
>> + *
>> + * The usemem region is cacheable memory and hence is memremapped.
>> + * The resmem region is non-cached and is mapped using ioremap_wc (NORMAL_NC).
>> + */
>> +static int
>> +nvgrace_gpu_map_device_mem(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> + int index)
>> +{
>> + int ret = 0;
>> +
>> + mutex_lock(&nvdev->remap_lock);
>> + if (index == USEMEM_REGION_INDEX &&
>> + !nvdev->usemem.bar_remap.memaddr) {
>> + nvdev->usemem.bar_remap.memaddr =
>> + memremap(nvdev->usemem.memphys,
>> + nvdev->usemem.memlength, MEMREMAP_WB);
>> + if (!nvdev->usemem.bar_remap.memaddr)
>> + ret = -ENOMEM;
>> + } else if (index == RESMEM_REGION_INDEX &&
>> + !nvdev->resmem.bar_remap.ioaddr) {
>> + nvdev->resmem.bar_remap.ioaddr =
>> + ioremap_wc(nvdev->resmem.memphys,
>> + nvdev->resmem.memlength);
>> + if (!nvdev->resmem.bar_remap.ioaddr)
>> + ret = -ENOMEM;
>> + }
>
> With an anonymous union we could reduce a lot of the redundancy here:
>
> struct mem_region *memregion;
> int ret = 0;
>
> memregion = nvgrace_gpu_vfio_pci_fake_bar_mem_region(index, nvdev);
> if (!memregion)
> return -EINVAL;
>
> mutex_lock(&nvdev->remap_lock);
>
> if (memregion->memaddr)
> goto unlock;
>
> if (index == USEMEM_REGION_INDEX)
> memregion->memaddr = memremap(memregion->memphys,
> memregion->memlength,
> MEMREMAP_WB);
> else
> memregion->ioaddr = ioremap_wc(memregion->memphys,
> memregion->memlength);
>
> if (!memregion->memaddr)
> ret = -ENOMEM;
>
> unlock:
> ...
Great suggestion, thanks. Will update.
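For reference, I'm thinking of folding the two mapping pointers into the
struct like this (sketch):

        struct mem_region {
                phys_addr_t memphys;    /* Base physical address of the region */
                size_t memlength;       /* Region size */
                __le64 u64_reg;         /* Emulated BAR offset registers */
                union {                 /* Anonymous union: members alias */
                        void *memaddr;
                        void __iomem *ioaddr;
                };
        };

That way the common NULL/-ENOMEM checks can all go through
memregion->memaddr for both mapping flavors, as in your snippet.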
> BTW, why does this function have args (nvdev, index) but
> nvgrace_gpu_vfio_pci_fake_bar_mem_region has args (index, nvdev)?
It shouldn't; I missed keeping the argument order consistent. Will fix.
> nvgrace_gpu_vfio_pci_fake_bar_mem_region could also be shortened to just
> nvgrace_gpu_memregion and I think we could use nvgrace_gpu in place of
> nvgrace_gpu_vfio_pci for function names throughout.
Ack.
>> +static int
>> +nvgrace_gpu_map_and_read(struct nvgrace_gpu_vfio_pci_core_device *nvdev,
>> + char __user *buf, size_t mem_count, loff_t *ppos)
>> +{
>> + unsigned int index = VFIO_PCI_OFFSET_TO_INDEX(*ppos);
>> + u64 offset = *ppos & VFIO_PCI_OFFSET_MASK;
>> + int ret = 0;
>> +
>> + /*
>> + * Handle read on the BAR regions. Map to the target device memory
>> + * physical address and copy to the request read buffer.
>> + */
>> + ret = nvgrace_gpu_map_device_mem(nvdev, index);
>> + if (ret)
>> + return ret;
>> +
>
>
> This seems like a good place for a comment regarding COMMAND_MEM being
> ignored, especially since we're passing 'false' for test_mem in the
> second branch.
Good point. Will add the comment.
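Roughly along these lines (wording to be refined):

        /*
         * Accesses are serviced irrespective of the (virtual)
         * PCI_COMMAND memory enable state since the fake BARs are
         * backed by device/system memory rather than real PCI BARs;
         * hence test_mem is passed as false below.
         */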
>> + *
>> + * A read from a negative or an offset greater than reported size, a negative
>> + * count are considered error conditions and returned with an -EINVAL.
>
> This needs some phrasing help, I can't parse.
Yeah, I'll itemize the error conditions to make it more readable.
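Something like:

        /*
         * The following are considered error conditions and return
         * -EINVAL:
         * - a negative offset
         * - an offset greater than the reported memory size
         * - a negative count
         */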
>> +static int nvgrace_gpu_vfio_pci_probe(struct pci_dev *pdev,
>> + const struct pci_device_id *id)
>> +{
>> + struct nvgrace_gpu_vfio_pci_core_device *nvdev;
>> + int ret;
>> +
>> + nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device, core_device.vdev,
>> + &pdev->dev, &nvgrace_gpu_vfio_pci_ops);
>> + if (IS_ERR(nvdev))
>> + return PTR_ERR(nvdev);
>> +
>> + dev_set_drvdata(&pdev->dev, nvdev);
>> +
>> + ret = nvgrace_gpu_vfio_pci_fetch_memory_property(pdev, nvdev);
>> + if (ret)
>> + goto out_put_vdev;
>
> As a consequence of exposing the device differently in the host vs
> guest, we need to consider nested assignment here. The device table
> below will force userspace to select this driver for the device, but
> binding to it would fail because these bare metal properties are not
> present. We addressed this in the virtio-vfio-pci driver and decided
> the driver needs to support the device regardless of the availability
> of support for the legacy aspects of that driver. There's no protocol
> defined for userspace to pick a second best driver for a device.
>
> Therefore, like virtio-vfio-pci, this should be able to register a
> straight vfio-pci-core ops when these bare metal properties are not
> present.
Sounds reasonable, will make the change.
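i.e. have probe pick the ops based on whether the bare metal properties
are present, along these lines (sketch; nvgrace_gpu_vfio_pci_core_ops
would be a new ops struct wiring up the plain vfio-pci-core handlers,
and the property name is assumed from the fetch helper):

        const struct vfio_device_ops *ops;

        /*
         * In a VM the bare metal device-tree properties are absent;
         * fall back to plain vfio-pci-core ops so the device can still
         * be bound and (nested) assigned.
         */
        if (device_property_present(&pdev->dev, "nvidia,gpu-mem-base-pa"))
                ops = &nvgrace_gpu_vfio_pci_ops;
        else
                ops = &nvgrace_gpu_vfio_pci_core_ops;

        nvdev = vfio_alloc_device(nvgrace_gpu_vfio_pci_core_device,
                                  core_device.vdev, &pdev->dev, ops);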
>> +struct mem_region {
>> + phys_addr_t memphys; /* Base physical address of the region */
>> + size_t memlength; /* Region size */
>> + __le64 u64_reg; /* Emulated BAR offset registers */
>
> s/u64_reg/bar_val/ ?
Fine with me.
> We could also include bar_size so we don't recalculate the power-of-2 size.
Sure.
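i.e. compute it once when the memory properties are fetched (field name
tentative):

        /* Cache the power-of-2 aligned fake BAR size at probe time */
        memregion->bar_size = roundup_pow_of_two(memregion->memlength);

so the config read/write paths can use memregion->bar_size directly
instead of recalculating it on every access.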