Message-ID: <CAHTA-uYDffh1GkPsi-UQcMV4qskN2aT+PhqNENWpUUspPB7uaw@mail.gmail.com>
Date: Mon, 13 Jan 2025 13:43:30 -0600
From: Mitchell Augustin <mitchell.augustin@...onical.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: linux-pci@...r.kernel.org, kvm@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM boot
with passthrough of large BAR Nvidia GPUs on DGX H100
Thank you, Alex. That makes more sense now.
> Potentially the huge pfnmap support that we've introduced in v6.12 can help us here if we're faulting the mappings on PUD or PMD levels, then we should be able to insert the same size mappings into the IOMMU. I'm hoping we can begin to make such optimizations now.
On November 26th, I did try my reproducer with the 6.12 kernel on both
the guest and the host, using qemu-9.2.0-rc1, and I did not see any
improvement in PCI init time when my devices were attached during
boot. Just to make sure I understand: are you saying there are still
steps to be implemented before the huge pfnmap support would show
gains for this issue, or should I theoretically have seen that
improvement in my earlier test?
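
For my own intuition, here is a rough sketch of why the fault
granularity should matter so much on this path. The 128 GiB BAR size
is just an assumed example (not a measurement from my setup), and the
program only counts mappings; it does not model the actual vfio code:

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/*
 * Back-of-the-envelope: number of IOMMU map operations needed to cover
 * one large BAR, depending on the granularity at which the pfnmap is
 * faulted and therefore mapped. Illustrative only.
 */
int main(void)
{
    const uint64_t bar_size = 128ULL << 30;  /* assumed example BAR size */
    const struct { const char *name; uint64_t size; } gran[] = {
        { "4 KiB (PTE)", 4ULL << 10 },
        { "2 MiB (PMD)", 2ULL << 20 },
        { "1 GiB (PUD)", 1ULL << 30 },
    };

    for (size_t i = 0; i < sizeof(gran) / sizeof(gran[0]); i++)
        printf("%-12s -> %10llu mappings\n", gran[i].name,
               (unsigned long long)(bar_size / gran[i].size));

    return 0;
}

That works out to roughly 33.5 million PTE-sized mappings versus 65536
PMD-sized or 128 PUD-sized ones per BAR, so if the mappings on this
path are still being inserted at 4 KiB granularity, the lack of
improvement I saw would not be surprising.
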
-Mitchell
On Mon, Jan 13, 2025 at 12:22 PM Alex Williamson
<alex.williamson@...hat.com> wrote:
>
> On Wed, 8 Jan 2025 17:06:18 -0600
> Mitchell Augustin <mitchell.augustin@...onical.com> wrote:
>
> > Hi Alex,
> >
> > While waiting for
> > https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
> > to be reviewed, I was thinking more about the slowness of
> > pci_write_config_<size>() itself in my use case.
> >
> > You mentioned this earlier in the thread:
> >
> > > It doesn't take into account that toggling the command register bit is not a trivial operation in a virtualized environment.
> >
> > The thing that I don't understand about this is why the speed for this
> > toggle (an individual pci_write_config_*() call) would be different
> > for one passed-through GPU than for another. On one of my other
> > machines with a different GPU, I didn't see any PCI config register
> > write slowness during boot with passthrough. Notably, that other GPU
> > does have much less VRAM (and is not an Nvidia GPU). While scaling
> > issues due to larger GPU memory space would make sense to me if the
> > slowdown was in some function whose number of operations was bound by
> > device memory, it is unclear to me if that is relevant here, since as
> > far as I can tell, no such relationship exists in pci_write_config_*()
> > itself since it is just writing a single value to a single
> > configuration register regardless of the underlying platform. (It
> > appears entirely atomic, and only bound by how long it takes to
> > acquire the lock around the register.) All I can hypothesize is that
> > maybe that lock acquisition needs to wait for some
> > hardware-implemented operation whose runtime is bound by memory size,
> > but that is just my best guess.
> >
> > Is there anything you can think of that is triggered by the
> > pci_write_config_*() alone that you think might cause device-dependent
> > behavior here, or is this likely something that I will just need to
> > raise with Nvidia?
>
> The slowness is proportional to the size of the device MMIO address
> space. In QEMU, follow the path of pci_default_write_config(). It's
> not simply the config space write, but the fact that the config space
> write needs to populate the device memory into the guest address space.
> On memory_region_transaction_commit() affected memory listeners are
> called, for vfio this is vfio_listener_region_add(). At this point the
> device MMIO space is being added to the system_memory address space.
> Without a vIOMMU, devices also operate in this same address space,
> therefore the MMIO regions of the device need to be DMA mapped through
> the IOMMU. This is where I expect we have the bulk of the overhead as
> we iterate the pfnmaps and insert the IOMMU page tables.
>
> Potentially the huge pfnmap support that we've introduced in v6.12 can
> help us here if we're faulting the mappings on PUD or PMD levels, then
> we should be able to insert the same size mappings into the IOMMU. I'm
> hoping we can begin to make such optimizations now. Thanks,
>
> Alex
>
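
To make the asymmetry concrete for anyone skimming the thread: from
the guest's point of view, the expensive step is nothing more than the
standard memory-enable toggle, roughly the simplified sketch below
(example_enable_mmio is just an illustrative name, not an actual
function in the kernel):

#include <linux/pci.h>

/*
 * Simplified sketch of the guest-side view: enabling MMIO decode is a
 * single 16-bit write to the COMMAND register. Under passthrough,
 * that one write traps to QEMU, which then has to populate the
 * device's MMIO BARs in the guest address space and DMA-map them
 * through the IOMMU; for a huge BAR, that mapping is the expensive
 * part, as described above.
 */
static void example_enable_mmio(struct pci_dev *pdev)
{
    u16 cmd;

    pci_read_config_word(pdev, PCI_COMMAND, &cmd);
    cmd |= PCI_COMMAND_MEMORY;
    pci_write_config_word(pdev, PCI_COMMAND, cmd);
}

So the cost is on the host side of that trap, in the listener and
DMA-map path described above, rather than in the config space write
itself.
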
--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering