Message-ID: <CAHTA-uZtRzFOuo7vZCjoLF3_n0CCy3+0U0r_deB3jFF0cPivnw@mail.gmail.com>
Date: Wed, 8 Jan 2025 17:06:18 -0600
From: Mitchell Augustin <mitchell.augustin@...onical.com>
To: Alex Williamson <alex.williamson@...hat.com>
Cc: linux-pci@...r.kernel.org, kvm@...r.kernel.org,
Bjorn Helgaas <bhelgaas@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM boot
with passthrough of large BAR Nvidia GPUs on DGX H100
Hi Alex,
While waiting for
https://lore.kernel.org/all/20241218224258.2225210-1-mitchell.augustin@canonical.com/
to be reviewed, I was thinking more about the slowness of
pci_write_config_<size>() itself in my use case.
You mentioned this earlier in the thread:
> It doesn't take into account that toggling the command register bit is not a trivial operation in a virtualized environment.
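For context, the command register toggle in question is the memory/IO
decode disable that __pci_read_base() does around BAR sizing. Roughly
paraphrasing that pattern (this is a sketch, not the literal
drivers/pci/probe.c code):

    u16 orig_cmd;

    /* PCI_COMMAND_DECODE_ENABLE is (PCI_COMMAND_IO | PCI_COMMAND_MEMORY) */
    pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);
    if (orig_cmd & PCI_COMMAND_DECODE_ENABLE)
            pci_write_config_word(dev, PCI_COMMAND,
                                  orig_cmd & ~PCI_COMMAND_DECODE_ENABLE);

    /* ... write ~0 to the BAR and read back the size mask ... */

    if (orig_cmd & PCI_COMMAND_DECODE_ENABLE)
            pci_write_config_word(dev, PCI_COMMAND, orig_cmd);

Each of those pci_write_config_word() calls is the kind of individual
config write that is slow in my passthrough case.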
The thing that I don't understand about this is why the speed of this
toggle (an individual pci_write_config_*() call) would differ from one
passed-through GPU to another. On one of my other machines with a
different GPU, I didn't see any PCI config register write slowness
during boot with passthrough. Notably, that other GPU has much less
VRAM (and is not an Nvidia GPU). Scaling issues due to larger GPU
memory space would make sense to me if the slowdown were in some
function whose number of operations is bound by device memory, but it
is unclear to me whether that is relevant here: as far as I can tell,
no such relationship exists in pci_write_config_*() itself, since it
just writes a single value to a single configuration register
regardless of the underlying platform. (It appears entirely atomic,
bound only by how long it takes to acquire the lock around the
register.) All I can hypothesize is that maybe that lock acquisition
has to wait for some hardware-implemented operation whose runtime is
bound by memory size, but that is just my best guess.
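For reference, my reading of the config-write path on the kernel side
(a loose paraphrase of the accessor generated by PCI_OP_WRITE() in
drivers/pci/access.c, not the exact source) is just:

    /* Loose paraphrase of drivers/pci/access.c, not the literal code. */
    int pci_bus_write_config_word(struct pci_bus *bus, unsigned int devfn,
                                  int pos, u16 value)
    {
            unsigned long flags;
            int res;

            if (pos & 1)    /* misaligned word register offset */
                    return PCIBIOS_BAD_REGISTER_NUMBER;

            /* pci_lock serializes all config space accesses */
            raw_spin_lock_irqsave(&pci_lock, flags);
            res = bus->ops->write(bus, devfn, pos, 2, value);
            raw_spin_unlock_irqrestore(&pci_lock, flags);

            return res;
    }

Nothing in that path scales with BAR or VRAM size, so whatever is
device-dependent would have to be below the bus ->write() hook, or in
whatever services the trapped config access on the host side.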
Is there anything you can think of, triggered by the
pci_write_config_*() call alone, that might cause device-dependent
behavior here, or is this likely something I will just need to raise
with Nvidia?
Thanks,
Mitchell Augustin
On Thu, Dec 5, 2024 at 6:09 PM Mitchell Augustin
<mitchell.augustin@...onical.com> wrote:
>
> I submitted a patch that addresses this issue, which I want to link
> to in this thread:
> https://lore.kernel.org/all/20241206000351.884656-1-mitchell.augustin@canonical.com/
> - everything looks good with it on my end.
>
> -Mitchell Augustin
>
>
> On Tue, Dec 3, 2024 at 5:30 PM Alex Williamson
> <alex.williamson@...hat.com> wrote:
> >
> > On Tue, 3 Dec 2024 17:09:07 -0600
> > Mitchell Augustin <mitchell.augustin@...onical.com> wrote:
> >
> > > Thanks for the suggestions!
> > >
> > > > The calling convention of __pci_read_base() is already changing if we're having the caller disable decoding
> > >
> > > The way I implemented that in my initial patch draft[0] still allows
> > > for __pci_read_base() to be called independently, as it was
> > > originally, since (as far as I understand) the decode disable/enable
> > > is just a mask - so I didn't need to remove the disable/enable inside
> > > __pci_read_base(), and instead just added an extra one in
> > > pci_read_bases(), turning the __pci_read_base() disable/enable into a
> > > no-op when called from pci_read_bases(). In any case...
> > >
> > > > I think maybe another alternative that doesn't hold off the console would be to split the BAR sizing and resource processing into separate steps.
> > >
> > > This seems like a potentially better option, so I'll dig into that approach.
> > >
> > >
> > > Providing some additional info you requested last week, just for more context:
> > >
> > > > Do you have similar logs from that [hotplug] operation
> > >
> > > Attached [1] is the guest boot output (boot was quick, since no GPUs
> > > were attached at boot time).
> >
> > I think what's happening here is that decode is already disabled on the
> > hot-added device (vs enabled by the VM firmware on cold-plug), so in
> > practice it's similar to your nested disable solution. Thanks,
> >
> > Alex
> >
>
>
> --
> Mitchell Augustin
> Software Engineer - Ubuntu Partner Engineering
--
Mitchell Augustin
Software Engineer - Ubuntu Partner Engineering