Message-ID: <YmgheiPOApuiLcK6@lpieralisi>
Date: Tue, 26 Apr 2022 17:44:42 +0100
From: Lorenzo Pieralisi <lorenzo.pieralisi@....com>
To: Dexuan Cui <decui@...rosoft.com>
Cc: Jake Oshins <jakeo@...rosoft.com>,
Bjorn Helgaas <helgaas@...nel.org>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>,
Alex Williamson <alex.williamson@...hat.com>,
"wei.liu@...nel.org" <wei.liu@...nel.org>,
KY Srinivasan <kys@...rosoft.com>,
Haiyang Zhang <haiyangz@...rosoft.com>,
Stephen Hemminger <sthemmin@...rosoft.com>,
"linux-hyperv@...r.kernel.org" <linux-hyperv@...r.kernel.org>,
"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"Michael Kelley (LINUX)" <mikelley@...rosoft.com>,
"robh@...nel.org" <robh@...nel.org>, "kw@...ux.com" <kw@...ux.com>,
"kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: Re: [PATCH] PCI: hv: Do not set PCI_COMMAND_MEMORY to reduce VM boot
time
On Thu, Apr 21, 2022 at 06:26:36PM +0000, Dexuan Cui wrote:
> > From: Jake Oshins <jakeo@...rosoft.com>
> > Sent: Wednesday, April 20, 2022 9:01 AM
> > To: Bjorn Helgaas <helgaas@...nel.org>; Dexuan Cui <decui@...rosoft.com>
> > Cc: wei.liu@...nel.org; KY Srinivasan <kys@...rosoft.com>; Haiyang Zhang
> > <haiyangz@...rosoft.com>; Stephen Hemminger <sthemmin@...rosoft.com>;
> > lorenzo.pieralisi@....com; bhelgaas@...gle.com;
> > linux-hyperv@...r.kernel.org; linux-pci@...r.kernel.org;
> > linux-kernel@...r.kernel.org; Michael Kelley (LINUX)
> > <mikelley@...rosoft.com>; robh@...nel.org; kw@...ux.com; Alex Williamson
> > <alex.williamson@...hat.com>; .kvm@...r.kernel.org
> > Subject: RE: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set
> > PCI_COMMAND_MEMORY to reduce VM boot time
>
> I removed the "[EXTERNAL]" tag as it looks like that prevents the email from being
> archived properly. See this link for the archived email thread:
> https://lwn.net/ml/linux-kernel/20220419220007.26550-1-decui%40microsoft.com/
>
> > > -----Original Message-----
> > > From: Bjorn Helgaas <helgaas@...nel.org>
> > > Sent: Wednesday, April 20, 2022 8:36 AM
> > > To: Dexuan Cui <decui@...rosoft.com>
> > > Cc: wei.liu@...nel.org; KY Srinivasan <kys@...rosoft.com>; Haiyang Zhang
> > > <haiyangz@...rosoft.com>; Stephen Hemminger
> > <sthemmin@...rosoft.com>;
> > > lorenzo.pieralisi@....com; bhelgaas@...gle.com; linux-
> > > hyperv@...r.kernel.org; linux-pci@...r.kernel.org; linux-
> > > kernel@...r.kernel.org; Michael Kelley (LINUX) <mikelley@...rosoft.com>;
> > > robh@...nel.org; kw@...ux.com; Jake Oshins <jakeo@...rosoft.com>; Alex
> > > Williamson <alex.williamson@...hat.com>; .kvm@...r.kernel.org
>
> I removed the period from ".kvm@" so the KVM list is correctly Cc'd.
>
> > > Subject: [EXTERNAL] Re: [PATCH] PCI: hv: Do not set
> > PCI_COMMAND_MEMORY
> > > to reduce VM boot time
> > >
> > > [+cc Alex, kvm in case this idea is applicable for more than just Hyper-V]
>
> Alex, your comments are welcome!
>
> > > On Tue, Apr 19, 2022 at 03:00:07PM -0700, Dexuan Cui wrote:
> > > > A VM on Azure can have 14 GPUs, and each GPU may have a huge MMIO BAR,
> > > > e.g. 128 GB. Currently the boot time of such a VM can be 4+ minutes,
> > > > and most of the time is used by the host to unmap/map the vBAR from/to
> > > > pBAR when the VM clears and sets the PCI_COMMAND_MEMORY bit: each
> > > > unmap/map operation for a 128GB BAR needs about 1.8 seconds, and the
> > > > pci-hyperv driver and the Linux PCI subsystem flip the
> > > > PCI_COMMAND_MEMORY bit eight times (see pci_setup_device() ->
> > > > pci_read_bases() and pci_std_update_resource()), increasing the boot
> > > > time by 1.8 * 8 = 14.4 seconds per GPU, i.e. 14.4 * 14 = 201.6 seconds
> > > > in total.
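
(For reference, the arithmetic above as a trivial standalone check; the
1.8 s per unmap/map figure is the measurement quoted in the commit
message, not something derived here.)

#include <stdio.h>

int main(void)
{
	const double per_op = 1.8;  /* seconds per unmap/map of a 128 GB vBAR */
	const int flips = 8;        /* PCI_COMMAND_MEMORY toggles during probe */
	const int gpus = 14;

	printf("per GPU : %.1f s\n", per_op * flips);              /* 14.4  */
	printf("per VM  : %.1f s\n", per_op * flips * gpus);       /* 201.6 */
	printf("saved   : %.1f s\n", per_op * (flips - 1) * gpus); /* 176.4 */
	return 0;
}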
> > > >
> > > > Fix the slowness by not turning on the PCI_COMMAND_MEMORY bit in
> > > > pci-hyperv.c, so the bit stays off until the PCI device driver calls
> > > > pci_enable_device(): while the bit is off, pci_read_bases() and
> > > > pci_std_update_resource() don't cause Hyper-V to unmap/map the vBARs.
> > > > With this change, the boot time of such a VM is reduced by
> > > > 1.8 * (8-1) * 14 = 176.4 seconds.
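
(To illustrate where the bit does get turned on after this change: a
minimal, hypothetical guest driver probe. pci_enable_device() and
pci_set_master() are the real APIs; everything else here is made up for
the sketch.)

#include <linux/module.h>
#include <linux/pci.h>

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int rc;

	/*
	 * pci_enable_device() is what finally sets the memory (and I/O)
	 * decode bits, so with this patch Hyper-V maps vBAR -> pBAR once,
	 * here, instead of repeatedly during BAR probing.
	 */
	rc = pci_enable_device(pdev);
	if (rc)
		return rc;

	pci_set_master(pdev);
	return 0;
}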
> > > >
> > > > Tested-by: Boqun Feng (Microsoft) <boqun.feng@...il.com>
> > > > Signed-off-by: Dexuan Cui <decui@...rosoft.com>
> > > > Cc: Jake Oshins <jakeo@...rosoft.com>
> > > > ---
> > > > drivers/pci/controller/pci-hyperv.c | 17 +++++++++++------
> > > > 1 file changed, 11 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> > > > index d270a204324e..f9fbbd8d94db 100644
> > > > --- a/drivers/pci/controller/pci-hyperv.c
> > > > +++ b/drivers/pci/controller/pci-hyperv.c
> > > > @@ -2082,12 +2082,17 @@ static void prepopulate_bars(struct hv_pcibus_device *hbus)
> > > >  				}
> > > >  			}
> > > >  			if (high_size <= 1 && low_size <= 1) {
> > > > -				/* Set the memory enable bit. */
> > > > -				_hv_pcifront_read_config(hpdev, PCI_COMMAND, 2,
> > > > -							 &command);
> > > > -				command |= PCI_COMMAND_MEMORY;
> > > > -				_hv_pcifront_write_config(hpdev, PCI_COMMAND, 2,
> > > > -							  command);
> > > > +				/*
> > > > +				 * No need to set the PCI_COMMAND_MEMORY bit as
> > > > +				 * the core PCI driver doesn't require the bit
> > > > +				 * to be pre-set. Actually here we intentionally
> > > > +				 * keep the bit off so that the PCI BAR probing
> > > > +				 * in the core PCI driver doesn't cause Hyper-V
> > > > +				 * to unnecessarily unmap/map the virtual BARs
> > > > +				 * from/to the physical BARs multiple times.
> > > > +				 * This reduces the VM boot time significantly
> > > > +				 * if the BAR sizes are huge.
> > > > +				 */
> > > >  				break;
> > > >  			}
> > > >  		}
> > > > --
> > > > 2.17.1
> > > >
> >
> > My question about this, directed at the people who know a lot more about the
> > PCI subsystem in Linux than I do (Bjorn, Alex, etc.), is whether this change can
> > create a problem.
>
> Bjorn, Lorenzo, Alex, it would be nice to have your insights!
>
> > In a physical computer, you might sometimes want to move
> > a device from one part of the address space to another, typically when
> > another device is being hot-added to the system. Imagine a scenario where
> > you have, in this case, a VM with perhaps 15 NVMe SSDs passed through to
> > it. One of them dies and so it is removed. The replacement SSD has a
> > larger BAR than the one that was removed (because it's a different SKU) and
> > now it doesn't fit neatly into the hole that was vacated by the previous SSD.
> >
> > In this case, will the kernel ever attempt to move some of the existing SSDs to
> > make room for the new one? And if so, does this come with the requirement
> > that they actually get mapped out of the address space while they are being
> > moved? (This was the scenario that prompted me to write the code above as
> > it was, originally.)
> >
> > Thanks,
> > Jake Oshins
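
(On the "mapped out of the address space while they are being moved"
part: as far as I can tell the PCI core cannot rewrite a 64-bit memory
BAR atomically, so it clears memory decode around the two config writes
and then restores the original PCI_COMMAND value. Below is a simplified
sketch of that pattern, not the actual drivers/pci/setup-res.c code; it
also shows why a device bootstrapped with PCI_COMMAND_MEMORY clear never
gets decode enabled just because its BAR is moved.)

#include <linux/kernel.h>
#include <linux/pci.h>

static void sketch_update_bar64(struct pci_dev *dev, int reg, u64 val)
{
	u16 cmd;

	pci_read_config_word(dev, PCI_COMMAND, &cmd);
	/* Turn decode off while the BAR is half-written... */
	pci_write_config_word(dev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

	pci_write_config_dword(dev, reg, lower_32_bits(val));
	pci_write_config_dword(dev, reg + 4, upper_32_bits(val));

	/*
	 * ...then restore the original command value: if the memory bit
	 * was clear to begin with, it stays clear throughout.
	 */
	pci_write_config_word(dev, PCI_COMMAND, cmd);
}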
>
> Sorry I don't quite follow. pci-hyperv allocates MMIO for the bridge
> window in hv_pci_allocate_bridge_windows() and registers the MMIO
> ranges to the core PCI driver via pci_add_resource(), and later the
> core PCI driver probes the bus/device(s), validates the BAR sizes and
> the pre-initialized BAR values, and uses the BAR configuration. IMO
> the whole process doesn't require the PCI_COMMAND_MEMORY bit to be
> pre-set, and there should be no issue in delaying setting the bit until
> the PCI device driver's .probe() -> pci_enable_device().
IIUC you want to bootstrap devices with PCI_COMMAND_MEMORY clear
(otherwise the PCI core would toggle it on and off, e.g. for BAR sizing).
Is that correct?

If I read the PCI core correctly, PCI_COMMAND_MEMORY is cleared only if
it is set in the first place, and that's what your patch is changing:
you bootstrap your devices with PCI_COMMAND_MEMORY clear so that the
PCI core does not touch it.

I am not familiar with the Hyper-V code, so forgive the question.

Thanks,
Lorenzo
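
(For reference, a simplified sketch of the BAR-sizing pattern described
above, i.e. the logic around pci_read_bases()/__pci_read_base();
illustrative only, not the exact drivers/pci/probe.c code.)

#include <linux/pci.h>

static void sketch_read_bases(struct pci_dev *dev)
{
	u16 orig_cmd;

	pci_read_config_word(dev, PCI_COMMAND, &orig_cmd);

	/* Decode is disabled only if it was enabled in the first place... */
	if (orig_cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
		pci_write_config_word(dev, PCI_COMMAND,
				      orig_cmd & ~(PCI_COMMAND_MEMORY | PCI_COMMAND_IO));

	/* ...write all-ones to each BAR and read back the size mask... */

	/*
	 * ...and restored afterwards.  A device bootstrapped with the bits
	 * clear therefore sees no PCI_COMMAND writes at all in this path.
	 */
	if (orig_cmd & (PCI_COMMAND_MEMORY | PCI_COMMAND_IO))
		pci_write_config_word(dev, PCI_COMMAND, orig_cmd);
}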
>
> When a device is removed, hv_eject_device_work() ->
> pci_stop_and_remove_bus_device() triggers the PCI device driver's .remove(),
> which calls pci_disable_device(), which instructs the hypervisor to unmap
> the vBAR from the pBAR, so there should not be any issue when a new device
> is added later.
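
(Minimal illustration of the removal path described above; the vBAR
unmap is the Hyper-V-side behaviour Dexuan describes, not something this
snippet performs itself, and demo_remove() is a hypothetical driver
callback.)

#include <linux/pci.h>

static void demo_remove(struct pci_dev *pdev)
{
	/*
	 * Driver teardown; per the description above, device removal on
	 * Hyper-V ends with the host unmapping the vBAR from the pBAR,
	 * so a later hot-add starts from a clean slate.
	 */
	pci_disable_device(pdev);
}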
>
> With the patch, if no PCI device driver is loaded, the PCI_COMMAND_MEMORY
> bit stays in the off state, and the hypervisor doesn't map the vBAR to the pBAR.
> I don't see any issue with this.
>
> hv_pci_allocate_bridge_windows() -> vmbus_allocate_mmio() makes sure
> that there is enough MMIO space for the devices on the bus, so the core
> PCI driver doesn't have to worry about any MMIO conflict issue.
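
(Rough sketch of the window setup described above. vmbus_allocate_mmio()
and pci_add_resource() are the real interfaces; the container structure,
field names and parameters below are illustrative, not the actual
pci-hyperv.c code.)

#include <linux/hyperv.h>
#include <linux/pci.h>

/* Hypothetical stand-in for the real hv_pcibus_device. */
struct demo_hv_bus {
	struct hv_device *hdev;
	struct resource *mmio;
	struct list_head windows;	/* must be INIT_LIST_HEAD()'d first */
};

static int demo_allocate_bridge_window(struct demo_hv_bus *bus,
				       resource_size_t size)
{
	int ret;

	/* Carve an MMIO range out of the VM's address space... */
	ret = vmbus_allocate_mmio(&bus->mmio, bus->hdev, 0, -1, size,
				  0x1000 /* align */, false);
	if (ret)
		return ret;

	/*
	 * ...and hand it to the PCI core as a root-bus window, so the
	 * core never has to resolve MMIO conflicts on its own.
	 */
	pci_add_resource(&bus->windows, bus->mmio);
	return 0;
}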
>
> Please let me know if I understood and answered the questions.
>
> Thanks,
> -- Dexuan
>