[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20241126170214.5717003f.alex.williamson@redhat.com>
Date: Tue, 26 Nov 2024 17:02:14 -0700
From: Alex Williamson <alex.williamson@...hat.com>
To: Mitchell Augustin <mitchell.augustin@...onical.com>
Cc: linux-pci@...r.kernel.org, kvm@...r.kernel.org, Bjorn Helgaas
<bhelgaas@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: drivers/pci: (and/or KVM): Slow PCI initialization during VM
boot with passthrough of large BAR Nvidia GPUs on DGX H100
On Tue, 26 Nov 2024 17:08:07 -0600
Mitchell Augustin <mitchell.augustin@...onical.com> wrote:
> > If the slowness is confined to the guest kernel boot, can you share the log of that boot with timestamps?
>
> It is confined to the guest. On hand, I have this log section from a
> past guest boot:
>
> [ 3.965790] pci 0000:00:00.0: [8086:29c0] type 00 class 0x060000
> conventional PCI endpoint
> [ 3.967800] pci 0000:00:01.0: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 3.974590] pci 0000:00:01.0: BAR 0 [mem 0x81c8e000-0x81c8efff]
> [ 3.980780] pci 0000:00:01.0: PCI bridge to [bus 01]
> [ 3.984128] pci 0000:00:01.0: bridge window [mem 0x81a00000-0x81bfffff]
> [ 3.985915] pci 0000:00:01.0: bridge window [mem
> 0x38e800000000-0x38efffffffff 64bit pref]
> [ 3.988884] pci 0000:00:01.1: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 3.992771] pci 0000:00:01.1: BAR 0 [mem 0x81c8d000-0x81c8dfff]
> [ 3.999361] pci 0000:00:01.1: PCI bridge to [bus 02]
> [ 4.000276] pci 0000:00:01.1: bridge window [mem 0x81800000-0x819fffff]
> [ 4.001849] pci 0000:00:01.1: bridge window [mem
> 0x38f000000000-0x38f7ffffffff 64bit pref]
> [ 4.009442] pci 0000:00:01.2: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 4.012772] pci 0000:00:01.2: BAR 0 [mem 0x81c8c000-0x81c8cfff]
> [ 4.016781] pci 0000:00:01.2: PCI bridge to [bus 03]
> [ 4.020299] pci 0000:00:01.2: bridge window [mem 0x81600000-0x817fffff]
> [ 4.021780] pci 0000:00:01.2: bridge window [mem
> 0x38f800000000-0x38ffffffffff 64bit pref]
> [ 4.024680] pci 0000:00:01.3: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 4.027299] pci 0000:00:01.3: BAR 0 [mem 0x81c8b000-0x81c8bfff]
> [ 4.036781] pci 0000:00:01.3: PCI bridge to [bus 04]
> [ 4.037646] pci 0000:00:01.3: bridge window [mem 0x81400000-0x815fffff]
> [ 4.040806] pci 0000:00:01.3: bridge window [mem
> 0x390000000000-0x3907ffffffff 64bit pref]
> [ 4.046317] pci 0000:00:01.4: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 4.049673] pci 0000:00:01.4: BAR 0 [mem 0x81c8a000-0x81c8afff]
> [ 4.053625] pci 0000:00:01.4: PCI bridge to [bus 05]
> [ 4.056830] pci 0000:00:01.4: bridge window [mem 0x81200000-0x813fffff]
> [ 4.059022] pci 0000:00:01.4: bridge window [mem
> 0x390800000000-0x390fffffffff 64bit pref]
> [ 4.061614] pci 0000:00:01.5: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
The above line is printed from pci_setup_device() which calls
pci_read_bases(dev, 2, PCI_ROM_ADDRESS1);
That alone calls __pci_read_base() three separate times, each time
disabling and re-enabling decode on the bridge.
> [ 7.616784] pci 0000:00:01.5: BAR 0 [mem 0x81c89000-0x81c89fff]
I think this is printed at the end of the first call to
__pci_read_base(), so nearly 3.5s once through. Twice again for about
another 7s puts us at 14+s below.
> [ 14.140789] pci 0000:00:01.5: PCI bridge to [bus 06]
> [ 14.142240] pci 0000:00:01.5: bridge window [mem 0x81000000-0x811fffff]
> [ 17.432805] pci 0000:00:01.5: bridge window [mem
> 0x380000000000-0x382002ffffff 64bit pref]
I think the prefetchable window on the bridge gets disabled and
re-enabled here, that's another.
So that's 4 times per calling pci_setup_device() on the bridge that
we're re-mapping the device's 128G BAR back into the VM address space,
that's 12-14s, times 4 GPUs and we're already approaching a minute.
> [ 17.436775] pci 0000:00:01.6: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 20.656786] pci 0000:00:01.6: BAR 0 [mem 0x81c88000-0x81c88fff]
> [ 27.044792] pci 0000:00:01.6: PCI bridge to [bus 07]
> [ 27.048775] pci 0000:00:01.6: bridge window [mem 0x80e00000-0x80ffffff]
> [ 30.240811] pci 0000:00:01.6: bridge window [mem
> 0x384000000000-0x386002ffffff 64bit pref]
> [ 30.244777] pci 0000:00:01.7: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 33.676789] pci 0000:00:01.7: BAR 0 [mem 0x81c87000-0x81c87fff]
> [ 40.524791] pci 0000:00:01.7: PCI bridge to [bus 08]
> [ 40.526249] pci 0000:00:01.7: bridge window [mem 0x80c00000-0x80dfffff]
> [ 43.876807] pci 0000:00:01.7: bridge window [mem
> 0x388000000000-0x38a002ffffff 64bit pref]
> [ 43.880776] pci 0000:00:02.0: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 47.184786] pci 0000:00:02.0: BAR 0 [mem 0x81c86000-0x81c86fff]
> [ 53.796794] pci 0000:00:02.0: PCI bridge to [bus 09]
> [ 53.798349] pci 0000:00:02.0: bridge window [mem 0x80a00000-0x80bfffff]
> [ 57.104809] pci 0000:00:02.0: bridge window [mem
> 0x38c000000000-0x38e002ffffff 64bit pref]
> [ 57.109300] pci 0000:00:02.1: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 57.112270] pci 0000:00:02.1: BAR 0 [mem 0x81c85000-0x81c85fff]
> [ 57.115960] pci 0000:00:02.1: PCI bridge to [bus 0a]
> [ 57.116535] pci 0000:00:02.1: bridge window [mem 0x80800000-0x809fffff]
> [ 57.121443] pci 0000:00:02.1: bridge window [mem
> 0x391000000000-0x3917ffffffff 64bit pref]
> [ 57.123113] pci 0000:00:02.2: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 57.125263] pci 0000:00:02.2: BAR 0 [mem 0x81c84000-0x81c84fff]
> [ 57.128168] pci 0000:00:02.2: PCI bridge to [bus 0b]
> [ 57.128713] pci 0000:00:02.2: bridge window [mem 0x80600000-0x807fffff]
> [ 57.129619] pci 0000:00:02.2: bridge window [mem
> 0x391800000000-0x391fffffffff 64bit pref]
> [ 57.133671] pci 0000:00:02.3: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 57.135884] pci 0000:00:02.3: BAR 0 [mem 0x81c83000-0x81c83fff]
> [ 57.138198] pci 0000:00:02.3: PCI bridge to [bus 0c]
> [ 57.138749] pci 0000:00:02.3: bridge window [mem 0x80400000-0x805fffff]
> [ 57.139975] pci 0000:00:02.3: bridge window [mem
> 0x392000000000-0x3927ffffffff 64bit pref]
> [ 57.141558] pci 0000:00:02.4: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 57.146325] pci 0000:00:02.4: BAR 0 [mem 0x81c82000-0x81c82fff]
> [ 57.148744] pci 0000:00:02.4: PCI bridge to [bus 0d]
> [ 57.148795] pci 0000:00:02.4: bridge window [mem 0x80200000-0x803fffff]
> [ 57.149992] pci 0000:00:02.4: bridge window [mem
> 0x392800000000-0x392fffffffff 64bit pref]
> [ 57.151660] pci 0000:00:02.5: [1b36:000c] type 01 class 0x060400
> PCIe Root Port
> [ 57.157365] pci 0000:00:02.5: BAR 0 [mem 0x81c81000-0x81c81fff]
> [ 57.159859] pci 0000:00:02.5: PCI bridge to [bus 0e]
> [ 57.160374] pci 0000:00:02.5: bridge window [mem 0x80000000-0x801fffff]
> [ 57.161268] pci 0000:00:02.5: bridge window [mem
> 0x393000000000-0x3937ffffffff 64bit pref]
> [ 57.176995] pci 0000:00:1f.0: [8086:2918] type 00 class 0x060100
> conventional PCI endpoint
> [ 57.178146] pci 0000:00:1f.0: quirk: [io 0x0600-0x067f] claimed by
> ICH6 ACPI/GPIO/TCO
> [ 57.179162] pci 0000:00:1f.2: [8086:2922] type 00 class 0x010601
> conventional PCI endpoint
> [ 57.186370] pci 0000:00:1f.2: BAR 4 [io 0x6040-0x605f]
> [ 57.187830] pci 0000:00:1f.2: BAR 5 [mem 0x81c80000-0x81c80fff]
> [ 57.189464] pci 0000:00:1f.3: [8086:2930] type 00 class 0x0c0500
> conventional PCI endpoint
> [ 57.197117] pci 0000:00:1f.3: BAR 4 [io 0x6000-0x603f]
> [ 57.199155] acpiphp: Slot [0] registered
> [ 57.199695] pci 0000:01:00.0: [1af4:1041] type 00 class 0x020000
> PCIe Endpoint
> [ 57.202062] pci 0000:01:00.0: BAR 1 [mem 0x81a00000-0x81a00fff]
> [ 57.205640] pci 0000:01:00.0: BAR 4 [mem
> 0x38e800000000-0x38e800003fff 64bit pref]
> [ 57.213753] pci 0000:01:00.0: ROM [mem 0xfff80000-0xffffffff pref]
> [ 57.215477] pci 0000:00:01.0: PCI bridge to [bus 01]
> [ 57.216661] acpiphp: Slot [0-2] registered
> [ 57.216846] pci 0000:02:00.0: [1b36:000d] type 00 class 0x0c0330
> PCIe Endpoint
> [ 57.218123] pci 0000:02:00.0: BAR 0 [mem 0x81800000-0x81803fff 64bit]
> [ 57.221585] pci 0000:00:01.1: PCI bridge to [bus 02]
> [ 57.225522] acpiphp: Slot [0-3] registered
> [ 57.226056] pci 0000:03:00.0: [1af4:1043] type 00 class 0x078000
> PCIe Endpoint
> [ 57.228576] pci 0000:03:00.0: BAR 1 [mem 0x81600000-0x81600fff]
> [ 57.231435] pci 0000:03:00.0: BAR 4 [mem
> 0x38f800000000-0x38f800003fff 64bit pref]
> [ 57.236846] pci 0000:00:01.2: PCI bridge to [bus 03]
> [ 57.238158] acpiphp: Slot [0-4] registered
> [ 57.238682] pci 0000:04:00.0: [1af4:1042] type 00 class 0x010000
> PCIe Endpoint
> [ 57.241207] pci 0000:04:00.0: BAR 1 [mem 0x81400000-0x81400fff]
> [ 57.245502] pci 0000:04:00.0: BAR 4 [mem
> 0x390000000000-0x390000003fff 64bit pref]
> [ 57.257182] pci 0000:00:01.3: PCI bridge to [bus 04]
> [ 57.258719] acpiphp: Slot [0-5] registered
> [ 57.259247] pci 0000:05:00.0: [1af4:1042] type 00 class 0x010000
> PCIe Endpoint
> [ 57.261991] pci 0000:05:00.0: BAR 1 [mem 0x81200000-0x81200fff]
> [ 57.268026] pci 0000:05:00.0: BAR 4 [mem
> 0x390800000000-0x390800003fff 64bit pref]
> [ 57.270867] pci 0000:00:01.4: PCI bridge to [bus 05]
> [ 57.272119] acpiphp: Slot [0-6] registered
> [ 57.272653] pci 0000:06:00.0: [10de:2330] type 00 class 0x030200
> PCIe Endpoint
> [ 60.512788] pci 0000:06:00.0: BAR 0 [mem
> 0x382002000000-0x382002ffffff 64bit pref]
Now we've gotten to the device and it looks like one...
> [ 63.740786] pci 0000:06:00.0: BAR 2 [mem
> 0x380000000000-0x381fffffffff 64bit pref]
two...
> [ 66.956785] pci 0000:06:00.0: BAR 4 [mem
> 0x382000000000-0x382001ffffff 64bit pref]
three more disables at the command register. So we're really being
bitten that we toggle decode-enable/memory enable around reading each
BAR size. I'd expect though that these latter three would still be
required during hotplug. Do you have similar logs from that operation?
> [ 70.188836] pci 0000:06:00.0: Max Payload Size set to 128 (was 256, max 256)
Looks like there might even be one more before we get here. Thanks,
Alex
Powered by blists - more mailing lists