Message-ID:
<SN6PR02MB4157E31D8448CD9D81D518C5D451A@SN6PR02MB4157.namprd02.prod.outlook.com>
Date: Thu, 17 Jul 2025 23:06:39 +0000
From: Michael Kelley <mhklinux@...look.com>
To: "dan.j.williams@...el.com" <dan.j.williams@...el.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>
CC: "linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>, "lukas@...ner.de"
<lukas@...ner.de>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "Jonathan.Cameron@...wei.com"
<Jonathan.Cameron@...wei.com>, Suzuki K Poulose <suzuki.poulose@....com>,
Lorenzo Pieralisi <lpieralisi@...nel.org>, Rob Herring <robh@...nel.org>, "K.
Y. Srinivasan" <kys@...rosoft.com>, Haiyang Zhang <haiyangz@...rosoft.com>,
Wei Liu <wei.liu@...nel.org>, Dexuan Cui <decui@...rosoft.com>, "open
list:Hyper-V/Azure CORE AND DRIVERS" <linux-hyperv@...r.kernel.org>
Subject: RE: [PATCH 2/3] PCI: Enable host bridge emulation for
PCI_DOMAINS_GENERIC platforms
From: dan.j.williams@...el.com <dan.j.williams@...el.com> Sent: Thursday, July 17, 2025 12:59 PM
>
> Michael Kelley wrote:
> > From: Dan Williams <dan.j.williams@...el.com> Sent: Wednesday, July 16, 2025 9:09 AM
>
> Thanks for taking a look Michael!
>
> [..]
> > > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > > index e9448d55113b..833ebf2d5213 100644
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -6692,9 +6692,50 @@ static void pci_no_domains(void)
> > >  #endif
> > >  }
> > >
> > > +#ifdef CONFIG_PCI_DOMAINS
> > > +static DEFINE_IDA(pci_domain_nr_dynamic_ida);
> > > +
> > > +/*
> > > + * Find a free domain_nr either allocated by pci_domain_nr_dynamic_ida or
> > > + * fallback to the first free domain number above the last ACPI segment number.
> > > + * Caller may have a specific domain number in mind, in which case try to
> > > + * reserve it.
> > > + *
> > > + * Note that this allocation is freed by pci_release_host_bridge_dev().
> > > + */
> > > +int pci_bus_find_emul_domain_nr(int hint)
> > > +{
> > > +	if (hint >= 0) {
> > > +		hint = ida_alloc_range(&pci_domain_nr_dynamic_ida, hint, hint,
> > > +				       GFP_KERNEL);
> >
> > This almost preserves the existing functionality in pci-hyperv.c. But if the
> > "hint" passed in is zero, current code in pci-hyperv.c treats that as a
> > collision and allocates some other value. The special treatment of zero is
> > necessary per the comment with the definition of HVPCI_DOM_INVALID.
> >
> > I don't have an opinion on whether the code here should treat a "hint"
> > of zero as invalid, or whether that should be handled in pci-hyperv.c.
>
> Oh, I see what you are saying. I made the "hint == 0" case start working
> where previously it should have failed. I feel like that's probably best
> handled in pci-hyperv.c with something like the following, which also
> fixes up a regression I caused with @dom being unsigned:
>
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index cfe9806bdbe4..813757db98d1 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3642,9 +3642,9 @@ static int hv_pci_probe(struct hv_device *hdev,
>  {
>  	struct pci_host_bridge *bridge;
>  	struct hv_pcibus_device *hbus;
> -	u16 dom_req, dom;
> +	int ret, dom = -EINVAL;
> +	u16 dom_req;
>  	char *name;
> -	int ret;
> 
>  	bridge = devm_pci_alloc_host_bridge(&hdev->device, 0);
>  	if (!bridge)
> @@ -3673,7 +3673,8 @@ static int hv_pci_probe(struct hv_device *hdev,
>  	 * collisions) in the same VM.
>  	 */
>  	dom_req = hdev->dev_instance.b[5] << 8 | hdev->dev_instance.b[4];
> -	dom = pci_bus_find_emul_domain_nr(dom_req);
> +	if (dom_req)
> +		dom = pci_bus_find_emul_domain_nr(dom_req);

No, I don't think this is right either. If dom_req is 0, we don't want
hv_pci_probe() to fail. We want the "collision" path to be taken so that
some other unused PCI domain ID is assigned. That could be done by
passing -1 as the hint to pci_bus_find_emul_domain_nr(). Or PCI
domain ID 0 could be pre-reserved in init_hv_pci_drv(), as is done
with HVPCI_DOM_INVALID in current code. Both options are sketched below.
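For illustration, untested sketches of the two options against the API in
this patch (exact placement is my assumption):

	/*
	 * Option A, in hv_pci_probe(): a dom_req of 0 is not a usable
	 * domain ID (see the HVPCI_DOM_INVALID comment), so treat it
	 * as "no hint" and let the allocator pick any free domain ID.
	 */
	dom = pci_bus_find_emul_domain_nr(dom_req ? dom_req : -1);

	/*
	 * Option B, in init_hv_pci_drv(): permanently reserve domain 0
	 * (intentionally never freed), so that a later hinted request
	 * for 0 collides and takes the fallback path.
	 */
	(void)pci_bus_find_emul_domain_nr(0);
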
>
>  	if (dom < 0) {
>  		dev_err(&hdev->device,
>
> > > +
> > > + if (hint >= 0)
> > > + return hint;
> > > + }
> > > +
> > > + if (acpi_disabled)
> > > + return ida_alloc(&pci_domain_nr_dynamic_ida, GFP_KERNEL);
> > > +
> > > + /*
> > > + * Emulated domains start at 0x10000 to not clash with ACPI _SEG
> > > + * domains. Per ACPI r6.0, sec 6.5.6, _SEG returns an integer, of
> > > + * which the lower 16 bits are the PCI Segment Group (domain) number.
> > > + * Other bits are currently reserved.
> > > + */
> >
> > Back in 2018 and 2019, the Microsoft Linux team encountered problems with
> > PCI domain IDs that exceeded 0xFFFF. User space code, such as the Xorg X server,
> > assumed PCI domain IDs were at most 16 bits, and retained only the low 16 bits
> > if the value was larger. My memory of the details is vague, but I believe some
> > or all of this behavior was tied to libpciaccess. As a result of these user space
> > limitations, the pci-hyperv.c code made sure to not create any domain IDs
> > larger than 0xFFFF. The problem was not just theoretical -- Microsoft had
> > customers reporting issues due to the "32-bit domain ID problem" and the
> > pci-hyperv.c code was updated to avoid it.
> >
> > I don't have information on whether user space code has been fixed, or
> > the extent to which such a fix has propagated into distro versions. At the
> > least, a VM with old user space code might break if the kernel is upgraded
> > to one with this patch. How do people see the risks now that it is 6 years
> > later? I don't have enough data to make an assessment.
>
> A couple observations:
>
> - I think it would be reasonable to not fall back in the hint case with
> something like this:

We *do* need the fallback in the hint case. If the hint causes a collision
(i.e., another device is already using the hinted PCI domain ID), then we
need to choose some other PCI domain ID. Again, we don't want hv_pci_probe()
to fail for the device just because bytes 4 and 5 of the device's GUID
(as assigned by Hyper-V) accidentally match bytes 4 and 5 of some other
device's GUID. Hyper-V guarantees the GUIDs are unique, but not bytes 4 and
5 standing alone. Current code behaves like the acpi_disabled case in your
patch, and picks some other unused PCI domain ID in the 1 to 0xFFFF range.
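In other words, roughly this shape (a sketch only; the 1 to 0xFFFF
fallback range mirrors what pci-hyperv.c does today):

int pci_bus_find_emul_domain_nr(int hint)
{
	int ret;

	if (hint >= 0) {
		ret = ida_alloc_range(&pci_domain_nr_dynamic_ida, hint, hint,
				      GFP_KERNEL);
		if (ret >= 0)
			return ret;
		/*
		 * Collision: the hinted domain ID is already in use,
		 * so fall back to any other unused 16-bit domain ID.
		 */
		return ida_alloc_range(&pci_domain_nr_dynamic_ida, 1, 0xFFFF,
				       GFP_KERNEL);
	}

	if (acpi_disabled)
		return ida_alloc(&pci_domain_nr_dynamic_ida, GFP_KERNEL);

	/* ... 0x10000-and-up allocation for the ACPI case, as in the patch */
}
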
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 833ebf2d5213..0bd2053dbe8a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -6705,14 +6705,10 @@ static DEFINE_IDA(pci_domain_nr_dynamic_ida);
>   */
>  int pci_bus_find_emul_domain_nr(int hint)
>  {
> -	if (hint >= 0) {
> -		hint = ida_alloc_range(&pci_domain_nr_dynamic_ida, hint, hint,
> +	if (hint >= 0)
> +		return ida_alloc_range(&pci_domain_nr_dynamic_ida, hint, hint,
>  				       GFP_KERNEL);
> 
> -		if (hint >= 0)
> -			return hint;
> -	}
> -
>  	if (acpi_disabled)
>  		return ida_alloc(&pci_domain_nr_dynamic_ida, GFP_KERNEL);
>
> - The VMD driver has been allocating 32-bit PCI domain numbers since
> v4.5 185a383ada2e ("x86/PCI: Add driver for Intel Volume Management
> Device (VMD)"). At a minimum, if it is still a problem, it is a shared
> problem, but the significant deployment of VMD in that time likely
> indicates it is ok. If not, the above change at least makes the
> Hyper-V case avoid 32-bit domain numbers.

The problem we encountered in 2018/2019 was with graphics devices
and the Xorg X Server, specifically with the PCI domain ID stored in
xorg.conf to identify the graphics device that the X Server was to run
against. I don't recall ever seeing a similar problem with storage or NIC
devices, but my memory could be incomplete. It's plausible that user
space code accessing the VMD device correctly handled 32-bit domain
IDs, but that's not necessarily an indicator for user space graphics
software. The Xorg X Server issues would have started somewhere after
commit 4a9b0933bdfc in the 4.11 kernel, and were finally fixed in the 5.4
kernel with commits be700103efd10 and f73f8a504e279.

All that said, I'm not personally averse to trying again to assign
domain IDs > 0xFFFF. I do see a commit [1] to fix libpciaccess that was
made 7 years ago in response to the issues we were seeing on Hyper-V.
Assuming those fixes have propagated into the packages that use
libpciaccess, such as the X Server, then we're good. But someone from
Microsoft should probably sign off on taking this risk. I retired from
Microsoft nearly two years ago, and meddle in things from time to time
without the burden of dealing with customer support issues. ;-)

[1] https://gitlab.freedesktop.org/xorg/lib/libpciaccess/-/commit/a167bd6474522a709ff3cbb00476c0e4309cb66f