[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250422200857.GA381276@bhelgaas>
Date: Tue, 22 Apr 2025 15:08:57 -0500
From: Bjorn Helgaas <helgaas@...nel.org>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Cc: Dan Williams <dan.j.williams@...el.com>, Lukas Wunner <lukas@...ner.de>,
Bjorn Helgaas <bhelgaas@...gle.com>,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
"H. Peter Anvin" <hpa@...or.com>, linux-pci@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2 1/1] PCI: Add Extended Tag + MRRS quirk for Xeon 6
On Tue, Apr 22, 2025 at 04:02:07PM +0300, Ilpo Järvinen wrote:
> When bifurcated to x2, Xeon 6 Root Port performance is sensitive to the
> configuration of Extended Tags, Max Read Request Size (MRRS), and 10-Bit
> Tag Requester (note: there is currently no 10-Bit Tag support in the
> kernel). While those can be configured to the recommended values by FW,
> kernel may decide to overwrite the initial values.
>
> Unfortunately, there is no mechanism for FW to indicate OS which parts
> of PCIe configuration should not be altered. Thus, the only option is
> to add such logic into the kernel as quirks.
>
> There is a pre-existing quirk flag to disable Extended Tags. Depending
> on CONFIG_PCIE_BUS_* setting, MRRS may be overwritten by what the
> kernel thinks is the best for performance (the largest supported
> value), resulting in performance degradation instead with these Root
> Ports. (There would have been a pre-existing quirk to disallow
> increasing MRRS but it is not identical to rejecting >128B MRRS.)
>
> Add a quirk that disallows enabling Extended Tags and setting MRRS
> larger than 128B for devices under Xeon 6 Root Ports if the Root Port is
> bifurcated to x2. Reject >128B MRRS only when it is going to be written
> by the kernel (this assumes FW configured a good initial value for MRRS
> in case the kernel is not touching MRRS at all).
>
> It was first attempted to always write MRRS when the quirk is needed
> (always overwrite the initial value). That turned out to be quite
> invasive change, however, given the complexity of the initial setup
> callchain and various stages returning early when they decide no changes
> are necessary, requiring override each. As such, the initial value for
> MRRS is now left into the hands of FW.
>
> Link: https://cdrdv2.intel.com/v1/dl/getContent/837176
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
> ---
v1/rfc: https://lore.kernel.org/all/20250304135108.2599-1-ilpo.jarvinen@linux.intel.com/
I suppose it's quixotic to hope for anything better than quirks that
have to be added or updated for every new processor that comes along.
ACPI _HPX might be a possible way for the platform to tell us what to
do here. ACPI r6.5, sec 6.2.9 says it's for hot-added devices and
"Functions not configured by the platform firmware during initial
system boot" (how are we supposed to determine that?) In any case,
Linux does evaluate _HPX for every device in pci_configure_device().
I'm not sure _HPX really works; it's very general, and I would expect
to see reports of problems if firmware really tried to use it.
Or, I guess a _DSM function would be a possible way to communicate
this.
> v2:
> - Explain in changelog why FW cannot solve this on its own
> - Moved the quirk under arch/x86/pci/
> - Don't NULL check value from pci_find_host_bridge()
> - Added comment above the quirk about the performance degradation
> - Removed all setup chain 128B quirk overrides expect for MRRS write
> itself (assumes a sane initial value is set by FW)
>
> arch/x86/pci/fixup.c | 30 ++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 15 ++++++++-------
> include/linux/pci.h | 1 +
> 3 files changed, 39 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/pci/fixup.c b/arch/x86/pci/fixup.c
> index efefeb82ab61..aa9617bc4b55 100644
> --- a/arch/x86/pci/fixup.c
> +++ b/arch/x86/pci/fixup.c
> @@ -294,6 +294,36 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PB1, pcie_r
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PC, pcie_rootport_aspm_quirk);
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_MCH_PC1, pcie_rootport_aspm_quirk);
>
> +/*
> + * PCIe devices underneath Xeon6 PCIe Root Port bifurcated to 2x have slower
> + * performance with Extended Tags and MRRS > 128B. Workaround the performance
> + * problems by disabling Extended Tags and limiting MRRS to 128B.
> + *
> + * https://cdrdv2.intel.com/v1/dl/getContent/837176
> + */
> +static void quirk_pcie2x_no_tags_no_mrrs(struct pci_dev *pdev)
> +{
> + struct pci_host_bridge *bridge = pci_find_host_bridge(pdev->bus);
> + u32 linkcap;
> +
> + pcie_capability_read_dword(pdev, PCI_EXP_LNKCAP, &linkcap);
> + if (FIELD_GET(PCI_EXP_LNKCAP_MLW, linkcap) != 0x2)
> + return;
> +
> + bridge->no_ext_tags = 1;
> + bridge->only_128b_mrrs = 1;
> + pci_info(pdev, "Disabling Extended Tags and limiting MRRS to 128B (performance reasons due to 2x PCIe link)\n");
> +}
> +
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db0, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db1, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db2, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db3, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db6, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db7, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db8, quirk_pcie2x_no_tags_no_mrrs);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, 0x0db9, quirk_pcie2x_no_tags_no_mrrs);
> +
> /*
> * Fixup to mark boot BIOS video selected by BIOS before it changes
> *
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 4d7c9f64ea24..2ca9cb30fbd3 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5941,7 +5941,7 @@ EXPORT_SYMBOL(pcie_get_readrq);
> int pcie_set_readrq(struct pci_dev *dev, int rq)
> {
> u16 v;
> - int ret;
> + int ret, max_mrrs = 4096;
> struct pci_host_bridge *bridge = pci_find_host_bridge(dev->bus);
>
> if (rq < 128 || rq > 4096 || !is_power_of_2(rq))
> @@ -5961,13 +5961,14 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
>
> v = FIELD_PREP(PCI_EXP_DEVCTL_READRQ, ffs(rq) - 8);
>
> - if (bridge->no_inc_mrrs) {
> - int max_mrrs = pcie_get_readrq(dev);
> + if (bridge->no_inc_mrrs)
> + max_mrrs = pcie_get_readrq(dev);
> + if (bridge->only_128b_mrrs)
> + max_mrrs = 128;
>
> - if (rq > max_mrrs) {
> - pci_info(dev, "can't set Max_Read_Request_Size to %d; max is %d\n", rq, max_mrrs);
> - return -EINVAL;
> - }
> + if (rq > max_mrrs) {
> + pci_info(dev, "can't set Max_Read_Request_Size to %d; max is %d\n", rq, max_mrrs);
> + return -EINVAL;
> }
>
> ret = pcie_capability_clear_and_set_word(dev, PCI_EXP_DEVCTL,
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e8e3fd77e96..6dc7a05f4d4b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -603,6 +603,7 @@ struct pci_host_bridge {
> unsigned int ignore_reset_delay:1; /* For entire hierarchy */
> unsigned int no_ext_tags:1; /* No Extended Tags */
> unsigned int no_inc_mrrs:1; /* No Increase MRRS */
> + unsigned int only_128b_mrrs:1; /* Only 128B MRRS */
> unsigned int native_aer:1; /* OS may use PCIe AER */
> unsigned int native_pcie_hotplug:1; /* OS may use PCIe hotplug */
> unsigned int native_shpc_hotplug:1; /* OS may use SHPC hotplug */
>
> base-commit: 0af2f6be1b4281385b618cb86ad946eded089ac8
> --
> 2.39.5
>
Powered by blists - more mailing lists