lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAMaF-rOBgHvZudOqrBu_LT+MQJ4sejpNHEL-JeCsP4p54ZnJeA@mail.gmail.com>
Date:	Fri, 30 Sep 2011 10:35:40 -0500
From:	Jon Mason <mason@...i.com>
To:	Bjorn Helgaas <bhelgaas@...gle.com>
Cc:	Avi Kivity <avi@...hat.com>, Sven Schnelle <svens@...ckframe.org>,
	Simon Kirby <sim@...tway.ca>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Niels Ole Salscheider <niels_ole@...scheider-online.de>,
	Jesse Barnes <jbarnes@...tuousgeek.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	linux-kernel <linux-kernel@...r.kernel.org>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>,
	Ben Hutchings <bhutchings@...arflare.com>
Subject: Re: Workaround for Intel MPS errata

On Fri, Sep 30, 2011 at 12:01 AM, Bjorn Helgaas <bhelgaas@...gle.com> wrote:
> On Thu, Sep 29, 2011 at 6:16 PM, Jon Mason <mason@...i.com> wrote:
>> Hey Avi,
>> Can you try this patch?  It should resolve the issue you are seeing.
>>
>> Thanks,
>> Jon
>>
>>    PCI: Workaround for Intel MPS errata
>>
>>    Intel 5000 and 5100 series memory controllers have a known issue if read
>>    completion coalescing is enabled (the default setting) and the PCI-E
>>    Maximum Payload Size is set to 256B.  To work around this issue, disable
>>    read completion coalescing if the MPS is 256B.
>
> I'd much rather see this done as an early quirk so it doesn't clutter probe.c.
>
> I don't know how you decide whether
>    - no coalescing with MPS=256, or
>    - coalescing with MPS=128
> is better.  I suspect that having a quirk that doesn't change the
> setting, but merely limits MPS to 128 if the BIOS enabled coalescing,
> would be simplest and would stay in the best-tested chipset
> configuration.

This is what I was debating yesterday.  Is it better to disable
coalescing and get better throughput (which could be a net negative if
the MPS isn't 256) or never allow it to be greater than 128?  There is
no way of knowing at quirk time if the disable is necessary or not,
only when setting the MPS is it known (which is why I did it this
way).  I could, as you suggest, simply read the bit and see if it is
enabled by the BIOS (which I'd bet it is every single time), and then
limit the MPS to 128 as a quirk.  This would be fairly simple to do.
However, the errata from Intel says Windows 2008 always disables the
coalescing and sets the MPS to 256B.  With this known, Linux's I/O
performance would be less than Windows on these systems.  While it
might not matter to some of us, I imagine the distros will care.  I'll
defer to your judgement in this matter.

> If you do end up changing the coalescing setting, I think you should
> check the current setting, then log something in dmesg if you change
> it.  If the quirk limits the max MPS, I think dmesg should reflect
> that, too.

Good point, this should be made more obvious to the user.

> I think you're missing a pci_dev_put().

Good catch.

> Even if we work out these issues and Avi reports that it fixes his
> hang, I'm still concerned about MPS configuration being enabled by
> default in 3.1.  It feels like we happened to trip over a few issues
> and fix them, but I'm worried that with wider testing, we'll find
> more.
>
> I'd feel more comfortable if everything were disabled in 3.1 (it'd be
> OK to have a command-line flag to enable it), then turned on by
> default early in the 3.2 merge window.

Okay, I'll create a patch to do this as well.

>
> Bjorn
>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index a919db2..13c733a 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -1361,6 +1361,80 @@ static int pcie_find_smpss(struct pci_dev *dev, void *data)
>>        return 0;
>>  }
>>
>> +static void pcie_errata_check(int mps)
>> +{
>> +       struct pci_dev *dev = NULL;
>> +       static bool done = false;
>> +
>> +       if (done)
>> +               return;
>> +
>> +       /* Intel 5000 and 5100 Memory controllers have an errata with read
>> +        * completion coalescing (which is enabled by default) and MPS of 256B.
>> +        */
>> +       /* 5000X Chipset Memory Controller Hub */
>> +       dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x25C0, NULL);
>> +       if (dev)
>> +               goto fixup;
>> +
>> +       /* 5000Z Chipset Memory Controller Hub */
>> +       dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x25D0, NULL);
>> +       if (dev)
>> +               goto fixup;
>> +
>> +       /* 5000V Chipset Memory Controller Hub */
>> +       dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x25D4, NULL);
>> +       if (dev)
>> +               goto fixup;
>> +
>> +       /* 5000P Chipset Memory Controller Hub */
>> +       dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x25D8, NULL);
>> +       if (dev)
>> +               goto fixup;
>> +
>> +       /* 5100 Chipset Memory Controller Hub */
>> +       dev = pci_get_device(PCI_VENDOR_ID_INTEL, 0x65C0, NULL);
>> +       if (dev)
>> +               goto fixup;
>> +
>> +       /* No Intel 5000 or 5100 Memory controller in the system, no need to
>> +        * check again
>> +        */
>> +       if (!dev) {
>> +               done = true;
>> +               return;
>> +       }
>> +
>> +fixup:
>> +       /* Disable read completion coalescing to allow an MPS of 256 */
>> +       if (mps == 256) {
>> +               int err;
>> +               u16 rcc;
>> +
>> +               /* Intel errata specifies bits to change but does not say what
>> +                * they are.  Keeping them magical until such time as the
>> +                * registers and values can be explained.
>> +                */
>> +               err = pci_read_config_word(dev, 0x48, &rcc);
>> +               if (err) {
>> +                       dev_err(&dev->dev, "Error attempting to read the read "
>> +                               "completion coalescing register.\n");
>> +                       return;
>> +               }
>> +
>> +               rcc &= ~(1 << 10);
>> +
>> +               err = pci_write_config_word(dev, 0x48, rcc);
>> +               if (err) {
>> +                       dev_err(&dev->dev, "Error attempting to read the read "
>> +                               "completion coalescing register.\n");
>> +                       return;
>> +               }
>> +
>> +               done = true;
>> +       }
>> +}
>> +
>>  static void pcie_write_mps(struct pci_dev *dev, int mps)
>>  {
>>        int rc;
>> @@ -1384,6 +1458,8 @@ static void pcie_write_mps(struct pci_dev *dev, int mps)
>>                        mps = min(mps, pcie_get_mps(dev->bus->self));
>>        }
>>
>> +       pcie_errata_check(mps);
>> +
>>        rc = pcie_set_mps(dev, mps);
>>        if (rc)
>>                dev_err(&dev->dev, "Failed attempting to set the MPS\n");
>> @@ -1445,7 +1521,7 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
>>        return 0;
>>  }
>>
>> -/* pcie_bus_configure_mps requires that pci_walk_bus work in a top-down,
>> +/* pcie_bus_configure_settings requires that pci_walk_bus work in a top-down,
>>  * parents then children fashion.  If this changes, then this code will not
>>  * work as designed.
>>  */
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ