Message-ID: <ghbnqkeh9k.fsf@quad.gouders.net>
Date:	Fri, 12 Sep 2014 22:05:43 +0200
From:	Dirk Gouders <dirk@...ders.net>
To:	Yinghai Lu <yinghai@...nel.org>
Cc:	Bjorn Helgaas <bhelgaas@...gle.com>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andreas Noever <andreas.noever@...il.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"linux-pci\@vger.kernel.org" <linux-pci@...r.kernel.org>
Subject: Re: [BUG] Bisected Problem with LSI PCI FC Adapter

Dirk Gouders <dirk@...ders.net> writes:

> Dirk Gouders <dirk@...ders.net> writes:
>
>> Bjorn Helgaas <bhelgaas@...gle.com> writes:
>>
>>> On Thu, Sep 11, 2014 at 3:24 PM, Dirk Gouders <dirk@...ders.net> wrote:
>>>> Bjorn Helgaas <bhelgaas@...gle.com> writes:
>>>>
>>>>> On Thu, Sep 11, 2014 at 2:33 PM, Dirk Gouders <dirk@...ders.net> wrote:
>>>>>> What I was currently trying was to construct a test-environment so that
>>>>>> I do not need to do tests and diagnosis on a busy machine.
>>>>>>
>>>>>> I noticed that this problem seems to start with the narrow Root
>>>>>> Bridge window (00-07) but every other machine that I had a look at,
>>>>>> starts with (00-ff), so those will not trigger my problem.
>>>>>>
>>>>>> I thought I could perhaps try to shrink the window in
>>>>>> acpi_pci_root_add() to trigger the problem and that kind of works: it
>>>>>> triggers it but not exactly the same way, because it basically ends at
>>>>>> this code in pci_scan_bridge():
>>>>>>
>>>>>>         if (max >= bus->busn_res.end) {
>>>>>>                 dev_warn(&dev->dev, "can't allocate child bus %02x from %pR (pass %d)\n",
>>>>>>                          max, &bus->busn_res, pass);
>>>>>>                 goto out;
>>>>>>         }
>>>>>>
>>>>>> If this could work but I am just missing a small detail, I would be
>>>>>> glad to hear about it and do the first tests this way.  If it is
>>>>>> complete nonsense, I will just use the machine that triggers the problem
>>>>>> for the tests.
>>>>>
>>>>> I was about to suggest the same thing.  If the problem is related to
>>>>> the bus number change, we should be able to force that to happen on a
>>>>> different machine.  Your approach sounds good, so I'm guessing we just
>>>>> need a tweak.
>>>>>
>>>>> I would first double-check that the PCI adapters are identical,
>>>>> including the firmware on the card.  Can you also include your patch
>>>>> and the resulting dmesg (with debug enabled as before)?
>>>>
>>>> Currently I am at home, just doing tests to build understanding that I can
>>>> hopefully use when I am back in the office.
>>>>
>>>> I already noticed that the backup FC Adapter on the test machine is not
>>>> exactly the same: it is Rev. 1 whereas the one on the failing machine is
>>>> Rev. 2.
>>>>
>>>> So, here at home my tests let a NIC disappear.  Different from the
>>>> original problem but I was just trying to reconstruct the scenario of a
>>>> misconfigured bridge causing a reconfiguration.
>>>>
>>>> What I was trying is:
>>>>
>>>> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
>>>> index e6ae603..fd146b3 100644
>>>> --- a/drivers/acpi/pci_root.c
>>>> +++ b/drivers/acpi/pci_root.c
>>>> @@ -556,6 +556,7 @@ static int acpi_pci_root_add(struct acpi_device *device,
>>>>         strcpy(acpi_device_name(device), ACPI_PCI_ROOT_DEVICE_NAME);
>>>>         strcpy(acpi_device_class(device), ACPI_PCI_ROOT_CLASS);
>>>>         device->driver_data = root;
>>>> +       root->secondary.end = 0x02;
>>>>
>>>>         pr_info(PREFIX "%s [%s] (domain %04x %pR)\n",
>>>>                acpi_device_name(device), acpi_device_bid(device),
>>>>
>>>> The device that disappears is a NIC:
>>>>
>>>> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
>>>> 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
>>>> 00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
>>>> 00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
>>>> 00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
>>>> 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
>>>> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
>>>> 00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
>>>> 00:1c.5 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 6 (rev c4)
>>>> 00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
>>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
>>>> 00:1f.0 ISA bridge: Intel Corporation B75 Express Chipset LPC Controller (rev 04)
>>>> 00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
>>>> 00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
>>>> 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
>>>>
>>>> This is the one that is missing with the above change:
>>>> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
>>>
>>> This situation is a little different, so I don't think you're
>>> reproducing the situation we want to test.  On this box, you have:
>>>
>>>     pci_bus 0000:00: root bus resource [bus 00-02]
>>>     pci 0000:00:1c.0: PCI bridge to [bus 01]
>>>     pci 0000:00:1c.4: PCI bridge to [bus 02]
>>>
>>> so we find all the devices on bus 00 and bus 02 (there's nothing on
>>> bus 01).  My guess is the 03:00.0 device is normally behind the
>>> 00:1c.5 bridge, but we don't even scan behind that bridge because we
>>> can't allocate a secondary bus number for it (we're not smart enough
>>> to take advantage of the empty bus 01).
>>>
>>> On the failing box, it's different because we *do* have unused bus
>>> number space, and we do actually reconfigure the bridge to use it.
>>> It's just that the FC adapter doesn't respond when we use the new bus
>>> number for it.
>>>
>>> You might be able to do something similar on the test box by:
>>>
>>>   - Keeping your root->secondary.end = 02 patch, so you still have [bus 00-02].
>>>   - Ignoring bridges 00:1c.0 and 00:1c.4.  I would just test for those
>>> devfns in pci_scan_device() and when you see them, return NULL instead
>>> of trying to read the vendor ID.
>>>
>>> Then 00:1c.5 is probably configured by the BIOS for [bus 03], but
>>> that's outside the root bridge range, so we should reconfigure it to
>>> use [bus 01].  Then we should scan behind it, and we'll probably
>>> discover the NIC that was previously at 03:00.0.  The device *should*
>>> just work at the new bus number, since it probably doesn't have the
>>> same bug the FC adapter does.
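
The allocation behaviour described above can be sketched as a small model. This is only an illustration under stated assumptions, not the kernel's pci_scan_bridge(): the function name assign_bus_numbers() and the array layout are hypothetical, and the two passes are reduced to "keep firmware settings that fit the window" and "renumber the rest starting from the highest number in use".

```c
#include <stddef.h>

/*
 * Simplified model of bus number assignment below a root bus.
 * NOT the kernel's pci_scan_bridge(); all names are illustrative.
 *
 * secondary[i] is the bus number firmware programmed into bridge i.
 * window_end is the last bus number of the root bus window.
 * On return, secondary[i] is the number the bridge ends up with,
 * or -1 if no number could be allocated for it.
 * Returns the highest bus number in use.
 */
static int assign_bus_numbers(int *secondary, size_t n, int window_end)
{
	int max = 0;	/* the root bus itself is bus 00 */
	size_t i;

	/* Pass 0: keep firmware settings that fit inside the window. */
	for (i = 0; i < n; i++) {
		if (secondary[i] > 0 && secondary[i] <= window_end &&
		    secondary[i] > max)
			max = secondary[i];
	}

	/* Pass 1: renumber bridges whose setting lies outside the window. */
	for (i = 0; i < n; i++) {
		if (secondary[i] > 0 && secondary[i] <= window_end)
			continue;
		if (max >= window_end)
			secondary[i] = -1;	/* "can't allocate child bus" */
		else
			secondary[i] = ++max;
	}
	return max;
}
```

With a window ending at 02 and bridges set to {01, 02, 03}, the third bridge gets -1 in this model, even though nothing sits on bus 01. With the first two bridges ignored, the remaining bridge set to 03 is renumbered to 01.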
>>
>> Thanks for the explanation.  I tried to ignore the two bridges but the
>> machine stopped with the "reconfiguring" message.
>>
>> Anyway, if I understood you correctly with the backup FC adapter I have
>> good chances, because there is the needed unused bus number space and I
>> don't have to ignore bridges.  I will test in a few hours and report.
>
> It turned out that there is no unused bus number space on the test
> machine:
>
> -[0000:00]-+-01.0-[01-02]--+-0d.0-[02]--
>            |               +-0e.0  Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
>            |               \-0e.1  Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
>            +-02.0  Broadcom BCM5785 [HT1000] Legacy South Bridge
>            +-02.1  Broadcom BCM5785 [HT1000] IDE
>            +-02.2  Broadcom BCM5785 [HT1000] LPC
>            +-03.0  Broadcom BCM5785 [HT1000] USB
>            +-03.1  Broadcom BCM5785 [HT1000] USB
>            +-03.2  Broadcom BCM5785 [HT1000] USB
>            +-06.0-[03]--
>            +-07.0-[04]--+-04.0  Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet
>            |            \-04.1  Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet
>            +-08.0-[05]--
>            +-09.0-[06]--
>            +-0a.0-[07]----00.0  LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter
>            +-0b.0-[08]--
>            +-0c.0  Advanced Micro Devices, Inc. [AMD/ATI] ES1000
>            +-18.0  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
>            +-18.1  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Address Map
>            +-18.2  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] DRAM Controller
>            +-18.3  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
>            +-19.0  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
>            +-19.1  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Address Map
>            +-19.2  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] DRAM Controller
>            \-19.3  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
>
> So, I used the following changes to create the space and limit the root
> bridge window:
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index e6ae603..a41066b 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -556,6 +556,7 @@ static int acpi_pci_root_add(struct acpi_device *device,
>  	strcpy(acpi_device_name(device), ACPI_PCI_ROOT_DEVICE_NAME);
>  	strcpy(acpi_device_class(device), ACPI_PCI_ROOT_CLASS);
>  	device->driver_data = root;
> +	root->secondary.end = 0x06;
>  
>  	pr_info(PREFIX "%s [%s] (domain %04x %pR)\n",
>  	       acpi_device_name(device), acpi_device_bid(device),
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index e3cf8a2..d8dc02d 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1323,6 +1323,9 @@ static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
>  	struct pci_dev *dev;
>  	u32 l;
>  
> +	if (devfn == PCI_DEVFN(0x08, 0) || devfn == PCI_DEVFN(0x09,0))
> +		return NULL;
> +
>  	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
>  		return NULL;
>  
> And to me it looks as if that is the situation I need for testing
> Yinghai's patches; the FC Adapter disappeared similar to that on the
> other machine (find the dmesg output at the end):
>
> -[0000:00]-+-01.0-[01-02]--+-0d.0-[02]--
>            |               +-0e.0  Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
>            |               \-0e.1  Broadcom BCM5785 [HT1000] SATA (PATA/IDE Mode)
>            +-02.0  Broadcom BCM5785 [HT1000] Legacy South Bridge
>            +-02.1  Broadcom BCM5785 [HT1000] IDE
>            +-02.2  Broadcom BCM5785 [HT1000] LPC
>            +-03.0  Broadcom BCM5785 [HT1000] USB
>            +-03.1  Broadcom BCM5785 [HT1000] USB
>            +-03.2  Broadcom BCM5785 [HT1000] USB
>            +-06.0-[03]--
>            +-07.0-[04]--+-04.0  Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet
>            |            \-04.1  Broadcom Corporation NetXtreme BCM5780 Gigabit Ethernet
>            +-0a.0-[05]--
>            +-0b.0-[06]--
>            +-0c.0  Advanced Micro Devices, Inc. [AMD/ATI] ES1000
>            +-18.0  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
>            +-18.1  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Address Map
>            +-18.2  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] DRAM Controller
>            +-18.3  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
>            +-19.0  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
>            +-19.1  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Address Map
>            +-19.2  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] DRAM Controller
>            \-19.3  Advanced Micro Devices, Inc. [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
>
>
> So, I wanted to test the patches but unfortunately, the patches do not
> apply cleanly; git am refuses to take the files (unknown format) and git
> apply gives errors. I tried to apply them against Linus' current, v3.16,
> v3.15 and v3.14.  What might I be doing wrong, Yinghai?
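
For reference, the devfn test in the quoted probe.c hunk relies on the usual slot/function packing: five bits of slot above three bits of function. The sketch below reproduces that encoding in user space. The macro names DEVFN, DEVFN_SLOT and DEVFN_FUNC and the helper is_skipped_bridge() are hypothetical stand-ins for illustration; only the bit layout is taken from the kernel's PCI_DEVFN(), PCI_SLOT() and PCI_FUNC().

```c
#include <stdint.h>

/* Same bit layout as the kernel's PCI_DEVFN(), PCI_SLOT() and PCI_FUNC(). */
#define DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))
#define DEVFN_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define DEVFN_FUNC(devfn)	((devfn) & 0x07)

/* Hypothetical helper mirroring the condition in the quoted hunk. */
static int is_skipped_bridge(uint8_t devfn)
{
	return devfn == DEVFN(0x08, 0) || devfn == DEVFN(0x09, 0);
}
```

Note that a devfn carries no bus number, so a check like this matches 08.0 and 09.0 on every bus that is scanned, not only on bus 00. On the test machine shown above no other bus has devices in those slots, so that appears harmless here; a stricter hack might also compare bus->number.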

Yinghai,

I have now applied your patches and tested them:

echo 1 > /sys/bus/pci/devices/0000\:00\:0a.0/pcie_link_disable
echo 0 > /sys/bus/pci/devices/0000\:00\:0a.0/pcie_link_disable

The FC Adapter did not show up; this is the dmesg output (I also tested
with 0b.0):

[  143.294168] pcieport 0000:00:0a.0: pcie_link_disable_set: lnk_ctrl = 18
[  148.284456] pcieport 0000:00:0a.0: pcie_link_disable_set: lnk_ctrl = 8
[  304.065942] pcieport 0000:00:0b.0: pcie_link_disable_set: lnk_ctrl = 18
[  309.035278] pcieport 0000:00:0b.0: pcie_link_disable_set: lnk_ctrl = 8
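
A note on reading these values, assuming lnk_ctrl is printed in hexadecimal (the format string lives in the attached patches and is not shown here): 0x18 and 0x08 differ only in bit 4, which is the Link Disable bit of the PCIe Link Control register, so the writes themselves seem to have taken effect on both ports. A minimal decoding sketch follows; the LNKCTL_* names and link_is_disabled() are local to this example, with bit values matching PCI_EXP_LNKCTL_* in the kernel's pci_regs.h.

```c
#include <stdint.h>

/* Link Control register bits (values as in the kernel's pci_regs.h). */
#define LNKCTL_ASPMC	0x0003	/* ASPM Control */
#define LNKCTL_RCB	0x0008	/* Read Completion Boundary */
#define LNKCTL_LD	0x0010	/* Link Disable */

/* Returns nonzero if the Link Disable bit is set in lnk_ctrl. */
static int link_is_disabled(uint16_t lnk_ctrl)
{
	return (lnk_ctrl & LNKCTL_LD) != 0;
}
```

Under that assumption, link_is_disabled(0x18) is true and link_is_disabled(0x08) is false, which matches the echo 1 / echo 0 sequence above.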

So, probably my test environment does not work as expected and I have to
test it on the failing machine.

I hope I applied the patches correctly; I have attached them to this mail.

Dirk


View attachment "0001-PCI-Add-generic-pcie_link_disable.patch" of type "text/plain" (2762 bytes)

View attachment "0002-PCI-pciehp-Use-generic-pcie_link_disable.patch" of type "text/x-diff" (2111 bytes)

View attachment "0003-PCI-sysfs-Add-pcie-attrs-for-pcie-device-under-pci-d.patch" of type "text/plain" (2736 bytes)

View attachment "0004-PCI-Add-link_disable-in-sysfs-for-pcie-device.patch" of type "text/plain" (2292 bytes)
