Message-ID: <CAErSpo5e0GOdVbZdQYriP+aWziKELpb9BWTgJ4ERPia354CM5w@mail.gmail.com>
Date:	Thu, 11 Sep 2014 16:51:25 -0600
From:	Bjorn Helgaas <bhelgaas@...gle.com>
To:	Dirk Gouders <dirk@...ders.net>
Cc:	Yinghai Lu <yinghai@...nel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	Andreas Noever <andreas.noever@...il.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	"linux-pci@...r.kernel.org" <linux-pci@...r.kernel.org>
Subject: Re: [BUG] Bisected Problem with LSI PCI FC Adapter

On Thu, Sep 11, 2014 at 3:24 PM, Dirk Gouders <dirk@...ders.net> wrote:
> Bjorn Helgaas <bhelgaas@...gle.com> writes:
>
>> On Thu, Sep 11, 2014 at 2:33 PM, Dirk Gouders <dirk@...ders.net> wrote:
>>> What I was currently trying was to construct a test-environment so that
>>> I do not need to do tests and diagnosis on a busy machine.
>>>
>>> I noticed that this problem seems to start with the narrow Root
>>> Bridge window (00-07), but every other machine that I looked at
>>> starts with (00-ff), so those will not trigger my problem.
>>>
>>> I thought I could perhaps try to shrink the window in
>>> acpi_pci_root_add() to trigger the problem and that kind of works: it
>>> triggers it but not exactly the same way, because it basically ends at
>>> this code in pci_scan_bridge():
>>>
>>>         if (max >= bus->busn_res.end) {
>>>                 dev_warn(&dev->dev, "can't allocate child bus %02x from %pR (pass %d)\n",
>>>                          max, &bus->busn_res, pass);
>>>                 goto out;
>>>         }
>>>
>>> If this could work but I am just missing a small detail, I would be
>>> glad to hear about it and do the first tests this way.  If it is
>>> complete nonsense, I will just use the machine that triggers the problem
>>> for the tests.
>>
>> I was about to suggest the same thing.  If the problem is related to
>> the bus number change, we should be able to force that to happen on a
>> different machine.  Your approach sounds good, so I'm guessing we just
>> need a tweak.
>>
>> I would first double-check that the PCI adapters are identical,
>> including the firmware on the card.  Can you also include your patch
>> and the resulting dmesg (with debug enabled as before)?
>
> Currently I am at home, just running tests to build an understanding that
> I can hopefully use when I am back in the office.
>
> I already noticed that the backup FC adapter on the test machine is not
> exactly the same: it is Rev. 1 whereas the one on the failing machine is
> Rev. 2.
>
> So, here at home my tests make a NIC disappear.  That differs from the
> original problem, but I was just trying to reconstruct the scenario of a
> misconfigured bridge causing a reconfiguration.
>
> What I was trying is:
>
> diff --git a/drivers/acpi/pci_root.c b/drivers/acpi/pci_root.c
> index e6ae603..fd146b3 100644
> --- a/drivers/acpi/pci_root.c
> +++ b/drivers/acpi/pci_root.c
> @@ -556,6 +556,7 @@ static int acpi_pci_root_add(struct acpi_device *device,
>         strcpy(acpi_device_name(device), ACPI_PCI_ROOT_DEVICE_NAME);
>         strcpy(acpi_device_class(device), ACPI_PCI_ROOT_CLASS);
>         device->driver_data = root;
> +       root->secondary.end = 0x02;
>
>         pr_info(PREFIX "%s [%s] (domain %04x %pR)\n",
>                acpi_device_name(device), acpi_device_bid(device),
>
> The device that disappears is a NIC:
>
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor DRAM Controller (rev 09)
> 00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor Graphics Controller (rev 09)
> 00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller (rev 04)
> 00:16.0 Communication controller: Intel Corporation 7 Series/C210 Series Chipset Family MEI Controller #1 (rev 04)
> 00:1a.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
> 00:1b.0 Audio device: Intel Corporation 7 Series/C210 Series Chipset Family High Definition Audio Controller (rev 04)
> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 1 (rev c4)
> 00:1c.4 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 5 (rev c4)
> 00:1c.5 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI Express Root Port 6 (rev c4)
> 00:1d.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a4)
> 00:1f.0 ISA bridge: Intel Corporation B75 Express Chipset LPC Controller (rev 04)
> 00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family 6-port SATA Controller [AHCI mode] (rev 04)
> 00:1f.3 SMBus: Intel Corporation 7 Series/C210 Series Chipset Family SMBus Controller (rev 04)
> 02:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)
>
> This is the one that is missing with the above change:
> 03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 06)

This situation is a little different, so I don't think you're
reproducing the situation we want to test.  On this box, you have:

    pci_bus 0000:00: root bus resource [bus 00-02]
    pci 0000:00:1c.0: PCI bridge to [bus 01]
    pci 0000:00:1c.4: PCI bridge to [bus 02]

so we find all the devices on bus 00 and bus 02 (there's nothing on
bus 01).  My guess is the 03:00.0 device is normally behind the
00:1c.5 bridge, but we don't even scan behind that bridge because we
can't allocate a secondary bus number for it (we're not smart enough
to take advantage of the empty bus 01).
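
To make the arithmetic concrete, here is a minimal userspace sketch of the
check quoted earlier from pci_scan_bridge().  It is only a model, not kernel
code: struct busn_window and can_allocate_child_bus() are hypothetical names
invented for illustration, and "max" is assumed to be the highest bus number
already handed out when the bridge is reached.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a bus number window such as [bus 00-02]. */
struct busn_window {
	unsigned int start;
	unsigned int end;
};

/*
 * Models the condition quoted above from pci_scan_bridge():
 *
 *	if (max >= bus->busn_res.end) { ... goto out; }
 *
 * Returns true when there is still a bus number above 'max' inside the
 * window, i.e. when the warning path would NOT be taken.
 */
static bool can_allocate_child_bus(const struct busn_window *win,
				   unsigned int max)
{
	assert(win->start <= win->end);
	return max < win->end;
}
```

With the window [bus 00-02] and buses 01 and 02 already claimed by 00:1c.0
and 00:1c.4, max is 02, so the function returns false for 00:1c.5.  With a
window like [bus 00-07] and the same max it returns true.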

On the failing box, it's different because we *do* have unused bus
number space, and we do actually reconfigure the bridge to use it.
It's just that the FC adapter doesn't respond when we use the new bus
number for it.

You might be able to do something similar on the test box by:

  - Keeping your root->secondary.end = 02 patch, so you still have [bus 00-02].
  - Ignoring bridges 00:1c.0 and 00:1c.4.  I would just test for those
    devfns in pci_scan_device() and when you see them, return NULL instead
    of trying to read the vendor ID.
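
A rough sketch of the devfn test, written as a standalone helper so it can be
compiled outside the kernel.  The PCI_DEVFN() definition below uses the same
encoding as the kernel macro of that name; skip_devfn_for_test() is a
hypothetical name, and calling it at the top of pci_scan_device() with
bus->number and devfn (returning NULL when it says true) is my assumption
about how you would wire it in, not tested code.

```c
#include <assert.h>
#include <stdbool.h>

/* Same encoding as the kernel's PCI_DEVFN(): 5-bit slot, 3-bit function. */
#define PCI_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))

/*
 * Hypothetical test-only filter: true for the two root ports on bus 00
 * (00:1c.0 and 00:1c.4) that should be hidden from the scan.
 */
static bool skip_devfn_for_test(unsigned int bus_number, unsigned int devfn)
{
	assert(devfn <= 0xff);

	if (bus_number != 0)
		return false;

	return devfn == PCI_DEVFN(0x1c, 0) || devfn == PCI_DEVFN(0x1c, 4);
}
```

The bus number check matters: without it the same devfn values on any other
bus would be hidden as well.  00:1c.5 is deliberately not matched, so it is
still scanned.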

Then 00:1c.5 is probably configured by the BIOS for [bus 03], but
that's outside the root bridge range, so we should reconfigure it to
use [bus 01].  Then we should scan behind it, and we'll probably
discover the NIC that was previously at 03:00.0.  The device *should*
just work at the new bus number, since it probably doesn't have the
same bug the FC adapter does.
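
The expected outcome can be modelled in a few lines.  This is a simplified
illustration of the reasoning above, not the logic in pci_scan_bridge():
struct bus_range, secondary_fits() and pick_child_bus() are hypothetical
names, and picking "max + 1" as the new bus number is an assumption made for
the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the root bridge window, e.g. [bus 00-02]. */
struct bus_range {
	unsigned int start;
	unsigned int end;
};

/* True when a BIOS-assigned secondary bus number lies below the root bus
 * and inside the window; false means the bridge needs reconfiguring. */
static bool secondary_fits(const struct bus_range *root,
			   unsigned int secondary)
{
	assert(root->start <= root->end);
	return secondary > root->start && secondary <= root->end;
}

/* Assumption: the new child bus is the number right after the highest one
 * in use ('max').  Returns -1 when the window is exhausted. */
static int pick_child_bus(const struct bus_range *root, unsigned int max)
{
	assert(root->start <= root->end);
	if (max >= root->end)
		return -1;
	return (int)(max + 1);
}
```

For the window [bus 00-02], a BIOS setting of bus 03 does not fit, and with
the other two bridges hidden (max still 00) the model picks bus 01.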

Bjorn