linux-kernel - Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload Size on fabric"


Message-ID: <CAMaF-rOrA==mN-BaPVr8SB14XgXZWd0wMb0WP=-Xqzf7-w73Ag@mail.gmail.com>
Date:	Wed, 7 Sep 2011 13:57:28 -0700
From:	Jon Mason <mason@...i.com>
To:	Simon Kirby <sim@...tway.ca>
Cc:	Jesse Barnes <jbarnes@...tuousgeek.org>,
	Josh Boyer <jwboyer@...il.com>,
	Sven Schnelle <svens@...ckframe.org>,
	linux-kernel@...r.kernel.org, Jordan_Hargrave@...l.com
Subject: Re: [3.1-rc4] Bus Fatal Error caused by "PCI: Set PCI-E Max Payload
 Size on fabric"

On Wed, Sep 7, 2011 at 1:47 PM, Simon Kirby <sim@...tway.ca> wrote:
> On Wed, Sep 07, 2011 at 12:18:59PM -0700, Simon Kirby wrote:
>
>> On Wed, Sep 07, 2011 at 10:44:32AM -0700, Jesse Barnes wrote:
>>
>> > On Wed, 7 Sep 2011 12:52:25 -0400
>> > Josh Boyer <jwboyer@...il.com> wrote:
>> >
>> > > On Wed, Sep 7, 2011 at 12:22 PM, Sven Schnelle <svens@...ckframe.org>
>> > > wrote:
>> > > > Simon Kirby <sim@...tway.ca> writes:
>> > > >
>> > > >> Hello!
>> > > >>
>> > > >> Since trying 3.1-rc4 on a few Dell servers, all of them have
>> > > >> booted up with the amber error LED lit. "ipmitool sel list" shows:
>> > > >>
>> > > >>    1 | 09/06/2011 | 17:21:56 | Event Logging Disabled #0x72 | Log area reset/cleared | Asserted
>> > > >>    2 | 09/06/2011 | 17:25:38 | Critical Interrupt #0x18 | Bus Fatal Error | Asserted
>> > > >>    3 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> > > >>    4 | 09/06/2011 | 17:25:38 | Unknown #0x1a |
>> > > >
>> > > > I'm seeing exactly the same issue on a Dell 1950 server. If anyone
>> > > > wants me to try additional debugging/patches, feel free to send
>> > > > them. Unfortunately I don't have the time/knowledge to debug this
>> > > > myself.
>> > >
>> > > I thought Jesse or Jon had a revert or partial fix queued up to send
>> > > to Linus, but I don't see anything in or post -rc5 yet.  That was
>> > > indicated in https://bugzilla.kernel.org/show_bug.cgi?id=42162
>> > >
>> > > Jesse, Jon?
>> >
>> > kernel.org is still down and I haven't pushed anything to github.  I
>> > asked Jon to send his patch directly to Linus today instead.
>>
>> FWIW, this patch didn't seem to fix it:
>> https://bugzilla.kernel.org/attachment.cgi?id=71222
>>
>> dmesg used to say:
>>
>> pci 0000:00:02.0: Dev MPS 128 MPSS 256 MRRS 128
>> pci 0000:00:02.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:06:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> pci 0000:06:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:07:00.0: Dev MPS 128 MPSS 256 MRRS 4096
>> pci 0000:07:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> pci 0000:08:00.0: Dev MPS 128 MPSS 128 MRRS 128
>> pci 0000:08:00.0: MPS configured higher than maximum supported by the device.  If a bus issue occurs, try running with pci=pcie_bus_safe.
>> pci 0000:08:00.0: Dev MPS 256 MPSS 256 MRRS 128
>> Uhhuh. NMI received for unknown reason 21 on CPU 0.
>> Do you have a strange power saving mode enabled?
>> Dazed and confused, but trying to continue
>
> Ok, I commented out the "pcie_write_mps(dev, mps);" line and the error
> stopped, but this made me realize that the pci=pcie_bus_safe option must
> have been missing. It turns out I had hacked a custom grub entry to load
> the newest kernel into grub instead of the one with the highest version
> number (grumble), so the default kopt didn't apply there.
>
> So, pci=pcie_bus_safe DOES fix this case, and I've confirmed that the
> MRRS-disabling patch makes no difference in this case.
>
> Can we just make pci=pcie_bus_safe (as in previous behavior) the default,
> or make it not change where it would otherwise warn, or does that
> basically make the thing useless?

I have a patch that makes pcie_bus_safe the default behavior
and does not modify the MRRS.  Would you be willing to test this patch
for me?
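
For readers following the MPS/MPSS/MRRS numbers in the dmesg output
quoted above, here is a rough illustrative sketch. It is not the patch
itself. It assumes the standard PCIe register layout (MPSS in Device
Capabilities bits 2:0, MPS in Device Control bits 7:5, MRRS in Device
Control bits 14:12, each a 3-bit field meaning 128 << n bytes) and
assumes that a "safe" policy amounts to picking the smallest MPSS among
the devices considered. All function names are hypothetical, and
treating the four logged devices as a single fabric is a simplification.

```python
# Illustrative sketch only; not the kernel patch discussed in this thread.
# Field positions follow the PCIe Device Capabilities / Device Control
# register layout; every function name here is hypothetical.


def decode_size(field):
    """Map a 3-bit PCIe size encoding to bytes (0 -> 128, 1 -> 256, ...)."""
    return 128 << (field & 0x7)


def mpss_from_devcap(devcap):
    """Max_Payload_Size Supported: Device Capabilities bits 2:0."""
    return decode_size(devcap & 0x7)


def mps_from_devctl(devctl):
    """Max_Payload_Size currently programmed: Device Control bits 7:5."""
    return decode_size((devctl >> 5) & 0x7)


def mrrs_from_devctl(devctl):
    """Max_Read_Request_Size: Device Control bits 14:12."""
    return decode_size((devctl >> 12) & 0x7)


def safe_mps(mpss_by_device):
    """Largest payload every listed device can accept: the minimum MPSS."""
    return min(mpss_by_device.values())


# MPSS values as first reported for each device in the quoted dmesg output.
# Assumption: all four devices are treated as one fabric.
fabric = {
    "0000:00:02.0": 256,
    "0000:06:00.0": 256,
    "0000:07:00.0": 256,
    "0000:08:00.0": 128,
}

print(safe_mps(fabric))  # prints 128
```

Under these assumptions, 08:00.0 (MPSS 128) caps the payload size at
128 bytes, which is consistent with the "MPS configured higher than
maximum supported by the device" warning it logged when given 256.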

Thanks,
Jon

>
> Simon-
>