linux-kernel - Re: WARNING at drivers/pci/search.c:214 for 3.9

Message-ID: <20130507153349.4d03040a@pluto.restena.lu>
Date:	Tue, 7 May 2013 15:33:49 +0200
From:	Bruno Prémont <bonbons@...ux-vserver.org>
To:	Borislav Petkov <bp@...en8.de>
Cc:	LKML <linux-kernel@...r.kernel.org>,
	Linux-ACPI <linux-acpi@...r.kernel.org>,
	Len Brown <lenb@...nel.org>, "Rafael J. Wysocki" <rjw@...k.pl>,
	Lance Ortiz <lance.ortiz@...com>,
	Tony Luck <tony.luck@...el.com>,
	Matthew Garrett <mjg59@...f.ucam.org>
Subject: Re: WARNING at drivers/pci/search.c:214 for 3.9

On Tue, 7 May 2013 12:38:30 +0200 Borislav Petkov wrote:
> On Tue, May 07, 2013 at 08:52:05AM +0200, Bruno Prémont wrote:
> > Better that way (log_buf_len=10M)!
> > 
> > The full boot log is available at:
> >   http://pastebin.com/hVVne14C
> > (the Hardware Error message is there right before the series of
> > WARNINGs)
> 
> Yep, thanks.
> 
> So your error doesn't happen straight after the box has booted but
> later, ~70 seconds within the boot. I'm guessing that's reproducible?
> Are you doing something specific right after the machine is booted? It
> doesn't look so to me because you're in cpu_idle when the timer IRQ
> happens.
> 
> It looks like this is the polling interval that comes from the GHES
> gunk.
> 
> I guess what I'm trying to say is, are you doing something special to
> cause the PCIe error or it just happens while the machine is idle?

No, not doing anything special (except maybe booting a vanilla Linux
kernel I compiled myself).
It happens even when booting with init=/bin/bash and just staring
at the monitor.

> What about a BIOS update?

Last time I checked (update DVD, sometime last winter) there was none.

Checking online now there is one, though release information does not
include details...

  BIOS V4.6.5.3 R2.21.0 for RX200 S7
  ==================================
  included components:
   VGA: MATROX/MGA-G200 VGA/VBE BIOS (V3.8SQ) b33
   LAN: PXE OPROM: Intel(R) Boot Agent GE v1.3.72 PXE 2.1 Build 089
   LAN: iSCSI OPROM: iSCSI Remote Boot version 2.7.97
   Intel Reference Code Package for Romley v1.0.023
   Intel SAS OPROM v3.1.0.2101
   Patsburg SCU: LSI SAS OPROM SCU.11.08021201P

  Added Changes/Fixed Issues in from Rev 2.19.0 to Rev. R2.21.0:
  ==============================================================
  - fix for VIOM

  Added Changes/Fixed Issues in from Rev 2.16.0 to Rev. R2.19.0:
  ==============================================================
  - new Intel Reference Code
  - some minor bug fixes

  Added Changes/Fixed Issues in from Rev 2.4.0 to Rev. R2.16.0:
  ==============================================================
  - Update LSI SCU option ROM to version 11.08021201P
  - some minor bug fixes
  - fix for LRDIMM
  - Correct the settings for BIOS Setup SATA configuration
  - fixes for WHEA
  - fixes for TPM

Original BIOS revision was 2.4.0.
According to the download page, 2.4.0 was released in August 2012,
                                2.16.0 was released in January 2013,
                                2.21.0 was released in April 2013.

With the BIOS updated, the error message is gone (both the Hardware
Error and the WARNINGs triggered by attempting to look up the source
PCIe device).
Not sure which of the two public updates fixed it...

> > > > For older kernels (3.8.x and older) I only have:
> > > > [   65.741777] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
> > > > [   65.763335] {1}[Hardware Error]: APEI generic hardware error status
> > > > [   65.782650] {1}[Hardware Error]: severity: 2, corrected
> > > > [   65.782652] {1}[Hardware Error]: section: 0, severity: 2, corrected
> > > > [   65.782653] {1}[Hardware Error]: flags: 0x01
> > > > [   65.782655] {1}[Hardware Error]: primary
> > > > [   65.782656] {1}[Hardware Error]: fru_text: CorrectedErr
> > > > [   65.782658] {1}[Hardware Error]: section_type: PCIe error
> > > > [   65.782659] {1}[Hardware Error]: port_type: 0, PCIe end point
> > > > [   65.782660] {1}[Hardware Error]: version: 0.0
> > > > [   65.782662] {1}[Hardware Error]: command: 0xffff, status: 0xffff
> > > > [   65.782664] {1}[Hardware Error]: device_id: 0000:00:02.3
> > > 
> > > Interesting. AFAICT, you don't have such device in lspci below.
> > 
> > Yes it has been that way from the start and under BIOS settings I've
> > found nothing that would make mentioned device visible.
> 
> Hmm, so it could be some hidden device or maybe the error info is
> corrupted. Btw, it also says:
> 
> [   72.948961] PCI AER Cannot get PCI device 0000:00:00.3
> 
> which is also a device you *don't* find in lspci.
> 
> This is fun - detecting PCIe devices by the errors they generate.
> Hahahaha.
> 
> To tell you the truth, nothing will surprise me anymore. :-)

Hidden device, but not hidden well enough :)
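
For anyone wanting to repeat the comparison between the addresses in the
error records and what the kernel actually enumerated, here is a minimal
sketch in Python. It is not part of any kernel tooling: the function
names (reported_devices, missing_devices, present_devices) are made up
for illustration, and it only assumes the two log line shapes quoted in
this thread plus the usual /sys/bus/pci/devices directory, whose entries
are named by PCI address.

```python
import os
import re

# PCI address in domain:bus:device.function form, e.g. 0000:00:02.3.
PCI_ADDR = re.compile(
    r"\b([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\b"
)


def reported_devices(log_text):
    """Return PCI addresses named in hardware-error or AER lines, in order."""
    found = []
    for line in log_text.splitlines():
        # Only look at the two kinds of lines quoted in this thread.
        if "[Hardware Error]" not in line and "PCI AER" not in line:
            continue
        for addr in PCI_ADDR.findall(line):
            addr = addr.lower()
            if addr not in found:
                found.append(addr)
    return found


def missing_devices(log_text, present):
    """Return reported addresses that are absent from the 'present' set."""
    present = {p.lower() for p in present}
    return [a for a in reported_devices(log_text) if a not in present]


def present_devices(sysfs_dir="/sys/bus/pci/devices"):
    """List enumerated PCI devices; empty set if sysfs is unavailable."""
    try:
        return set(os.listdir(sysfs_dir))
    except OSError:
        return set()
```

Feeding it the saved boot log, e.g.
missing_devices(open("boot.log").read(), present_devices()), would list
the addresses that show up only in the error records ("boot.log" being
a placeholder file name).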

> > > > [   65.782665] {1}[Hardware Error]: slot: 0
> > > > [   65.782666] {1}[Hardware Error]: secondary_bus: 0x00
> > > > [   65.782667] {1}[Hardware Error]: vendor_id: 0xffff, device_id: 0xffff
> > > > [   65.782668] {1}[Hardware Error]: class_code: ffffff
> > > > 
> > > > which was being "triggered" by
> > > >  commit 3c076351c4027a56d5005a39a0b518a4ba393ce2
> > > >  Author: Matthew Garrett <mjg@...hat.com>
> > > >  Date:   Thu Nov 10 16:38:33 2011 -0500
> > > > 
> > > >     PCI: Rework ASPM disable code
> > > 
> > > And if you revert it, the error above disappears? Adding Matthew.
> > 
> > Correct (at least on 3.0.y stable series).
> > 
> > 
> > Toggling the "ASPM support" BIOS option makes no difference.
> > 
> > I've even contacted Fujitsu but unfortunately got no useful result as
> > they only support SLES kernels,
> 
> You gotta love hw vendors' excuses. I can translate this message into
> what it actually means :)

Something like "There is no BUG on our side" (while thinking: a bug,
need to fix it silently)?

> > which have Matthew's patch reverted with
> > commit message:
> >   This reverts commit 6cac12dfab9c57a4f76821412224b226a9b08dff,
> >   upstream commit 3c076351c4027a56d5005a39a0b518a4ba393ce2.
> 
> Yeah, they got reverted for SP2 but are back in SP3:
> 
> http://kernel.opensuse.org/cgit/kernel-source/commit/?h=SLE11-SP3&id=cd825d98ec79f777c14531f402d13a66598f3179
> 
> >   My PS/2 keyboard and touchpad are not detected with this patch.
> > 
> >   This turn 3.0.20 in a noop as there is no other patch. Except
> >   numbering is correct for further patches...
> 
> I don't understand: are you saying this patch breaks detection of your
> keyboard and touchpad and if you revert it, it works again? But 3.9 works?

No, that was the commit message written by the SUSE developer who
performed the revert for the SUSE kernel!
