netdev - RE: [net-next 5/9] e1000e: Disable ASPM L1 on 82574

Message-ID: <9BBC4E0CF881AA4299206E2E1412B6261882C3A9@FMSMSX151.amr.corp.intel.com>
Date:	Thu, 3 May 2012 20:12:47 +0000
From:	"Wyborny, Carolyn" <carolyn.wyborny@...el.com>
To:	Nix <nix@...eri.org.uk>,
	"Kirsher, Jeffrey T" <jeffrey.t.kirsher@...el.com>
CC:	"davem@...emloft.net" <davem@...emloft.net>,
	Chris Boot <bootc@...tc.net>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"gospo@...hat.com" <gospo@...hat.com>,
	"sassmann@...hat.com" <sassmann@...hat.com>
Subject: RE: [net-next 5/9] e1000e: Disable ASPM L1 on 82574



>-----Original Message-----
>From: Nix [mailto:nix@...eri.org.uk]
>Sent: Thursday, May 03, 2012 3:09 AM
>To: Kirsher, Jeffrey T
>Cc: davem@...emloft.net; Chris Boot; netdev@...r.kernel.org;
>gospo@...hat.com; sassmann@...hat.com; Wyborny, Carolyn
>Subject: Re: [net-next 5/9] e1000e: Disable ASPM L1 on 82574
>
 [..]
>(reminder: this is known not to fix the instance of this problem I am
>experiencing, where ASPM is being re-enabled by something even if turned
>off via setpci during boot, though it does fix those instances seen by
>others where that doesn't happen. I'd have done more printf()-scattering
>debugging to see where it's turned back on if it wasn't that this is
>happening on an always-on server for which rebooting outside the dead of
>night is a long-winded chore...)
>
>FWIW I have also seen -- very rare -- lockups of the same nature on
>82574L links in 100MbE mode using non-jumbo frames. However they are far
>more common on GbE jumbo-framed links, normally taking less than an hour
>to take the link down with a wildly corrupted register set (as shown by
>ethtool).
>
>(It's annoying this firmware isn't flashable so we could just *fix* this
>bug rather than working around it. :( )
>
>
>I think I might cheat a bit next and printk_once() the state of ASPM L1
>on the errant PCI device from inside the scheduler when it flips from L1
>off to L1 on again. At 100 tests per second that should indicate at what
>time the thing is turned back on fairly tightly: even if not providing a
>direct clue as to which bit of the kernel is doing it, if I combine it
>with a set -x in userspace it should at least indicate what bit of the
>boot process is happening at the same time. It'll be the weekend before
>I can try that though.
>
>--
>NULL && (void)

Hello,

It would be good to know why and how your system is re-enabling the setting. Unfortunately, the problem is not solvable in firmware and is somewhat platform dependent. MMIO-tracer might be used to see when the config space write that re-enables ASPM happens, but it might be too heavyweight for a live production system.
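
A lighter-weight way to catch the moment L1 comes back may be to poll the device's PCIe Link Control register from userspace. The following is only a minimal sketch, not anything the driver does: it assumes Linux sysfs exposes the config space at /sys/bus/pci/devices/<BDF>/config, that the process runs as root (unprivileged reads are limited to the first 64 bytes), and the helper names find_pcie_cap, aspm_state and read_config are hypothetical. The register offsets and bit values are those of the standard PCI and PCIe capability layout.

```python
import struct

# Standard PCI config-space offsets and bits.
PCI_STATUS = 0x06               # Status register (16 bit)
PCI_STATUS_CAP_LIST = 0x10      # Device implements a capability list
PCI_CAPABILITY_LIST = 0x34      # Pointer to the first capability
PCI_CAP_ID_EXP = 0x10           # PCI Express capability ID
PCI_EXP_LNKCTL = 0x10           # Link Control, relative to the PCIe capability
PCI_EXP_LNKCTL_ASPM_L0S = 0x1   # ASPM L0s enabled
PCI_EXP_LNKCTL_ASPM_L1 = 0x2    # ASPM L1 enabled


def find_pcie_cap(cfg):
    """Return the offset of the PCIe capability in cfg, or None."""
    if len(cfg) < 64:
        return None
    (status,) = struct.unpack_from("<H", cfg, PCI_STATUS)
    if not status & PCI_STATUS_CAP_LIST:
        return None
    pos = cfg[PCI_CAPABILITY_LIST] & 0xFC
    seen = set()  # guard against a looping capability chain
    while pos >= 0x40 and pos + 2 <= len(cfg) and pos not in seen:
        seen.add(pos)
        if cfg[pos] == PCI_CAP_ID_EXP:
            return pos
        pos = cfg[pos + 1] & 0xFC
    return None


def aspm_state(cfg):
    """Return {'l0s': bool, 'l1': bool} from Link Control, or None."""
    pos = find_pcie_cap(cfg)
    if pos is None or pos + PCI_EXP_LNKCTL + 2 > len(cfg):
        return None
    (lnkctl,) = struct.unpack_from("<H", cfg, pos + PCI_EXP_LNKCTL)
    return {
        "l0s": bool(lnkctl & PCI_EXP_LNKCTL_ASPM_L0S),
        "l1": bool(lnkctl & PCI_EXP_LNKCTL_ASPM_L1),
    }


def read_config(bdf):
    """Read the first 256 bytes of config space for a device such as
    '0000:01:00.0' (hypothetical address; needs root for the full read)."""
    with open(f"/sys/bus/pci/devices/{bdf}/config", "rb") as f:
        return f.read(256)
```

Calling aspm_state(read_config(bdf)) in a loop and logging a timestamp whenever "l1" flips from False to True should narrow down when the setting is restored, though not which code path wrote it.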

I am also working on a patch to the driver to detect the condition and then do a slot reset, to avoid a whole system reboot. Would you be willing to test it on your problem system?

Thanks,

Carolyn
