netdev - RE: Linux 2.6.27.13


Message-ID: <F169D4F5E1F1974DBFAFABF47F60C10A1C2250D0@orsmsx507.amr.corp.intel.com>
Date:	Mon, 26 Jan 2009 16:37:29 -0800
From:	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>
To:	Greg KH <gregkh@...e.de>, Jesper Krogh <jesper@...gh.cc>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
CC:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"e1000-devel@...ts.sourceforge.net" 
	<e1000-devel@...ts.sourceforge.net>
Subject: RE: Linux 2.6.27.13

Greg KH wrote:
> On Mon, Jan 26, 2009 at 09:01:36PM +0100, Jesper Krogh wrote:
>> Greg KH wrote:
>>> We (the -stable team) are announcing the release of the 2.6.27.13
>>> kernel. It contains a wide range of bugfixes, and all users of the
>>> 2.6.27 kernel series are strongly encouraged to upgrade.
>>> I'll also be replying to this message with a copy of the patch
>>> between 2.6.27.12 and 2.6.27.13
>> 
>> Hi.
>> 
>> I'm getting some e1000 noise on 2.6.27.6. I searched the log up to
>> .13 but couldn't find any log message that looked like it fixed it.
>> 
>> 
>> [862734.501786] ------------[ cut here ]------------
>> [862734.501793] WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0x1f8/0x210()
>> [862734.501795] NETDEV WATCHDOG: eth0 (e1000): transmit timed out
> 
> I've been getting a lot of reports about this as well.  Did it show up
> in 2.6.27.6?
> 
> Netdev developers, any ideas of what would be causing this?

No immediate idea, but a quick test to help isolate which functionality
could be causing problems is to disable TSO on all four interfaces using
ethtool.

It could be that GSO is somehow playing into this as well, but I don't
know why (you could disable it with ethtool too).
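A minimal sketch of that test, for illustration only. It builds the
`ethtool -K <iface> <feature> off` commands for TSO (and optionally GSO)
and prints them unless told to run them. The interface names eth0 through
eth3 are an assumption (only eth0 appears in the quoted log), and the
helper names `offload_commands` and `apply_commands` are hypothetical.

```python
import subprocess

# Assumption: the four bonded ports are eth0..eth3; adjust to the real names.
INTERFACES = ["eth0", "eth1", "eth2", "eth3"]


def offload_commands(interfaces, features=("tso",)):
    """Build one `ethtool -K <iface> <feature> off` command per pair."""
    return [
        ["ethtool", "-K", iface, feature, "off"]
        for iface in interfaces
        for feature in features
    ]


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Step 1: TSO only.  Step 2, if timeouts persist: TSO and GSO together.
# apply_commands(offload_commands(INTERFACES))
# apply_commands(offload_commands(INTERFACES, features=("tso", "gso")))
```

Turning off one feature at a time keeps the test useful for isolation: if
the timeouts stop with TSO off alone, GSO is less likely to be involved.
`ethtool -k <iface>` shows the current offload settings before and after.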

It could be unrelated, but I've noticed that the TCP window size can grow
much larger now than it used to (especially when talking to LRO-enabled
clients), and this might cause some kind of overflow in the TCP transmit
offload hardware in the e1000 parts.


>> 
>> Complete dmesg here:
>> http://krogh.cc/~jesper/dmesg-2.6.27.6.txt
>> 
>> The system is running with bonded interfaces with (lspci output)
>> 06:01.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 03)
>> 06:01.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 03)
>> 06:02.0 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 03)
>> 06:02.1 Ethernet controller: Intel Corporation 82546EB Gigabit Ethernet Controller (Copper) (rev 03)
>> 
>> The system is still "fully functional", and I haven't noticed
>> anything wrong, but there sure are a lot of link ups and downs on
>> that bond.

In your log I saw one tx timeout for each interface: the first one by
itself, then several more all within a few minutes, and then no more for
a really long time.
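One way to pull that timeline out of a saved dmesg is sketched below. It
assumes the watchdog lines have the same shape as the one quoted above
("[timestamp] NETDEV WATCHDOG: ethN (driver): transmit timed out"); the
function name `timeouts_by_interface` and the local file name in the usage
comment are hypothetical.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [862734.501795] NETDEV WATCHDOG: eth0 (e1000): transmit timed out
WATCHDOG = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: "
    r"(?P<iface>\S+) \((?P<driver>[^)]+)\): transmit timed out"
)


def timeouts_by_interface(lines):
    """Map each interface name to the list of its tx timeout timestamps."""
    events = defaultdict(list)
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            events[match.group("iface")].append(float(match.group("ts")))
    return dict(events)


# Usage, assuming the dmesg was saved locally as dmesg.txt:
# with open("dmesg.txt") as f:
#     for iface, stamps in sorted(timeouts_by_interface(f).items()):
#         print(iface, len(stamps), stamps)
```

The timestamps are seconds since boot, so gaps between consecutive entries
show directly whether the timeouts cluster or are spread out.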

My first reaction is to ask you what test you're running, and ask you to
run the e1000_dump code (see google) to dump the tx descriptor rings at 
the time of failure.

I can get you that code with updates if you're willing to test, but 
it might take a couple of days.

Jesse