Date:	Tue, 19 Feb 2008 08:47:05 -0800
From:	"Kok, Auke" <auke-jan.h.kok@...el.com>
To:	Bernd Schubert <bernd-schubert@....de>
CC:	netdev@...r.kernel.org
Subject: Re: e1000: Detected Tx Unit Hang

Bernd Schubert wrote:
> On Saturday 16 February 2008, Kok, Auke wrote:
>> Bernd Schubert wrote:
>>> Hello,
>>>
>>> I can't login to one of our servers and just got this in an ipmi sol
>>> session:
>>>
>>> [18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.209183]   Tx Queue             <0>
>>> [18169.209184]   TDH                  <e3>
>>> [18169.209185]   TDT                  <e3>
>>> [18169.209186]   next_to_use          <e3>
>>> [18169.209187]   next_to_clean        <bd>
>>> [18169.209188] buffer_info[next_to_clean]
>>> [18169.209189]   time_stamp           <10043e4d2>
>>> [18169.209190]   next_to_watch        <be>
>>> [18169.209191]   jiffies              <10043e6f6>
>>> [18169.209192]   next_to_watch.status <1>
>>> [18169.256978] e1000: eth2: e1000_clean_tx_irq: Detected Tx Unit Hang
>>> [18169.256979]   Tx Queue             <0>
>>> [18169.256980]   TDH                  <de>
>>> [18169.256982]   TDT                  <de>
>>> [18169.256983]   next_to_use          <de>
>>> [18169.256984]   next_to_clean        <bc>
>>> [18169.256985] buffer_info[next_to_clean]
>>> [18169.256986]   time_stamp           <10043e511>
>>> [18169.256987]   next_to_watch        <bd>
>>> [18169.256988]   jiffies              <10043e701>
>>> [18169.256989]   next_to_watch.status <1>
>>>
>>> This is with 2.6.22.18. Is there any chance to recover the system? For
>>> some reasons I would prefer not to reboot now.
>> if that's all you have then it was a false alarm. there should be a 'netdev
>> timeout - link reset' following those messages. can you send some more
>> context on those messages?
> 
> All I presently know is that there are 20 servers and login doesn't work any 
> more - sysrq+t does show me it hangs in fuse, which is accessing the 
> underlying nfs (we are using unionfs-fuse). While I checked the sysrq-t 
> output suddenly these e1000 messages appeared.
> Thinking a bit about it, it either could be 2.6.22.18 has an e1000 bug, which 
> 2.6.22.X didn't have (X=16, I think, but I'm not sure) or someone  
> mis-configured the switch/network environment today. 
> Hmm, now that I think about the last part, there already had been other 
> networking problems today, which were supposed to be fixed several hours ago. 
> Seems they didn't fix it properly.
> 
>> in real tx hang cases, the hardware is reset within 2 seconds, and
>> everything continues as normal.
> 
> Thanks, this gives me hope I don't need to reboot the servers (reboot would 
> mean I would need to start 60 md-raid rebuilds...).

my first thought after reading this e-mail is that the tx-hang message is just a
symptom of your system not responding or being spinlocked all the time. These TX
hang issues normally do not interfere with system operation at all, and unless
you have continuous TX resets you would be able to log on perfectly fine.
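
As a rough illustration of how the dump quoted above can be read, here is a
small sketch that parses the eth0 block and derives three numbers from it. It
is not part of the driver: the helper names, the 256-entry ring size and the HZ
value are assumptions, and every value in angle brackets is assumed to be hex.

```python
import re

# The eth0 block exactly as quoted earlier in this thread.
DUMP = """\
[18169.209181] e1000: eth0: e1000_clean_tx_irq: Detected Tx Unit Hang
[18169.209183]   Tx Queue             <0>
[18169.209184]   TDH                  <e3>
[18169.209185]   TDT                  <e3>
[18169.209186]   next_to_use          <e3>
[18169.209187]   next_to_clean        <bd>
[18169.209188] buffer_info[next_to_clean]
[18169.209189]   time_stamp           <10043e4d2>
[18169.209190]   next_to_watch        <be>
[18169.209191]   jiffies              <10043e6f6>
[18169.209192]   next_to_watch.status <1>
"""

# "[timestamp]   name   <hexvalue>"; lines without <...> are skipped.
FIELD_RE = re.compile(r"^\[\s*[\d.]+\]\s+(.+?)\s+<([0-9a-fA-F]+)>\s*$")


def parse_hang_dump(text):
    """Return {field name: integer value} for every 'name <hex>' line."""
    fields = {}
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields, ring_size=256, hz=250):
    """Derive a few figures from a parsed dump.

    ring_size and hz are assumptions: neither value appears in the dump.
    """
    stalled = fields["jiffies"] - fields["time_stamp"]
    return {
        # Descriptors the driver has queued but not yet cleaned up.
        "pending_cleanup": (fields["next_to_use"] - fields["next_to_clean"]) % ring_size,
        # Head == tail: the hardware has nothing left to fetch.
        "hw_queue_empty": fields["TDH"] == fields["TDT"],
        # Age of the oldest uncleaned buffer, in jiffies and in seconds.
        "stalled_jiffies": stalled,
        "stalled_seconds": stalled / hz,
    }


if __name__ == "__main__":
    for key, value in summarize(parse_hang_dump(DUMP)).items():
        print(f"{key}: {value}")
```

For the eth0 dump this gives 38 descriptors waiting for cleanup, TDH equal to
TDT, and a stall of 548 jiffies; how many seconds that is depends on the
kernel's HZ setting, which the dump does not show.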

I think you might have hit another kernel bug here... perhaps even unionfs/fuse
related and that certainly looks plausible from your problem description.

looking at the changelog for 2.6.22.16->2.6.22.18 I can't see anything relevant
(see
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.22.y.git;a=shortlog),
and there are definitely no e1000 driver changes in that range.

I don't suppose you can do a git-bisect? that would certainly help. I don't think
we can rule out anything just yet here.
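
For anyone unfamiliar with the idea behind git-bisect, it is a binary search
over the commits between a known-good and a known-bad kernel. A minimal sketch
of that search, assuming a linear history and a hypothetical is_bad() callback
that stands in for "build, boot and see whether the problem shows up" (this is
an illustration of the principle, not git's actual implementation):

```python
def bisect_first_bad(commits, is_bad):
    """Return the first bad commit in an oldest-to-newest list.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid    # the regression is at mid or earlier
        else:
            good = mid   # the regression is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical commit labels; pretend "c6" introduced the regression.
    history = [f"c{i}" for i in range(10)]
    print(bisect_first_bad(history, lambda c: int(c[1:]) >= 6))
```

Each round halves the remaining range, so even a few hundred commits need
fewer than ten test boots.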

At least try to revert some of your systems to the previous kernel version and see
if the problem goes away...

Auke