Message-ID: <20081202225533.GA28767@1wt.eu>
Date: Tue, 2 Dec 2008 23:55:33 +0100
From: Willy Tarreau <w@....eu>
To: Matt Carlson <mcarlson@...adcom.com>
Cc: Roger Heflin <rogerheflin@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
LKML <linux-kernel@...r.kernel.org>,
netdev <netdev@...r.kernel.org>
Subject: Re: WARNING: at net/sched/sch_generic.c:219 dev_watchdog+0xfe/0x17e() with tg3 network
Hi Matt,
I ran a lot of tests last night and have some more information.
The issue sometimes takes a while to reproduce, which led me
to blame the wrong culprits among the 29 patches affecting tg3
between 2.6.25 and 2.6.27.7. I was finally able to reproduce
the issue by running the plain 2.6.25 driver (v3.90) on 2.6.27.7,
but not at all when running on 2.6.25, even after ten minutes
(on 2.6.27.7, it takes between 5 s and 1 min to get a tx timeout).
Later, I noticed that 2.6.27's driver uses libphy, which was
never removed between tests. I wonder if it can interfere with
my tests. Maybe it initializes the phy differently from plain
2.6.25, causing delayed issues; I don't know. Unfortunately,
I cannot run 2.6.27's driver on 2.6.25 because of the libphy
dependency (that's how I discovered it).
I'm also now 100% certain that enabling/disabling flow control
(FC) does not change anything with either kernel. So unless the
hardware still interprets pause frames when FC is disabled, the
problem should not come from there.
I suspect that the switch is failing: the problem happens
more often when it's been transferring at full speed for some
time. Since it's a cheap one lying on a desk, it might have
burned-out capacitors causing randomly corrupted frames to go
out from time to time (maybe even pause frames preventing the
NIC from sending). That was also a problem
for my tests, because after patching/unpatching and compilation
phases, it had some time to rest and took longer to reproduce
the issue.
I will re-run some tests on 2.6.27 + tg3 v3.90 (from 2.6.25)
without ever loading libphy after power-up, in order to
clearly identify whether the problem is caused by the driver or
something else in the kernel. If it's something else, the
bisect will take a few weeks since I'm not there long enough
to run about 15 full builds and wait long enough for the
problem to (not) occur.
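
As a rough sanity check on the figure of about 15 builds: a bisection
halves the candidate range at each step, so the number of builds grows
with log2 of the number of commits. A minimal sketch, assuming a
hypothetical range of about 30000 commits (the real count between the
two kernels is not stated here, and an actual git bisect may need a
step more or less depending on the history's shape):

```python
def bisect_steps(n_commits: int) -> int:
    """Worst-case number of build-and-test rounds needed to isolate
    one bad commit among n_commits candidates by repeated halving."""
    if n_commits < 1:
        raise ValueError("need at least one candidate commit")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1,
    # computed with integers only (no floating-point rounding).
    return (n_commits - 1).bit_length()


if __name__ == "__main__":
    # 30000 is an illustrative assumption, not a measured commit count.
    print(bisect_steps(30000))
```

Each of those rounds costs a full build plus the time needed to be
reasonably sure the tx timeout does not occur, which is why the wait
dominates the total.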
But I'm keeping hope, there's no reason not to find it!
Regards,
Willy