Message-ID: <alpine.WNT.2.00.1107200931080.1084@JBRANDEB-DESK2.amr.corp.intel.com>
Date: Wed, 20 Jul 2011 09:38:32 -0700 (Pacific Daylight Time)
From: "Brandeburg, Jesse" <jesse.brandeburg@...el.com>
To: Dave Jones <davej@...hat.com>
cc: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: sch_generic warn_on (timed out)
On Mon, 11 Jul 2011, Dave Jones wrote:
> WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
> dev->name, netdev_drivername(dev, drivername, 64), i);
>
> https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
> duping all others against. It seems to be showing up on a variety of different
> hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> fixing ? or is it just 'crap hardware' ?
Neither, probably.
> note that I've only been looking through fedora 15 bugs so far (which is still on 2.6.38),
> but looking at the commit log for sch_generic, it doesn't seem that there's anything
> obvious that needs backporting.
It used to be just a KERN_ERR printk; it was then changed to a WARN_ONCE
in order to trigger kerneloops reports, so that we would know how many
people were getting tx hangs from their hardware.
The bad news is that there is never anything useful in the backtrace
besides which driver it is. Users have been trained to send backtraces
for panic messages, and in this case the backtrace does little to
identify what the problem was.
If the reports within each driver could be traced back to a
specific *model* of hardware then that might be useful (particularly for
Intel hardware). Maybe the WARN_ONCE should print the vendor/device pair
so we would at least know the hardware from the panic trace.
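As a rough illustration of the vendor/device idea, here is a minimal
userspace sketch of what the extended message could look like. It is not
kernel code: struct nic_info and format_tx_timeout() are hypothetical
names, and the bracketed [vendor:device] layout is an assumption rather
than an existing format. Non-PCI devices (e.g. the USB-based ipheth)
would need some other identifier.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for what a driver could report; not a kernel struct. */
struct nic_info {
	const char *ifname;     /* interface name, e.g. "eth0" */
	const char *driver;     /* driver name, e.g. "e1000e" */
	unsigned short vendor;  /* PCI vendor ID (assumes a PCI device) */
	unsigned short device;  /* PCI device ID (assumes a PCI device) */
};

/*
 * Build the watchdog message with the vendor:device pair added.
 * Returns what snprintf() returns: the length the full message needs,
 * which is >= len if the output was truncated.
 */
int format_tx_timeout(char *buf, size_t len,
		      const struct nic_info *nic, unsigned int queue)
{
	return snprintf(buf, len,
			"NETDEV WATCHDOG: %s (%s) [%04x:%04x]: "
			"transmit queue %u timed out",
			nic->ifname, nic->driver,
			(unsigned int)nic->vendor,
			(unsigned int)nic->device, queue);
}
```

With example values ifname "eth0", driver "e1000e", vendor 0x8086,
device 0x10d3 and queue 0, this produces:
NETDEV WATCHDOG: eth0 (e1000e) [8086:10d3]: transmit queue 0 timed out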
If this is happening more frequently on F15 than F14 across multiple
pieces of hardware, it may indicate that a kernel/stack change is starting
to (ab)use a changed working model that is causing an issue, or that there
is an actual kernel issue with locks or interrupts or tx completions that
is causing an excessive delay in completion of transmits.
Dave, can you query for F14 reports and/or isolate which kernel this
started with?