Date:	Wed, 20 Jul 2011 09:38:32 -0700 (Pacific Daylight Time)
From:	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>
To:	Dave Jones <davej@...hat.com>
cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: sch_generic warn_on (timed out)

On Mon, 11 Jul 2011, Dave Jones wrote:
>             WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n",
>                       dev->name, netdev_drivername(dev, drivername, 64), i);
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=702723 is our 'master bug' that we're
> duping all others against. It seems to be showing up on a variety of different
> hardware (r8169, atl1c, ipheth, e1000e, 8139too). Do all these drivers need
> fixing ? or is it just 'crap hardware' ?

neither, probably

> note that I've only been looking through fedora 15 bugs so far (which is still on 2.6.38),
> but looking at the commit log for sch_generic, it doesn't seem that there's anything
> obvious that needs backporting.

It used to be just a KERN_ERR printk; it was then changed to a WARN_ONCE 
in order to trigger kerneloops reports, so that we would know how many 
people were getting tx hangs from their hardware.

The bad news is that the backtrace never contains anything useful beyond 
which driver it is.  Users have been trained to send backtraces for panic 
messages, and in this case the backtrace does little to identify what 
the problem was.

If the reports for each driver could be traced back to a specific 
*model* of hardware, that might be useful (particularly for Intel 
hardware).  Maybe the WARN_ONCE should print the vendor/device pair so 
we would at least know the hardware from the warning trace.
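
As a rough illustration of that idea, here is a minimal userspace sketch 
of what such a message could look like.  It is not kernel code: the 
helper name format_watchdog_msg and the [vendor:device] layout are 
hypothetical, and the ID values in the usage note are only example 
inputs.  In the kernel, the IDs would have to be fetched from the 
underlying bus device (for PCI, the vendor and device fields of struct 
pci_dev), which this sketch does not attempt.

```c
#include <stdio.h>

/*
 * Hypothetical helper (userspace illustration only): build the watchdog
 * message with a 16-bit vendor/device ID pair added after the driver name.
 * Returns what snprintf() returns: the length the full message needs,
 * excluding the terminating NUL.
 */
int format_watchdog_msg(char *buf, size_t len, const char *ifname,
                        const char *driver, unsigned int queue,
                        unsigned int vendor, unsigned int device)
{
    return snprintf(buf, len,
                    "NETDEV WATCHDOG: %s (%s) [%04x:%04x]: "
                    "transmit queue %u timed out",
                    ifname, driver,
                    vendor & 0xffffu, device & 0xffffu, queue);
}
```

Called with "eth0", "e1000e", queue 0, vendor 0x8086 and an example 
device ID of 0x10d3, this produces a line containing "[8086:10d3]", 
which would be enough to group reports by hardware model.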

If this is happening more frequently on F15 than on F14 across multiple 
pieces of hardware, it may indicate either that a kernel/stack change has 
started to (ab)use the drivers under a changed working model, or that 
there is an actual kernel issue with locks, interrupts, or tx completions 
that is excessively delaying the completion of transmits.

Dave, can you query for F14 reports and/or isolate which kernel this 
started with?