Message-ID: <4DAF1F1D.1020108@canonical.com>
Date:	Wed, 20 Apr 2011 11:59:57 -0600
From:	Tim Gardner <tim.gardner@...onical.com>
To:	Ben Hutchings <bhutchings@...arflare.com>
CC:	netdev <netdev@...r.kernel.org>
Subject: Fix atl1c event race (was Re: 2.6.38 dev_watchdog WARNING)

On 04/19/2011 12:49 PM, Ben Hutchings wrote:
> On Tue, 2011-04-19 at 11:40 -0600, Tim Gardner wrote:
>> I'm seeing a lot of these kinds of bugs: WARNING: at
>> /build/buildd/linux-2.6.38/net/sched/sch_generic.c:256
>> dev_watchdog+0x213/0x220()
>>
>> The kernel is 2.6.38.2 plus Ubuntu cruft.
>>
>> A spot check of the 200+ hits on this string indicates they are
>> primarily due to these drivers:
>>
>> ipheth
>> atl1c
>> sis900
>> r8169
>>
>> As far as I can tell, the warning happens when the link is down on the
>> media (and has never been up) and the driver is sent a transmit packet
>> which never completes. Is there a net/core or net/sched requirement to
>> which these drivers do not conform? Are they not correctly indicating
>> link status?
>
> The watchdog fires when the software queue has been stopped *and* the
> link has been reported as up for over dev->watchdog_timeo ticks.
>
> The software queue should be stopped iff the hardware queue is full or
> nearly full.  If the software queue remains stopped and the link is
> still reported up, then one of these things is happening:
>
> 1. The link went down but the driver didn't notice
> 2. TX completions are not being indicated or handled correctly
> 3. The hardware TX path has locked up
> 4. The link is stalled by excessive pause frames or collisions
> 5. Timeout is too low and/or low watermark is too high
> (there may be other explanations)
>
> I think the watchdog is primarily meant to deal with case 3, though all
> of cases 1-3 may be worked around by resetting the hardware.
>
> Ben.
>
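
The firing condition described in the quoted reply (software queue stopped, link reported up, timeout exceeded) can be sketched as a small standalone model. This is an illustration only, not kernel code: struct txq_model and watchdog_would_fire() are hypothetical names, and measuring the timeout from the last transmit start is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of one TX queue (not a kernel struct). */
struct txq_model {
    bool carrier_ok;           /* link is currently reported as up */
    bool queue_stopped;        /* software queue has been stopped */
    unsigned long trans_start; /* tick at which the last transmit started */
};

/*
 * Returns true when a watchdog following the quoted rule would fire:
 * the link is reported up, the software queue is stopped, and more than
 * watchdog_timeo ticks have passed since trans_start (assumed reference
 * point). The signed difference keeps the comparison correct across
 * tick-counter wraparound.
 */
static bool watchdog_would_fire(const struct txq_model *q,
                                unsigned long now,
                                unsigned long watchdog_timeo)
{
    assert(q != NULL);

    if (!q->carrier_ok)
        return false;   /* link down: the watchdog stays quiet */
    if (!q->queue_stopped)
        return false;   /* queue running: nothing is stuck */

    return (long)((q->trans_start + watchdog_timeo) - now) < 0;
}
```

In this model, a driver that correctly reports link down never trips the check, which is why case 1 in the list above (link went down but the driver did not notice) shows up as a watchdog timeout.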

I've been focusing on atl1c while trying to understand why link status 
flapping could cause these watchdog timeouts. I've a couple of log files 
with link state change information:

http://bugs.launchpad.net/bugs/766273
https://launchpadlibrarian.net/69926580/BootDmesg.txt
https://launchpadlibrarian.net/69926583/CurrentDmesg.txt

One thing of note is that there are two link UP messages in a row,
which should only be possible if there has been an intervening device
reset (none is evident in the logs). I've noticed that the work event
scheduling looks racy, so perhaps this will help. See attached.
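
The attachment is not reproduced in this archive, so the following is not the patch itself. It is a generic sketch, under stated assumptions, of one common way to close a race between an interrupt handler that posts work events and a task that consumes them: make both the set and the test-and-clear of each event bit atomic. The names adapter_model, EVENT_RESET, EVENT_LINK_CHANGE, post_event() and take_event() are hypothetical, and C11 atomics stand in for whatever primitives the driver actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical event bits; not taken from the atl1c driver. */
#define EVENT_RESET       (1u << 0)
#define EVENT_LINK_CHANGE (1u << 1)

/* Hypothetical stand-in for the per-adapter state. */
struct adapter_model {
    atomic_uint work_event; /* pending event bits */
};

/* Producer side (e.g. interrupt context): atomically set an event bit. */
static void post_event(struct adapter_model *a, unsigned int ev)
{
    assert(ev != 0);
    atomic_fetch_or(&a->work_event, ev);
}

/*
 * Consumer side (e.g. deferred work): atomically clear the bit and report
 * whether it was set. Because the read and the clear are one operation,
 * an event posted concurrently is either taken now or left pending; it
 * cannot be lost the way it can with a separate "read, then write zero".
 */
static bool take_event(struct adapter_model *a, unsigned int ev)
{
    unsigned int old = atomic_fetch_and(&a->work_event, ~ev);
    return (old & ev) != 0;
}
```

With this pattern each posted event is consumed at most once, so a single link change cannot be handled twice by the task.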

rtg
-- 
Tim Gardner tim.gardner@...onical.com

View attachment "0001-atl1c-Fix-work-event-interrupt-task-races.patch" of type "text/x-patch" (2791 bytes)
