netdev - Re: [Bug #14925] sky2 panic under load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20100112075059.GA6628@ff.dom.local>
Date:	Tue, 12 Jan 2010 07:50:59 +0000
From:	Jarek Poplawski <jarkao2@...il.com>
To:	David Miller <davem@...emloft.net>
Cc:	shemminger@...tta.com, mikem@...g3k.org, flyboy@...il.com,
	rjw@...k.pl, netdev@...r.kernel.org
Subject: Re: [Bug #14925] sky2 panic under load

On Mon, Jan 11, 2010 at 04:14:19PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@...il.com>
> Date: Mon, 11 Jan 2010 23:07:54 +0100
> 
> > I don't agree: you both try to make this check more specific.
> > Actually, since this problem is quite tricky, I wondered about
> > making it more genaral and visible for other maintainers by
> > adding something like netif_wake_queue_present() with the
> > check and some comment.
> 
> This detracts from the real problem.
> 
> The issue is that we have a code path bringing a device down, which
> uses helper routines which are meant to be executed when the device is
> up and functioning normally.
> 
> That's the bug.
> 
> No other driver does silly things like call helper routines which wake
> the TX queue when taking the chip down.
> 
> Fix that, not the immediate symptoms.  Write a routine that
> unconditionally clears the TX queue, frees the packets, etc. and
> has none of the wake logic.
> 
> That's what most other driver do, and those that don't should be
> fixed similarly.

As I wrote it's tricky and could be hard to track even if you "heard"
about it, e.g. Mike moves netif_wake_queue() to another place in his
last proposal, but it still can run after netif_device_detach() until 
napi_synchronize() in sky2_down().

I think, I can see similar problems e.g. in gianfar or netxen, where
napi_disable() is done after netif_device_detach(), especially in
suspend procedures (there might be less severe (than oops) effects
yet). IMHO, it all looks simply error prone (sometime you have to
know a driver well to track all possible paths to say it's really
safe).

In the meantime users are endangered with known oopses or at least
suspend problems, while we know the reasons, can fix it (even
temporarily) with one-liners, but wait for the right fixes. (Btw,
such fixes might require a significant rewrite, risking inserting
new bugs, so I doubt they are "right" for -stable.) Anyway, I'm
OK with this (I hope what should be done will be done, and I
don't have sky2 ;-)

Cheers,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html