lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20100112075059.GA6628@ff.dom.local>
Date:	Tue, 12 Jan 2010 07:50:59 +0000
From:	Jarek Poplawski <jarkao2@...il.com>
To:	David Miller <davem@...emloft.net>
Cc:	shemminger@...tta.com, mikem@...g3k.org, flyboy@...il.com,
	rjw@...k.pl, netdev@...r.kernel.org
Subject: Re: [Bug #14925] sky2 panic under load

On Mon, Jan 11, 2010 at 04:14:19PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@...il.com>
> Date: Mon, 11 Jan 2010 23:07:54 +0100
> 
> > I don't agree: you both try to make this check more specific.
> > Actually, since this problem is quite tricky, I wondered about
> > making it more genaral and visible for other maintainers by
> > adding something like netif_wake_queue_present() with the
> > check and some comment.
> 
> This detracts from the real problem.
> 
> The issue is that we have a code path bringing a device down, which
> uses helper routines which are meant to be executed when the device is
> up and functioning normally.
> 
> That's the bug.
> 
> No other driver does silly things like call helper routines which wake
> the TX queue when taking the chip down.
> 
> Fix that, not the immediate symptoms.  Write a routine that
> unconditionally clears the TX queue, frees the packets, etc. and
> has none of the wake logic.
> 
> That's what most other driver do, and those that don't should be
> fixed similarly.

As I wrote it's tricky and could be hard to track even if you "heard"
about it, e.g. Mike moves netif_wake_queue() to another place in his
last proposal, but it still can run after netif_device_detach() until 
napi_synchronize() in sky2_down().

I think, I can see similar problems e.g. in gianfar or netxen, where
napi_disable() is done after netif_device_detach(), especially in
suspend procedures (there might be less severe (than oops) effects
yet). IMHO, it all looks simply error prone (sometime you have to
know a driver well to track all possible paths to say it's really
safe).

In the meantime users are endangered with known oopses or at least
suspend problems, while we know the reasons, can fix it (even
temporarily) with one-liners, but wait for the right fixes. (Btw,
such fixes might require a significant rewrite, risking inserting
new bugs, so I doubt they are "right" for -stable.) Anyway, I'm
OK with this (I hope what should be done will be done, and I
don't have sky2 ;-)

Cheers,
Jarek P.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ