linux-kernel - Re: 2.6.29-rc3: tg3 dead after resume

Message-ID: <alpine.LFD.2.00.0901281636270.3123@localhost.localdomain>
Date:	Wed, 28 Jan 2009 17:09:14 -0800 (PST)
From:	Linus Torvalds <torvalds@...ux-foundation.org>
To:	Parag Warudkar <parag.lkml@...il.com>
cc:	netdev@...r.kernel.org,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: 2.6.29-rc3: tg3 dead after resume

On Wed, 28 Jan 2009, Parag Warudkar wrote:
>
> This is similar to the issue reported back in Jul 2007 - 
> http://kerneltrap.org/mailarchive/linux-kernel/2007/8/1/154073/thread 
> which was fixed with a patch to unconditionally save/restore pci config 
> space - that one is still in tg3.c.

In fact, the new PCI suspend/restore code should have made that 
unnecessary, since the PCI layer now makes sure that a save/restore is 
done even if the driver hadn't done it.

But at the same time, still having the driver do it certainly shouldn't 
have _hurt_ anything either. But it's quite possible that the tg3 thing is 
very sensitive to the exact order things happen in - there's a lot of 
comments about bugs in there ;)

> After resume tg3 complains that no firmware is running and eth0 is 
> non-existent. Rmmoding and modprobing tg3 again causes some timeouts and 
> errors from tg3 and the link still doesn't work.

That seems to imply that even the reset failed, which is interesting. 

But it also possibly means that the problem is not necessarily the driver 
itself, but some cached state that we keep around in "struct pci_dev" even 
across a module load/unload. 

For example, if we get the "dev->current_state" cache wrong, then we may 
not actually end up changing it when we should, because we think we 
already match the target state. I don't _think_ that is it, but that's the 
kind of thing that could happen.
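
To make the stale-cache scenario concrete, here is a toy model in Python. It is not kernel code: `ToyDevice`, `set_power_state` and the state constants are invented for illustration, with `cached_state` standing in for "dev->current_state". It only shows how an early-out on the cached value can skip a transition the hardware actually needs.

```python
from dataclasses import dataclass

# Labels for two PCI power states; the numeric values only matter as tags here.
D0, D3HOT = 0, 3


@dataclass
class ToyDevice:
    hw_state: int      # state the (pretend) hardware is really in
    cached_state: int  # what software believes; stand-in for dev->current_state
    writes: int = 0    # how many times we actually touched the "hardware"


def set_power_state(dev: ToyDevice, target: int) -> bool:
    """Move dev to target; return True if a hardware write was issued."""
    if dev.cached_state == target:
        # Early-out: we think we already match the target state.
        return False
    dev.hw_state = target
    dev.cached_state = target
    dev.writes += 1
    return True


# Cache agrees with the hardware: the transition is carried out.
good = ToyDevice(hw_state=D3HOT, cached_state=D3HOT)
good_issued = set_power_state(good, D0)

# Cache is stale (claims D0 while the hardware is in D3hot): nothing happens.
stale = ToyDevice(hw_state=D3HOT, cached_state=D0)
stale_issued = set_power_state(stale, D0)

if __name__ == "__main__":
    print("good :", good_issued, good.hw_state)
    print("stale:", stale_issued, stale.hw_state)
```

In the stale case the device stays where it was, and reloading the driver would not help if the wrong cached value lives outside the driver.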

Can you do a

	lspci -vvxxx -s [tg3-device]

before-and-after suspend? Is there some state that looks like it got 
corrupted?
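
If eyeballing the two dumps gets tedious, a small script can list the config-space bytes that changed. The sketch below assumes the two lspci outputs were saved to text and that the hex rows look like "00: 14 e4 ..." (offset, colon, bytes). The function names and the sample rows are made up for illustration; they are not real tg3 data.

```python
import re

# One hex row of an `lspci -x...` dump: offset, colon, space-separated bytes.
HEX_ROW = re.compile(r"^([0-9a-f]{2,3}): ((?:[0-9a-f]{2} ?)+)$")


def parse_config_dump(text):
    """Map config-space offset -> byte value, ignoring the -vv text lines."""
    config = {}
    for line in text.splitlines():
        m = HEX_ROW.match(line.rstrip())
        if not m:
            continue  # not a hex row (device header, capabilities, ...)
        base = int(m.group(1), 16)
        for i, byte in enumerate(m.group(2).split()):
            config[base + i] = int(byte, 16)
    return config


def diff_config(before, after):
    """Return [(offset, before_byte, after_byte)] for every differing offset."""
    a, b = parse_config_dump(before), parse_config_dump(after)
    return [(off, a.get(off), b.get(off))
            for off in sorted(set(a) | set(b))
            if a.get(off) != b.get(off)]


# Invented sample rows: only the byte at offset 0x04 differs.
SAMPLE_BEFORE = "00: 14 e4 7d 16 06 00 10 00 02 00 00 02 10 00 00 00\n"
SAMPLE_AFTER = "00: 14 e4 7d 16 00 00 10 00 02 00 00 02 10 00 00 00\n"

if __name__ == "__main__":
    for off, old, new in diff_config(SAMPLE_BEFORE, SAMPLE_AFTER):
        print(f"offset 0x{off:02x}: {old:02x} -> {new:02x}")
```

Offset 0x04 is the low byte of the PCI command register, so a change like the one in the sample would mean memory decoding and bus mastering got turned off.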

			Linus