netdev - RE: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60)

Message-ID: <F169D4F5E1F1974DBFAFABF47F60C10A069DFD62@orsmsx507.amr.corp.intel.com>
Date:	Thu, 23 Oct 2008 15:42:55 -0700
From:	"Brandeburg, Jesse" <jesse.brandeburg@...el.com>
To:	Sanjoy Mahajan <sanjoy@....EDU>,
	Jesse Brandeburg <jesse.brandeburg@...il.com>
CC:	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	NetDEV list <netdev@...r.kernel.org>,
	"e1000-devel@...ts.sourceforge.net" 
	<e1000-devel@...ts.sourceforge.net>
Subject: RE: e1000e fails after several S3 resumes (2.6.26 Debian, TP T60) 

Sanjoy Mahajan wrote:
>> There is also lots of opportunity for BIOS bugs to be affecting
>> things so please make sure that you have the latest bios.
> 
> I was about to burn the CD to update the bios to 2.23 when the failure
> recurred.  So, with the caveat that the bios is still 2.20, I've
> attached logs from ethregs and ethtool before and after
>   ethtool -r eth0
> (which fixed the dhcp).
> 
> Here is the e1000e driver version:
> 
>   $ grep e1000e /var/log/dmesg
>   [   23.988317] e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k2
>   [   23.988390] e1000e: Copyright (c) 1999-2008 Intel Corporation.
>   [   23.988505] e1000e 0000:02:00.0: Disabling L1 ASPM

Hm, does your kernel have CONFIG_PM defined?  If it happens again, please
include the output of lspci -vvv from before and after ethtool -r (see below).
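
One way to answer the CONFIG_PM question is to look the option up in the
kernel's build config.  The sketch below assumes the config is installed as
/boot/config-<release>, which Debian kernels normally do; the helper names
are made up for this example.

```python
import platform
from pathlib import Path


def kernel_option(config_text, name):
    """Return the value of a config option ("y", "m", ...) or None if unset."""
    for line in config_text.splitlines():
        line = line.strip()
        # Match "NAME=" exactly so CONFIG_PM does not match CONFIG_PM_DEBUG.
        if line.startswith(name + "="):
            return line.split("=", 1)[1]
    # Covers both "# CONFIG_PM is not set" and an option that is absent.
    return None


def running_kernel_config():
    """Path of the running kernel's config (assumed Debian-style location)."""
    return Path("/boot") / ("config-" + platform.release())
```

Usage would be something like
kernel_option(running_kernel_config().read_text(), "CONFIG_PM"), which
returns "y" when power management support is built in.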

> Here are diffs of the attached before and after logs:
> 
> --- ethtool-before.log	2008-10-23 09:14:41.000000000 -0400
> +++ ethtool-after.log	2008-10-23 09:17:54.000000000 -0400
> @@ -33,8 +33,8 @@
>        Pass MAC control frames:           don't pass
>        Receive buffer size:               2048
>  0x02808: RDLEN (Receive desc length)     0x00001000
> -0x02810: RDH   (Receive desc head)       0x000000BB
> -0x02818: RDT   (Receive desc tail)       0x000000B9
> +0x02810: RDH   (Receive desc head)       0x00000051
> +0x02818: RDT   (Receive desc tail)       0x0000004F

This indicates the device was actually receiving packets okay (RDH has
advanced) and the driver was returning buffers to hardware (RDT).

>  0x02820: RDTR  (Receive delay timer)     0x00000000
>  0x00400: TCTL (Transmit ctrl register)   0x3103F0FA
>        Transmitter:                       enabled
> @@ -42,7 +42,7 @@
>        Software XOFF Transmission:        disabled
>        Re-transmit on late collision:     enabled
>  0x03808: TDLEN (Transmit desc length)    0x00001000
> -0x03810: TDH   (Transmit desc head)      0x00000018
> -0x03818: TDT   (Transmit desc tail)      0x00000018
> +0x03810: TDH   (Transmit desc head)      0x00000075
> +0x03818: TDT   (Transmit desc tail)      0x00000075

The device also claims to be transmitting successfully, so I don't know why
the DHCP packets don't work.  Can you run tcpdump on the network or on the
DHCP server, by chance?  I'm looking to see whether the server receives the
transmits and then replies.
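
For illustration, the head/tail arithmetic behind this reading of the ring
registers can be sketched as follows.  It assumes 16-byte descriptors, so
that a RDLEN/TDLEN of 0x1000 corresponds to a 256-entry ring; the function
name is hypothetical.

```python
DESC_SIZE = 16  # bytes per descriptor (assumption: 16-byte descriptor format)


def ring_state(length_bytes, head, tail, desc_size=DESC_SIZE):
    """Return (ring entries, descriptors currently owned by hardware)."""
    entries = length_bytes // desc_size
    # Descriptors from head up to (not including) tail belong to hardware;
    # the modulo handles tail having wrapped around behind head.
    owned_by_hw = (tail - head) % entries
    return entries, owned_by_hw


# Values from the "before" log in the diff above.
rx = ring_state(0x1000, head=0xBB, tail=0xB9)
tx = ring_state(0x1000, head=0x18, tail=0x18)
print("RX: %d entries, %d buffers available to hardware" % rx)
print("TX: %d entries, %d descriptors still pending" % tx)
```

With TDH equal to TDT the transmit ring is empty, meaning hardware has
processed everything the driver queued, while the receive ring still has
nearly all of its buffers posted.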

>  	RAL[0]         52411600
>  	RAH[0]         8000de50
> -	RAL[1]         00003333
> +	RAL[1]         005e0001
>  	RAH[1]         8000fb00
> -	RAL[2]         52ff3333
> -	RAH[2]         8000de50
> -	RAL[3]         00003333
> -	RAH[3]         80000100
> -	RAL[4]         005e0001
> +	RAL[2]         00003333
> +	RAH[2]         8000fb00
> +	RAL[3]         52ff3333
> +	RAH[3]         8000de50
> +	RAL[4]         00003333
>  	RAH[4]         80000100
> -	RAL[5]         00000000
> -	RAH[5]         00000000
> +	RAL[5]         005e0001
> +	RAH[5]         80000100

After resume, one multicast address is added and one is missing from the
list of addresses the adapter will listen on.  I reordered the entries, but
here is the difference:
before:
	RAL[5]         00000000
	RAH[5]         00000000
after:
	RAL[5]         005e0001
	RAH[5]         8000fb00

I don't know which protocol added 01:00:5e:00:00:fb as a multicast address
only after suspend.
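
To make the comparison less error-prone, the RAL/RAH pairs can be decoded
into MAC addresses and compared as sets.  This is a minimal sketch using only
the entries visible in the diff above; it assumes the usual e1000 layout
(RAL holds the first four octets, least significant byte first, the low 16
bits of RAH hold the last two, and bit 31 of RAH marks the entry valid).  The
function names are made up for this example.

```python
def decode_rar(ral, rah):
    """Decode one RAL/RAH pair into (mac_string, valid_flag)."""
    octets = [(ral >> shift) & 0xFF for shift in (0, 8, 16, 24)]
    octets += [rah & 0xFF, (rah >> 8) & 0xFF]
    valid = bool(rah & 0x80000000)  # bit 31 of RAH: address valid
    return ":".join("%02x" % o for o in octets), valid


def valid_addresses(pairs):
    """Set of MAC strings for all entries whose valid bit is set."""
    decoded = (decode_rar(ral, rah) for ral, rah in pairs)
    return {mac for mac, valid in decoded if valid}


# Entries 0..5 as they appear in the before/after logs.
before = [(0x52411600, 0x8000DE50), (0x00003333, 0x8000FB00),
          (0x52FF3333, 0x8000DE50), (0x00003333, 0x80000100),
          (0x005E0001, 0x80000100), (0x00000000, 0x00000000)]
after = [(0x52411600, 0x8000DE50), (0x005E0001, 0x8000FB00),
         (0x00003333, 0x8000FB00), (0x52FF3333, 0x8000DE50),
         (0x00003333, 0x80000100), (0x005E0001, 0x80000100)]

added = valid_addresses(after) - valid_addresses(before)
print(sorted(added))
```

The address this reports is the Ethernet mapping of IPv4 multicast groups
whose low 23 bits are 0.0.251, which includes the mDNS group 224.0.0.251.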

Can you run ifconfig eth0 promisc before suspending?  I'd be curious whether
that fixes it.

>  	RAL[6]         00000000
>  	RAH[6]         00000000
>  	RAL[7]         00000000
> @@ -390,7 +390,7 @@
>  	GSCL_2         00000000
>  	GSCL_3         00000000
>  	GSCL_4         00000000
> -	FACTPS         a1041046
> +	FACTPS         21041046

The FACTPS bits are reserved in our manuals, but they have to do with PCIe
power state changes.  I can't help but wonder whether something is going
wrong with ASPM L0s or L1 on your system when coming out of resume (we have
had trouble with that feature on your laptop).  The lspci output would show
us the difference if there is one.
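
As a small aid for reading the FACTPS change, XORing the two values shows
which bit flipped.  The helper name is hypothetical, and since the bits are
reserved, no meaning is assigned to the result here.

```python
def changed_bits(before, after):
    """Return the bit positions (0 = least significant) that differ."""
    diff = before ^ after
    return [bit for bit in range(32) if diff & (1 << bit)]


# FACTPS before and after "ethtool -r eth0", from the diff above.
flipped = changed_bits(0xA1041046, 0x21041046)
print(flipped)
```

Only the top bit (bit 31) differs: it was set in the failing state and is
clear after the reset.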