netdev - Re: [E1000-devel] [Bugme-new] [Bug 12570] New: Bonding does not work over e1000e.

Message-ID: <49BFB2F0.8090704@parallels.com>
Date:	Tue, 17 Mar 2009 17:25:52 +0300
From:	Konstantin Khorenko <khorenko@...allels.com>
To:	"Graham, David" <david.graham@...el.com>
CC:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"e1000-devel@...ts.sourceforge.net" 
	<e1000-devel@...ts.sourceforge.net>,
	"devel@...ts.sourceforge.net" <devel@...ts.sourceforge.net>,
	"bonding-devel@...ts.sourceforge.net" 
	<bonding-devel@...ts.sourceforge.net>,
	"bugme-daemon@...zilla.kernel.org" <bugme-daemon@...zilla.kernel.org>
Subject: Re: [E1000-devel] [Bugme-new] [Bug 12570] New: Bonding does not work over e1000e.

Hello David,

Sorry for the huge delay. I'll try to answer your questions below.

On 02/17/2009 10:00 PM, Graham, David wrote:
> To get closer to your environment, I reconfigured my network, and same kernel & built-in driver that you used, but channel failover still works in my tests. Because this is without the recent serdes link patches that I referred to earlier, that means I don't expect them to be significant to the problem.

Unfortunately I don't have direct access to the problematic node, which is why this is taking so long.
Yesterday, at last, kernel 2.6.29-rc4 plus the patches SerdesSM.patch, disable_dmaclkgating.patch and RemoveRXSEQ.patch was tested, and failback works fine!

[root@...tname ~]# uname -a
Linux hostname 2.6.29-rc4.e1000e.ver1 #1 SMP Tue Feb 10 20:47:26 MSK 2009 x86_64 x86_64 x86_64 GNU/Linux
[root@...tname ~]#

##Took down the icbay5 uplink
[root@...tname ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1c

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:77:00:1e

##Enable icbay5 uplink
[root@...tname ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth3
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1c

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:77:00:1e

## Disable icbay6 (this is still working!!! It used to die right here.)
[root@...tname ~]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.5.0 (November 4, 2008)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth2
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1c

Slave Interface: eth3
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1e
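
For anyone who wants to compare these status dumps automatically rather than by eye, here is a minimal sketch of a parser for the text format shown above. The function name `parse_bonding` and the dictionary layout are assumptions made for this illustration. The sample text is a shortened copy of the last dump above (after icbay6 was disabled).

```python
def parse_bonding(text):
    """Parse /proc/net/bonding/<bond> style text into a nested dict.

    Bond-level fields go into the top-level dict; everything after a
    "Slave Interface:" line goes into that slave's own dict.
    """
    bond = {"slaves": {}}
    current = None  # name of the slave section being read, if any
    for line in text.splitlines():
        if ":" not in line:
            continue  # blank or unrecognised line
        # Split on the first colon only, so MAC addresses stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value
            bond["slaves"][current] = {}
        elif current is None:
            bond[key] = value
        else:
            bond["slaves"][current][key] = value
    return bond


# Shortened copy of the last status dump above.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth2
MII Status: up

Slave Interface: eth2
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1c

Slave Interface: eth3
MII Status: down
Link Failure Count: 1
Permanent HW addr: 00:17:a4:77:00:1e
"""

status = parse_bonding(SAMPLE)
print(status["Currently Active Slave"])        # eth2
print(status["slaves"]["eth3"]["MII Status"])  # down
```

On a live system the same function could be fed the contents of /proc/net/bonding/bond1 read with `open(...).read()`.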


> But there are some very significant differences in our setups, and I want to align my configuration closer to yours.
> 1) I am using a different Mezz card, with different EEPROM settings (and so features). Could you please send me "ethtool ethx" and "ethtool -e ethx" settings for the problem interfaces ? I may even spot something incorrect in the programming, but if not, I can probably use all or some of your content to make my card behave more like yours.

I've attached the file ethtool.info.gz, which contains the output of the following commands:
   # ethtool eth2
   # ethtool eth3
   # ethtool -e eth2
   # ethtool -e eth3
   # ethtool -i eth2
   # ethtool -i eth3
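
A minimal sketch of how such a file could be collected in one go, in case it is useful for repeating the capture on other nodes. The helper names (`ethtool_commands`, `run_command`, `collect`) and the "# command" header written before each output are assumptions made for this illustration, not a description of how the attached file was actually produced. Only the ethtool invocations listed above are used, and `ethtool -e` normally needs root.

```python
import gzip
import subprocess


def ethtool_commands(interfaces):
    """Return argv lists in the same order as the command list above."""
    commands = []
    for flag in (None, "-e", "-i"):  # plain settings, EEPROM dump, driver info
        for iface in interfaces:
            argv = ["ethtool"]
            if flag is not None:
                argv.append(flag)
            argv.append(iface)
            commands.append(argv)
    return commands


def run_command(argv):
    """Run one command and return stdout and stderr together as text."""
    result = subprocess.run(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    return result.stdout


def collect(interfaces, path, runner=run_command):
    """Write every command and its output into a gzip-compressed text file."""
    with gzip.open(path, "wt") as out:
        for argv in ethtool_commands(interfaces):
            out.write("# " + " ".join(argv) + "\n")
            out.write(runner(argv))
            out.write("\n")
```

Calling `collect(["eth2", "eth3"], "ethtool.info.gz")` as root on the affected node would then produce a single compressed file. The `runner` parameter exists only so the function can be exercised without ethtool installed.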

> 2) We have different link parters , and disable link in a different way.
> I tried to remove the switch modules as you did, but in my bladeserver system, couldn't. There must be some administrative command to allow the latch to unlock, but I am not familiar with it. I'll keep looking. Do you have the same (failing) result if you take the link partners down administratively from the switch console ?

Yes. The same failure occurs if we administratively take the switch down from Virtual Connect.
But we have to do it for the whole switch: Virtual Connect doesn't allow us to disable just a single port.

> FYI: here's more info & log that show's how the failover works OK on my system.
>
>       2.6.29-rc1 blade in bladeserver
>       Ping from console
>          |
>       +--------+
>       |  bond0 |  static address
>       ++------++
>        |      |
>    +---+--+  ++-----+
>    | eth2 |  | eth3 |
>    +---+--+  ++-----+
>        |      |        Serdes Backplane
>        |      |
>    +---+--+  ++-----+
>    | 5/4  |  | 6/4  | Bladeserver wwitch module/port
>    +---+--+  ++-----+
>        |      |
>     +--+------+----+
>     |  1GB switch  |  External to bladeserver
>     +-----+--------+
>           |
>     +-----+-------+
>     | ping target |
>     +-------------+

This is the same topology as our setup. We use the HP C7000 server chassis with the Virtual Connect enet-F module.

> 3) I am testing in a different chassis/backplane
> Let's address the simpler differences first, but if we go another round or two without being able to figure this out, and you are prepared to send us one of the systems with the problem for definite root cause analysis, you can contact me off-line from this bz and we'll work the detail.

Well, unfortunately sending the system to you for reproduction does not seem to be an option, but if you need or want something checked, we can arrange a WebEx session. Please let me know if this is needed.


Conclusions: well, the latest kernel with your patches does work, thank you very much, David!
Now I have to solve my original problem: getting the RHEL5-based (2.6.18-x) kernel to work.
At the moment the RHEL5 kernel is affected by 2 issues:
1) the one that seems to be fixed by updating the test kernel from 2.6.29-rc1 to rc4 plus your 3 patches.
2) when we break a link, the MII status is still reported as "up" in /proc/net/bonding/bond1
   (at the same time, bonding correctly changes the active slave to the working one).
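
The symptom in issue 2 can be checked for mechanically. Below is a minimal sketch that compares two snapshots of the status text and flags the case where the active slave has changed but the previously active slave is still reported as "up". All function names are assumptions made for this illustration, and the `snapshot` helper builds hypothetical, trimmed-down status text, not real output from the RHEL5 node. A positive result is only a hint: the active slave can also change for other reasons while both links are genuinely up.

```python
import re


def active_slave(text):
    """Return the name after 'Currently Active Slave:', or None."""
    m = re.search(r"^Currently Active Slave:[ \t]*(\S+)", text, re.MULTILINE)
    return m.group(1) if m else None


def slave_mii_status(text, iface):
    """Return the MII status line that directly follows a slave's header."""
    pattern = (r"^Slave Interface:[ \t]*" + re.escape(iface) +
               r"[ \t]*\nMII Status:[ \t]*(\S+)")
    m = re.search(pattern, text, re.MULTILINE)
    return m.group(1) if m else None


def stale_mii_after_failover(before, after):
    """True if the active slave changed but the old one still reads 'up'."""
    old, new = active_slave(before), active_slave(after)
    if old is None or new is None or old == new:
        return False
    return slave_mii_status(after, old) == "up"


def snapshot(active, eth2, eth3):
    # Hypothetical, trimmed-down status text for illustration only.
    return (
        "Currently Active Slave: %s\n"
        "MII Status: up\n"
        "\n"
        "Slave Interface: eth2\n"
        "MII Status: %s\n"
        "\n"
        "Slave Interface: eth3\n"
        "MII Status: %s\n" % (active, eth2, eth3)
    )


before = snapshot("eth3", "down", "up")
healthy = snapshot("eth2", "up", "down")  # failover, old slave marked down
stale = snapshot("eth2", "up", "up")      # failover, old slave still "up"

print(stale_mii_after_failover(before, healthy))  # False
print(stale_mii_after_failover(before, stale))    # True
```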

I understand there have been a lot of changes since 2.6.18, but rather than replacing the e1000e driver wholesale with the one from the latest mainline kernel, I would still like to try backporting just the set of patches that fix this exact issue.
Could you please help me by pointing out the patches that, from your point of view, are essential to fix this issue (and probably issue 2 as well)?

Thank you very much!

-- 
Best regards,

Konstantin Khorenko,
PVC/OpenVZ developer,
Parallels

Download attachment "ethtool.info.gz" of type "application/x-gzip" (4905 bytes)