Date:	Mon, 2 Feb 2009 17:42:24 -0800
From:	"Graham, David" <david.graham@...el.com>
To:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"e1000-devel@...ts.sourceforge.net" 
	<e1000-devel@...ts.sourceforge.net>,
	"devel@...ts.sourceforge.net" <devel@...ts.sourceforge.net>,
	"bonding-devel@...ts.sourceforge.net" 
	<bonding-devel@...ts.sourceforge.net>
CC:	"khorenko@...allels.com" <khorenko@...allels.com>,
	"bugme-daemon@...zilla.kernel.org" <bugme-daemon@...zilla.kernel.org>
Subject: RE: [E1000-devel] [Bugme-new] [Bug 12570] New: Bonding does not
	work over e1000e.

Hi Konstantin. 

I have been trying, but so far I have failed to reproduce the reported problem.
I have a few questions.

1) While I can't repro your problem, I can see something very similar if I don't load the bonding module with the miimon=100 parameter. From your /proc/net/bonding/bond1 dumps it looks like you are doing the right thing, but please confirm by listing exactly which bonding module parameters you load with. 
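
For reference, something along these lines in /etc/modprobe.conf (or the equivalent options on the modprobe command line) is what I have in mind; the exact syntax depends on your distro, so treat it only as an illustration:

    alias bond1 bonding
    options bonding mode=active-backup miimon=100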

2) At 11:56:01 in the report you "turn on eth2 uplink on the virtual connect bay5", and in /proc/net/bonding/bond1 immediately after that, eth2 still shows MII status *down*, which would be incorrect. Can you confirm that this snippet of the file really is in the correct place in the reported sequence - that is, that there is already a problem at this step, and that is where we should look for it? 
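
If it helps, the way I have been watching both views of the link while toggling an uplink is roughly the following (eth2 here just stands for whichever slave you are flipping):

    watch -n1 cat /proc/net/bonding/bond1      # bonding driver's (miimon) view
    ethtool eth2 | grep 'Link detected'        # driver/PHY view of the same link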

3) Could you send me your network scripts for the two slaves and the bonding interface itself (on RH systems, I think that's /etc/sysconfig/network-scripts/ifcfg-*)? They should be modeled on the sample info in <kernel>/Documentation/networking/bonding.txt, and I'd like to check them.
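
For comparison, the bonding.txt sample for a RH-style setup boils down to roughly this (the address is only a placeholder):

    # ifcfg-bond1
    DEVICE=bond1
    IPADDR=192.168.1.10
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=no

    # ifcfg-eth2 (and likewise ifcfg-eth3)
    DEVICE=eth2
    MASTER=bond1
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none
    USERCTL=no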

4) I'm probably not controlling the slave link state in the same way that you are, because on the NEC bladeserver I'm using, I bring the e1000e link-partner ports up and down through an admin console, using SW I don't understand. So far I have not been able to physically disconnect one of the (serdes) links connecting the 82571 without also disconnecting the other, as I have to pull an entire switch module to make the disconnect. Can you give me more information on what your system is, and how you can physically disconnect one link at a time? Then I might be able to get hold of a similar setup and see the problem.

5) I have been testing on 2.6.29-rc3 and 2.6.28 kernels, not the 2.6.29-rc1 on which you reported the problem. I think it's unlikely that the problem exists only in the 2.6.29-rc1 build, but I would like to know if you've had a chance to try any other build, and what the results were. Also, please let me know if you have tested with any other non-Intel 1Gb interfaces, and if you have *ever* seen bonding work properly on the system you are testing. 

6) While I can't repro your issue yet, I have made some changes very recently to the serdes link detect logic in the e1000e driver. They were written to address a separate issue, and are actually NOT in the kernel you have been testing. I also can't see how fixing that problem would fix yours. However, because those fixes concern serdes link detection, and so does your issue, they are probably worth a (long) shot. If you are comfortable trying them out, I have attached them to this email. They are also being queued for upstream, but only after some further local testing.
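
In case it saves you any time, trying them out should amount to roughly the following from the top of your kernel tree (this assumes the patches apply with -p1; adjust if not):

    patch -p1 < SerdesSM.patch
    patch -p1 < disable_dmaclkgating.patch
    patch -p1 < RemoveRXSEQ.patch
    make && make modules_install
    # reload e1000e (or reboot) and re-run the failover sequence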
 
Thanks
Dave

>-----Original Message-----
>From: Andrew Morton [mailto:akpm@...ux-foundation.org]
>Sent: Thursday, January 29, 2009 9:53 AM
>To: netdev@...r.kernel.org; e1000-devel@...ts.sourceforge.net; bonding-devel@...ts.sourceforge.net
>Cc: khorenko@...allels.com; bugme-daemon@...zilla.kernel.org
>Subject: Re: [E1000-devel] [Bugme-new] [Bug 12570] New: Bonding does not work over e1000e.
>
>
>(switched to email.  Please respond via emailed reply-to-all, not via the
>bugzilla web interface).
>
>On Thu, 29 Jan 2009 03:12:01 -0800 (PST) bugme-daemon@...zilla.kernel.org
>wrote:
>
>> http://bugzilla.kernel.org/show_bug.cgi?id=12570
>>
>>            Summary: Bonding does not work over e1000e.
>>            Product: Drivers
>>            Version: 2.5
>>      KernelVersion: 2.6.29-rc1
>>           Platform: All
>>         OS/Version: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: normal
>>           Priority: P1
>>          Component: Network
>>         AssignedTo: jgarzik@...ox.com
>>         ReportedBy: khorenko@...allels.com
>>
>>
>> Checked (failing) kernel: 2.6.29-rc1
>> Latest working kernel version: unknown
>> Earliest failing kernel version: not checked but probably any. RHEL5 kernels are also affected.
>>
>> Distribution: Enterprise Linux Enterprise Linux Server release 5.1 (Carthage)
>>
>> Hardware Environment:
>> lspci:
>> 15:00.0 Ethernet controller: Intel Corporation 82571EB Quad Port Gigabit
>> Mezzanine Adapter (rev 06)
>> 15:00.1 Ethernet controller: Intel Corporation 82571EB Quad Port Gigabit
>> Mezzanine Adapter (rev 06)
>>
>> 15:00.0 0200: 8086:10da (rev 06)
>>         Subsystem: 103c:1717
>>         Flags: bus master, fast devsel, latency 0, IRQ 154
>>         Memory at fdde0000 (32-bit, non-prefetchable) [size=128K]
>>         Memory at fdd00000 (32-bit, non-prefetchable) [size=512K]
>>         I/O ports at 6000 [size=32]
>>         [virtual] Expansion ROM at d1300000 [disabled] [size=512K]
>>         Capabilities: [c8] Power Management version 2
>>         Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
>>         Capabilities: [e0] Express Endpoint IRQ 0
>>         Capabilities: [100] Advanced Error Reporting
>>         Capabilities: [140] Device Serial Number 24-d1-78-ff-ff-78-1b-00
>>
>> 15:00.1 0200: 8086:10da (rev 06)
>>         Subsystem: 103c:1717
>>         Flags: bus master, fast devsel, latency 0, IRQ 162
>>         Memory at fdce0000 (32-bit, non-prefetchable) [size=128K]
>>         Memory at fdc00000 (32-bit, non-prefetchable) [size=512K]
>>         I/O ports at 6020 [size=32]
>>         [virtual] Expansion ROM at d1380000 [disabled] [size=512K]
>>         Capabilities: [c8] Power Management version 2
>>         Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
>>         Capabilities: [e0] Express Endpoint IRQ 0
>>         Capabilities: [100] Advanced Error Reporting
>>         Capabilities: [140] Device Serial Number 24-d1-78-ff-ff-78-1b-00
>>
>> Problem Description: Bonding does not work over NICs supported by e1000e:
>> if you break/restore the physical links of the bonding slaves one by one,
>> the network won't work anymore.
>>
>> Steps to reproduce:
>> 2 NICs supported by e1000e put into bond device (Bonding Mode: fault-tolerance
>> (active-backup)).
>> * ping to the outside node is ok
>> * physically break the link of active bond slave (1)
>> * bond detects the failure, makes another slave (2) active.
>> * ping works fine
>> * restore the connection of (1)
>> * ping works fine
>> * break the link of (2)
>> * bond detects it, reports that it makes active (1), but
>> * ping _does not_ work anymore
>>
>> Logs:
>> /var/log/messages:
>> Jan 27 11:53:29 host kernel: 0000:15:00.0: eth2: Link is Down
>> Jan 27 11:53:29 host kernel: bonding: bond1: link status definitely down for interface eth2, disabling it
>> Jan 27 11:53:29 host kernel: bonding: bond1: making interface eth3 the new active one.
>> Jan 27 11:56:37 host kernel: 0000:15:00.0: eth2: Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
>> Jan 27 11:56:37 host kernel: bonding: bond1: link status definitely up for interface eth2.
>> Jan 27 11:57:39 host kernel: 0000:15:00.1: eth3: Link is Down
>> Jan 27 11:57:39 host kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
>> Jan 27 11:57:39 host kernel: bonding: bond1: making interface eth2 the new active one.
>>
>> What was done + dumps of /proc/net/bonding/bond1:
>> ## 11:52:42
>> ##cat /proc/net/bonding/bond1
>> Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)
>>
>> Bonding Mode: fault-tolerance (active-backup)
>> Primary Slave: None
>> Currently Active Slave: eth2
>> MII Status: up
>> MII Polling Interval (ms): 100
>> Up Delay (ms): 0
>> Down Delay (ms): 0
>>
>> Slave Interface: eth2
>> MII Status: up
>> Link Failure Count: 0
>> Permanent HW addr: 00:17:a4:77:00:1c
>>
>> Slave Interface: eth3
>> MII Status: up
>> Link Failure Count: 0
>> Permanent HW addr: 00:17:a4:77:00:1e
>>
>> ## 11:53:05 shutdown eth2 uplink on the virtual connect bay5
>> ##cat /proc/net/bonding/bond1
>> Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)
>>
>> Bonding Mode: fault-tolerance (active-backup)
>> Primary Slave: None
>> Currently Active Slave: eth3
>> MII Status: up
>> MII Polling Interval (ms): 100
>> Up Delay (ms): 0
>> Down Delay (ms): 0
>>
>> Slave Interface: eth2
>> MII Status: down
>> Link Failure Count: 1
>> Permanent HW addr: 00:17:a4:77:00:1c
>>
>> Slave Interface: eth3
>> MII Status: up
>> Link Failure Count: 0
>> Permanent HW addr: 00:17:a4:77:00:1e
>>
>> ## 11:56:01 turn on eth2 uplink on the virtual connect bay5
>> ##cat /proc/net/bonding/bond1
>> Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)
>>
>> Bonding Mode: fault-tolerance (active-backup)
>> Primary Slave: None
>> Currently Active Slave: eth3
>> MII Status: up
>> MII Polling Interval (ms): 100
>> Up Delay (ms): 0
>> Down Delay (ms): 0
>>
>> Slave Interface: eth2
>> MII Status: down
>> Link Failure Count: 1
>> Permanent HW addr: 00:17:a4:77:00:1c
>>
>> Slave Interface: eth3
>> MII Status: up
>> Link Failure Count: 0
>> Permanent HW addr: 00:17:a4:77:00:1e
>>
>> ## 11:57:22 turn off eth3 uplink on the virtual connect bay5
>> ##cat /proc/net/bonding/bond1
>> Ethernet Channel Bonding Driver: v3.3.0 (June 10, 2008)
>>
>> Bonding Mode: fault-tolerance (active-backup)
>> Primary Slave: None
>> Currently Active Slave: eth2
>> MII Status: up
>> MII Polling Interval (ms): 100
>> Up Delay (ms): 0
>> Down Delay (ms): 0
>>
>> Slave Interface: eth2
>> MII Status: up
>> Link Failure Count: 1
>> Permanent HW addr: 00:17:a4:77:00:1c
>>
>> Slave Interface: eth3
>> MII Status: down
>> Link Failure Count: 1
>> Permanent HW addr: 00:17:a4:77:00:1e
>>
>
>

Download attachment "SerdesSM.patch" of type "application/octet-stream" (6332 bytes)

Download attachment "disable_dmaclkgating.patch" of type "application/octet-stream" (1776 bytes)

Download attachment "RemoveRXSEQ.patch" of type "application/octet-stream" (1223 bytes)
