Message-ID: <20007.1306283668@death>
Date: Tue, 24 May 2011 17:34:28 -0700
From: Jay Vosburgh <fubar@...ibm.com>
To: Neil Horman <nhorman@...driver.com>
cc: Nicolas de Pesloüan
<nicolas.2p.debian@...il.com>,
Andy Gospodarek <andy@...yhouse.net>, netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>
Subject: Re: [PATCH] bonding: prevent deadlock on slave store with alb mode
Neil Horman <nhorman@...driver.com> wrote:
>On Tue, May 24, 2011 at 02:12:54PM -0700, Jay Vosburgh wrote:
>> Nicolas de Pesloüan <nicolas.2p.debian@...il.com> wrote:
>>
>> >On 24/05/2011 22:37, Neil Horman wrote:
>> >
>> >>>>> + return -EINVAL;
>> >>>
>> >>> This will turn a warning into an error.
>> >>>
>> >> Yes, because it should have been an error all along.
>> >>
>> >>> This warning has existed for a long time, but it never caused the
>> >>> bonding setup to fail. This patch causes a regression for user space.
>> >>> For example, the current ifenslave-2.6 package in Debian doesn't
>> >>> ensure the bond is UP before enslaving, because this was never required.
>> >>>
>> >> That's not a regression, that's the kernel returning an error where it should have
>> >> done so all along. Just because a utility got away with it for a while and it
>> >> didn't always cause a lockup doesn't grandfather that application into a
>> >> situation where the kernel has to support its broken behavior in perpetuity.
>> >>
>> >> Besides, iirc, the ifenslave utility still uses the ioctl path, which this
>> >> patch doesn't touch, so ifenslave is currently unaffected (although I should
>> >> look in the ioctl path to see if we have already added such a check, lest you be
>> >> able to deadlock your system as previously indicated using that tool).
>> >
>> >Unfortunately, no. Recent versions of ifenslave-2.6 on Debian don't use
>> >ioctl (ifenslave binary) anymore, but only sysfs.
>> >
>> >Documentation/bonding.txt should be updated to reflect this change.
>> >pr_warning should be changed to pr_err.
>> >Bonding version should be bumped.
>> >
>> >Anyway, I will fix this package, but I suspect there exist many user
>> >scripts that don't ensure bond is up before enslaving.
>>
>> I looked at sysconfig (as supplied with opensuse) and it uses
>> sysfs, and does set the master device up first. The other potential
>> user that comes to mind is that OFED at one point had a script to set up
>> bonding for Infiniband devices. I don't know if this is still the case,
>> nor do I know if it set the bond device up before enslaving.
>>
>> Generally speaking, though, in the long run I think it should be
>> permissible to change any bonding option when the bond is down (even to
>> values that make no sense in context, e.g., setting the primary to a
>> device not currently enslaved). My rationale here is that some options
>> are very difficult to modify when the bond is up (e.g., changing the
>> mode), and now some other set is precluded when the bond is down. The
>> init scripts already have repeat logic in them; this just makes things
>> more complicated.
>>
>> There should be a state wherein any option can be changed (well,
>> maybe not max_bonds), and that should be the down state. A subset can
>> also be changed while up. I'd be happy to be able to change all options
>> while the bond is up, too, but that seems pretty hard to do.
>>
>> How much harder is it to fix the locking and permit the action
>> in question here?
>>
>In this case, just hacking something in place is pretty easy: I can just
>initialize the spinlocks for all cases in the bond_create path. But doing it in any
>sensible way is much harder, since the code is written such that you
>initialize various relevant data structures based on the mode of the bond,
>which, as you indicated above, you want the right to change up until the point
>where you ifup the bonded interface.
>
>The whole thing is predicated on the notion that
>transitioning from the down to up state is the gating factor to initializing the
>current configuration. What might work is an in-between state in which you commit
>and initialize a bond based on the current configuration. Doing so would allow
>you to (re-)initialize a bond configuration in a safe state. Only after
>committing a configuration could you enslave devices or ifup the bond. Once up,
>further commits would not be permitted until the bond was brought down again.
>Of course, this would also require changing the semantics of the user space
>tools.

One alternative is to simply permit all option changes while the
bond is down, then commit them as a set when the bond is brought up (or
fail the open or adjust options if the options are invalid). More or
less what you suggest, except that the "commit" is the ifup.

Even that is a substantial rework of how things are done now,
since as you point out, options today take effect more or less in real
time.
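
To make the "commit at ifup" idea concrete, here is a rough user-space sketch in C. It is only a model of the control flow, not bonding driver code: every name in it (sketch_bond, sketch_opts, sketch_set_primary, sketch_open and so on) is hypothetical, and it assumes for simplicity that no option can be changed while the bond is up. While down, setters accept any value without cross-checking it; the open is the single point where the staged set is validated and made active.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model only; none of these types exist in the bonding driver. */
enum sketch_mode { SKETCH_ROUNDROBIN, SKETCH_ACTIVEBACKUP, SKETCH_ALB };

struct sketch_opts {
	enum sketch_mode mode;
	char primary[16];            /* preferred slave name, may be empty */
};

struct sketch_bond {
	bool up;
	struct sketch_opts staged;   /* what user space has written so far */
	struct sketch_opts active;   /* what a running bond would use */
	const char *slaves[4];
	int nslaves;
};

/* While down, any mode may be staged; nothing is initialized yet. */
static int sketch_set_mode(struct sketch_bond *b, enum sketch_mode mode)
{
	if (b->up)
		return -EBUSY;       /* simplifying assumption of this sketch */
	b->staged.mode = mode;
	return 0;
}

/* While down, the primary may name a device that is not (yet) enslaved. */
static int sketch_set_primary(struct sketch_bond *b, const char *name)
{
	if (b->up)
		return -EBUSY;
	if (strlen(name) >= sizeof(b->staged.primary))
		return -EINVAL;
	strcpy(b->staged.primary, name);
	return 0;
}

static bool sketch_has_slave(const struct sketch_bond *b, const char *name)
{
	for (int i = 0; i < b->nslaves; i++)
		if (strcmp(b->slaves[i], name) == 0)
			return true;
	return false;
}

/* The "ifup" is the commit: validate the staged set as a whole. */
static int sketch_open(struct sketch_bond *b)
{
	if (b->staged.primary[0] != '\0' &&
	    !sketch_has_slave(b, b->staged.primary))
		return -EINVAL;      /* fail the open; adjusting is the other choice */

	b->active = b->staged;       /* mode-specific setup would go here */
	b->up = true;
	return 0;
}
```

In this model, staging a primary of "eth0" on a bond with no slaves succeeds, the open then fails with -EINVAL, and the same open succeeds once "eth0" has been added to the slave list. Whether a real implementation should fail the open or quietly adjust the option is exactly the choice left open above.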

>This also raises the question: is it or is it not safe to enslave devices while
>the bond is down? Clearly from the bug report it's unsafe, and I don't know what
>other (if any) conditions exist that cause problems when doing this (be that a
>deadlock, panic or simply undefined or unexpected behavior). If it's really
>unsafe, then issuing a warning seems incorrect; we shouldn't allow user space to
>cause things like this, and as such, we should return an error. If it is safe
>(generally) and this is an isolated bug, then we should probably remove the
>warning. But to just issue a vague 'This might do bad things' warning seems
>wrong in either case.

Agreed. I think it (conceptually) should be safe to add or
remove slaves when the bond is down, and bonding shouldn't complain
about it.

Ok, I looked through the log, and originally enslaving while
down via sysfs was disallowed when that code was added.

Later, enslaving while down was enabled in a patch for
Infiniband. Reading the changelog explains why; for IB, enslaving while
up messed up the multicast group addresses:

commit 6b1bf096508c870889c2be63c7757a04d72116fe
Author: Moni Shoua <monis@...taire.com>
Date:   Tue Oct 9 19:43:40 2007 -0700

    net/bonding: Enable IP multicast for bonding IPoIB devices

    Allow to enslave devices when the bonding device is not up. Over the discussion
    held at the previous post this seemed to be the most clean way to go, where it
    is not expected to cause instabilities.

    Normally, the bonding driver is UP before any enslavement takes place.
    Once a netdevice is UP, the network stack acts to have it join some multicast groups
    (eg the all-hosts 224.0.0.1). Now, since ether_setup() have set the bonding device
    type to be ARPHRD_ETHER and address len to be ETHER_ALEN, the net core code
    computes a wrong multicast link address. This is b/c ip_eth_mc_map() is called
    where for multicast joins taking place after the enslavement another ip_xxx_mc_map()
    is called (eg ip_ib_mc_map() when the bond type is ARPHRD_INFINIBAND)

    Signed-off-by: Moni Shoua <monis at voltaire.com>
    Signed-off-by: Or Gerlitz <ogerlitz at voltaire.com>
    Acked-by: Jay Vosburgh <fubar@...ibm.com>
    Signed-off-by: Jeff Garzik <jeff@...zik.org>

Assuming this situation hasn't changed (and I'm not sure how it
could, because the first IB slave causes a type change), I don't believe
we can disallow enslaving while the bond is down because IB depends on
it.
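
For anyone who doesn't have the mapping in their head, the sketch below shows in plain user-space C what the Ethernet-style mapping referred to in that changelog computes: the fixed 01:00:5e prefix followed by the low 23 bits of the IPv4 group address. The function name sketch_eth_mc_map and the host-byte-order argument are choices made for this illustration; this is not the kernel's ip_eth_mc_map() itself. An IPoIB link-layer address is 20 bytes long and is built differently, so a 6-byte address computed while the bond still looks like Ethernet is no use once an IB slave has changed the bond's type.

```c
#include <stdint.h>

#define SKETCH_ETH_ALEN 6

/*
 * Illustration only: map an IPv4 multicast group (host byte order here,
 * an assumption of this sketch) to an Ethernet multicast address.
 */
static void sketch_eth_mc_map(uint32_t group, uint8_t mac[SKETCH_ETH_ALEN])
{
	mac[0] = 0x01;                  /* fixed prefix 01:00:5e */
	mac[1] = 0x00;
	mac[2] = 0x5e;
	mac[3] = (group >> 16) & 0x7f;  /* only 23 bits of the group survive */
	mac[4] = (group >> 8) & 0xff;
	mac[5] = group & 0xff;
}
```

With this mapping the all-hosts group 224.0.0.1 (0xE0000001) becomes 01:00:5e:00:00:01.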

>I'll respin the patch to just initialize the spinlocks in the morning, if that's
>what will fix the deadlock, but it really seems like the wrong way to go to me.
>If enslaving devices to a bond while it's down has been known to cause problems,
>then we shouldn't allow it and we should update the user space tools to
>understand and handle that.

This is the first I've heard of it causing a panic; looking
back, I can also see some issues with the warning itself (because it
happens prior to the rtnl_trylock, the message may repeat a lot if rtnl
is contended), so I'm happy to see the warning go away either way.
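
The repeat problem is easy to see in a toy model. The C below is a user-space sketch, not the sysfs store code: sketch_trylock stands in for rtnl_trylock, SKETCH_RESTART stands in for whatever causes the write to be retried when the lock is busy, and all the names are made up for this illustration. It assumes the store is simply called again until the trylock succeeds.

```c
#include <stdbool.h>

#define SKETCH_RESTART (-1)     /* hypothetical "try the write again" code */

struct sketch_state {
	int contended_attempts; /* how many times the fake trylock fails first */
	int warnings;           /* how many times the warning was "printed" */
};

/* Stand-in for a trylock that fails while someone else holds the lock. */
static bool sketch_trylock(struct sketch_state *s)
{
	if (s->contended_attempts > 0) {
		s->contended_attempts--;
		return false;
	}
	return true;
}

/* Warning issued before the trylock: it fires on every retry. */
static int sketch_store_warn_first(struct sketch_state *s)
{
	s->warnings++;
	if (!sketch_trylock(s))
		return SKETCH_RESTART;
	return 0;
}

/* Warning issued after the trylock: it fires once per successful store. */
static int sketch_store_lock_first(struct sketch_state *s)
{
	if (!sketch_trylock(s))
		return SKETCH_RESTART;
	s->warnings++;
	return 0;
}

/* Model of one write that is retried until the store stops asking for it. */
static int sketch_write(struct sketch_state *s,
			int (*store)(struct sketch_state *))
{
	int calls = 0;
	int ret;

	do {
		ret = store(s);
		calls++;
	} while (ret == SKETCH_RESTART);
	return calls;
}
```

With three failed lock attempts, the warn-first variant counts four warnings for a single write, while the lock-first variant counts one.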
-J
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com