netdev - Re: [PATCH] bonding: add mark mode

Date:	Fri, 15 May 2009 14:58:00 -0700
From:	Jay Vosburgh <fubar@...ibm.com>
To:	Andy Gospodarek <andy@...yhouse.net>
cc:	Neil Horman <nhorman@...driver.com>, netdev@...r.kernel.org,
	bonding-devel@...ts.sourceforge.net
Subject: Re: [PATCH] bonding: add mark mode

Andy Gospodarek <andy@...yhouse.net> wrote:

>On Fri, May 01, 2009 at 02:51:42PM -0700, Jay Vosburgh wrote:
>> Andy Gospodarek <andy@...yhouse.net> wrote:
>> 
>> >On Fri, May 01, 2009 at 01:44:53PM -0700, Jay Vosburgh wrote:
>> >> Andy Gospodarek <andy@...yhouse.net> wrote:
>> >> 
>> >> >On Fri, May 01, 2009 at 12:52:16PM -0700, Jay Vosburgh wrote:
>> >> >> Neil Horman <nhorman@...driver.com> wrote:
>> >> >> 
>> >> >> >On Fri, May 01, 2009 at 01:39:16PM -0400, Andy Gospodarek wrote:
>> >> >> >> 
>> >> >> >> Quite a few people are happy with the way bonding manages to split
>> >> >> >> traffic to different members of a bond.  Many are quite disappointed
>> >> >> >> that users or administrators cannot be given more control over the
>> >> >> >> traffic distribution and would like something that they can control more
>> >> >> >> easily.  I looked at extending some of the existing modes, but the
>> >> >> >> cleanest option seemed to be one that created an additional mode to
>> >> >> >> handle this case.  I hated to create yet another mode, but the
>> >> >> >> simplicity of this mode made it a nice candidate for a new mode.  I have
>> >> >> >> decided to call this mode 'mark' (or mode 7).
>> >> >> >> 
>> >> >> >> The mark mode of bonding relies on the skb->mark field for outgoing
>> >> >> >> device selection.  Unmarked frames (ones where the mark is still zero),
>> >> >> >> will be sent by the first active enslaved device.  Any marked frames
>> >> >> >> will choose the outgoing device based on the result of the modulo of the mark
>> >> >> >> and the number of enslaved devices.  If that device is inactive
>> >> >> >> (link-down), the traffic will default back to the first active enslaved
>> >> >> >> device.  I debated how to use the mark to decide the outgoing device,
>> >> >> >> but it seemed that modulo of the mark and the number of enslaved devices
>> >> >> >> would provide the most flexibility for those who currently mark frames
>> >> >> >> for other purposes.
>> >> >> >> 
>> >> >> >> I considered some other options for choosing destination devices based
>> >> >> >> on marks, but the ones I came up with would require additional sysfs
>> >> >> >> configuration parameters and I would prefer not to add any more to an
>> >> >> >> already crowded space.
>> >> >> >> 
>> >> >> >> I've tested this on a slightly older kernel than the net-next-2.6 tree
>> >> >> >> that this patch is against by marking frames using mark and connmark
>> >> >> >> iptables options and it seems to work as I expect.
>> >> >> >> 
>> >> >> >> Signed-off-by: Andy Gospodarek <andy@...yhouse.net>
>> >> >> >> ---
>> >> >> >> 
>> >> >> >>  Documentation/networking/bonding.txt |   25 ++++++++++++++
>> >> >> >>  drivers/net/bonding/bond_main.c      |   59 ++++++++++++++++++++++++++++++++++-
>> >> >> >>  include/linux/if_bonding.h           |    1 
>> >> >> >>  3 files changed, 84 insertions(+), 1 deletion(-)
>> >> >> >> 
>> >> >> >> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
>> >> >> >> index 0876275..7a0d4c2 100644
>> >> >> >> --- a/Documentation/networking/bonding.txt
>> >> >> >> +++ b/Documentation/networking/bonding.txt
>> >> >> >> @@ -582,6 +582,31 @@ mode
>> >> >> >>  		swapped with the new curr_active_slave that was
>> >> >> >>  		chosen.
>> >> >> >>  
>> >> >> >> +	mark or 7
>> >> >> >> +
>> >> >> >> +		Mark-based policy: skbuffs that arrive to be
>> >> >> >> +		transmitted will have the mark field inspected to
>> >> >> >> +		determine the destination slave device.  When the
>> >> >> >> +		skbuff's mark is zero, the first active device in the
>> >> >> >> +		ordered list of enslaved devices will be used.  When
>> >> >> >> +		the mark is non-zero the modulo of the mark and the
>> >> >> >> +		number of enslaved devices will determine the
>> >> >> >> +		interface used for transmission.  If this device is
>> >> >> >> +		not active (link-down) then the mark will essentially
>> >> >> >> +		be ignored and the first active device in the ordered
>> >> >> >> +		list of enslaved devices will be used.
>> >> >> >> +
>> >> >> >> +		The flexibility offered with this mode allows users
>> >> >> >> +		of netfilter to move various types of traffic to
>> >> >> >> +		different slaves quite easily.  Information on this
>> >> >> >> +		can be found in the manpages for iptables/ebtables
>> >> >> >> +		as well as netfilter documentation.
>> >> >> >> +
>> >> >> >> +		Prerequisites:
>> >> >> >> +
>> >> >> >> +		1.  Without the ability to mark skbuffs this mode is
>> >> >> >> +		not useful.  Netfilter greatly aids skbuff marking.
>> >> >> >> +
>> >> >> >>  num_grat_arp
>> >> >> >>  
>> >> >> >>  	Specifies the number of gratuitous ARPs to be issued after a
>> >> >> >> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> >> >> >> index fd73836..5e1d166 100644
>> >> >> >> --- a/drivers/net/bonding/bond_main.c
>> >> >> >> +++ b/drivers/net/bonding/bond_main.c
>> >> >> >> @@ -123,7 +123,7 @@ module_param(mode, charp, 0);
>> >> >> >>  MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
>> >> >> >>  		       "1 for active-backup, 2 for balance-xor, "
>> >> >> >>  		       "3 for broadcast, 4 for 802.3ad, 5 for balance-tlb, "
>> >> >> >> -		       "6 for balance-alb");
>> >> >> >> +		       "6 for balance-alb, 7 for mark");
>> >> >> >>  module_param(primary, charp, 0);
>> >> >> >>  MODULE_PARM_DESC(primary, "Primary network device to use");
>> >> >> >>  module_param(lacp_rate, charp, 0);
>> >> >> >> @@ -175,6 +175,7 @@ const struct bond_parm_tbl bond_mode_tbl[] = {
>> >> >> >>  {	"802.3ad",		BOND_MODE_8023AD},
>> >> >> >>  {	"balance-tlb",		BOND_MODE_TLB},
>> >> >> >>  {	"balance-alb",		BOND_MODE_ALB},
>> >> >> >> +{	"mark",			BOND_MODE_MARK},
>> >> >> >>  {	NULL,			-1},
>> >> >> >>  };
>> >> >> >>  
>> >> >> >> @@ -224,6 +225,7 @@ static const char *bond_mode_name(int mode)
>> >> >> >>  		[BOND_MODE_8023AD]= "IEEE 802.3ad Dynamic link aggregation",
>> >> >> >>  		[BOND_MODE_TLB] = "transmit load balancing",
>> >> >> >>  		[BOND_MODE_ALB] = "adaptive load balancing",
>> >> >> >> +		[BOND_MODE_MARK] = "mark-based transmit balancing",
>> >> >> >>  	};
>> >> >> >>  
>> >> >> >>  	if (mode < 0 || mode > BOND_MODE_ALB)
>> >> >> >> @@ -4464,6 +4466,57 @@ out:
>> >> >> >>  	return 0;
>> >> >> >>  }
>> >> >> >>  
>> >> >> >> +static int bond_xmit_mark(struct sk_buff *skb, struct net_device *bond_dev)
>> >> >> >> +{
>> >> >> >> +	struct bonding *bond = netdev_priv(bond_dev);
>> >> >> >> +	struct slave *slave;
>> >> >> >> +	int i, slave_no, res = 1;
>> >> >> >> +
>> >> >> >> +	read_lock(&bond->lock);
>> >> >> >> +
>> >> >> >> +	if (!BOND_IS_OK(bond)) {
>> >> >> >> +		goto out;
>> >> >> >> +	}
>> >> >> >> +
>> >> >> >> +	/* Use the mark as the determining factor for which slave to
>> >> >> >> +	 * choose for transmission.  When behaving normally all should
>> >> >> >> +	 * work just fine.  When a slave that is destined to be the
>> >> >> >> +	 * transmitter of this frame is down, start at the front of the
>> >> >> >> +	 * list and find the first available slave. */
>> >> >> 
>> >> >> 	Why not simply use the N'th up slave instead of reverting to
>> >> >> slave 0 for the down slave case?  I'm guessing this has to do with
>> >> >> trying to maintain some fixed balancing of traffic even in the face of
>> >> >> slave failure.
>> >> >> 
>> >> >
>> >> >That could be a reasonable option as well -- in fact that was my first
>> >> >implementation.  I decided to do it this way, so that a user could
>> >> >consider the first device as the primary slave for unmarked traffic and
>> >> >the slave for marked traffic destined for a slave that was down rather
>> >> >than what might seem like a random choice for traffic.
>> >> 
>> >> 	Ok, understood.
>> >> 
>> >> >> >> +	slave_no = skb->mark ? skb->mark % bond->slave_cnt : 0;
>> >> >> >> +
>> >> >> >Would it be worthwhile to add a special case here (say all f's in mark) to
>> >> >> >indicate a frame should be sent out all slaves on the bond?  In the case you
>> >> >> >have traffic that might need to go to all interfaces (like maybe igmp)?
>> >> >> 
>> >> >> 	If I'm not misunderstanding the purpose of the mode, I think
>> >> >> it's etherchannel compatible (meaning that the switch has to be
>> >> >> configured), so I'm not sure why there would ever be a need to flood
>> >> >> packets to all ports.
>> >> >> 
>> >> >> 	I think this would generally be better as a special hash policy,
>> >> >> in which case both the etherchannel (balance-xor) and 802.3ad modes
>> >> >> could take advantage of it.  I'd hazard to guess that Andy thought about
>> >> >> that, too, so what was the impediment?
>> >> >> 
>> >> >
>> >> >I wasn't designing this to be etherchannel compatible.  Though it could
>> >> >work that way, my idea was to use this in an environment where slaves
>> >> >may be connected to different switches that are capable of moving all of
>> >> >the traffic to its destination if needed, but if the traffic is split
>> >> >in a predictable application specific manner the network can run more
>> >> >efficiently.  Think of this more as an active-backup mode where the
>> >> >backup links could actually be utilized.
>> >> 
>> >> 	Hmm.  If the slaves don't connect to the same places, why use
>> >> bonding at all, then?  Or, perhaps, two smaller bonds instead of one big
>> >> one.
>> >> 
>> >> 	Are you thinking of something like a load balancer back end, or
>> >> am I missing the point entirely?
>> >> 
>> >> 	And, really, active-backup using the backup links is just a load
>> >> balancing mode.  I'm not sure I see the special distinction there.
>> >> 
>> >
>> >OK, let's forget I mentioned active-backup. :)
>> 
>> 	Fair enough!
>> 
>> >The best way to think about this is a user-configured load-balancing
>> >mode.  It can work with switches that are unaware of the existence of
>> >the bond.  This marking mode transmit could also be used for xor and
>> >802.3ad if there was a desire, but my original intent (and continued
>> >desire) was to create a mode where the user controlled the load balancing and
>> >the option of using single or multiple switches could exist.  I've
>> >certainly seen network topologies where this would be useful.  I can take
>> >some time and draw one of those up if you like (I'm just pressed for
>> >time right now).
>> 
>> 	I'm pretty sure there would be interest for a mark policy in
>> 802.3ad.
>> 
>> >> >When I first started on this I considered just extending the
>> >> >active-backup mode's transmit function, but I didn't want to deal with
>> >> >the potential loss of duplicate frames in skb_bond_should_drop and I
>> >> >also didn't want to deal with the complicated failover scenarios of
>> >> >active-backup mode.
>> >> >
>> >> >It's probably the most like round-robin (in fact it's quite close), but
>> >> >I felt that this mode could eventually grow into something more useful
>> >> >(who knows how it might be extended) and deserved its own mode rather
>> >> >than dealing with the possibility of having a larger more complicated
>> >> >transmit routine in the future when another mode would be just fine.
>> >> 
>> >> 	I'm still not quite sure how this differs from balance-xor mode
>> >> with a hypothetical "xmit_hash_policy=mark" sort of arrangement, other
>> >> than, perhaps, intended purpose.
>> >> 
>> >
>> >XOR is designed to work with switches that are aware of the bond, right?
>> >This does not have the same requirement and I wanted to make sure it
>> >could work with switches that were unaware (or possibly separate
>> >switches).
>> 
>> 	Well, theoretically, yes, but all "etherchannel compatible"
>> means is that all of the slaves are set to the same MAC address, and
>> we're prepared to accept traffic for the bond from any of the slaves.
>> Your new mode does both of those, and etherchannel compatible doesn't
>> mean "won't work without etherchannel."  There's no voodoo for
>> etherchannel beyond this (it's really not very sophisticated).
>> 
>> 	The 802.3ad protocol, on the other hand, does have what you're
>> worried about: handshaking and various chit chat between the bond and the
>> switch.
>> 
>> 	The xor mode would work fine, for example, in a configuration
>> with two switches connecting to discrete networks, and arranged for the
>> hash (or mark, hint hint) to put the traffic for each network on the
>> proper slave.  If the networks converge somewhere down the line, there
>> may be loop issues, but that's not really a new thing.  If the networks
>> don't converge, then each of the switches just sees its own little
>> etherchannel compatible world (of however many slaves are plugged into
>> that switch).
>> 
>
>Jay,
>
>Sorry for the delayed response.  Here are my thoughts.
>
>My initial impression was that XOR was a bit more complex, but the
>reality is the code is not. There's really just not much to it.
>I would see no reason why the mark-based mode (mode 7) could be added
>and the code that does slave selection could be shared, so that a

	Presumably you mean "could not be added."

>mark-based hashing algorithm could be used for XOR and 802.3ad.  It
>could also be that the new mode is abandoned altogether and just the
>hashing algorithm is used (as you have suggested in previous emails).
>
>The only real problem I see with not creating a new mode is what happens
>when new options to control mark-based behavior are added with the
>intent of not using them in an 802.3ad scenario.  Initially these are
>the ones I've planned (and I'm sure they could grow from here):

	I'm happy with having hash modes that are "use at own risk in
802.3ad"; there's one there already that's selectable but not fully
standard compliant.  As long as the documentation is clear on this, I
have no issue with it.

>	mark_default - Modify the patch I've sent already to send all
>	traffic marked for an interface that is down to the specified
>	interface (rather than the 'next' in the list -- which would be
>	the new default behavior).  This would take a single device
>	argument.  (This could also just be an overload of the 'primary'
>	directive if desired.)
>
>	mark_order - This would be a list of interfaces to use as the
>	order for the marks used.  You can bet that users will find
>	mark mode/hashing useful, but would rather use interfaces in the
>	order they specify like eth2, eth0, eth1 instead of the
>	enslavement order (which would normally be eth0, eth1, eth2).
>	This parameter would take multiple arguments like
>	/sys/class/net/bondX/bonding/slaves currently does.
>
>So the decision that is left is whether we want bonding arguments like
>this:
>
>mode=balance-xor xmit_hash_policy=mark
>
>or
>
>mode=mark
>
>initially with the chance to grow into something like this:
>
>mode=balance-xor xmit_hash_policy=mark mark_default=eth0 mark_order="eth0, eth1, eth2"
>
>rather than
>
>mode=mark mark_default=eth0 mark_order="eth0, eth1, eth2"
>
>To me the shorter one seems better and leaves more room for expansion
>of features specific to the marking mode, but I can also understand
>the potential lack of interest in another bonding mode since it's just
>more code to maintain when something breaks or when changing the way
>transmission is done.

	It's not that I'm trying to avoid adding a mode, it's just that
this functionality looks to have application in multiple existing modes.
It therefore seems to me that it's better as a modifier for existing
modes (xor and 802.3ad), rather than a discrete mode (that would
interoperate only with etherchannel, and not 802.3ad).
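
	To make the "modifier rather than mode" idea concrete, here is a
minimal userspace sketch, not kernel code and not part of the patch.  It
splits the patch's arithmetic (mark modulo slave count) from the
link-down fallback, so the first half could serve as a hash policy for
any hashing mode.  The demo_* names and struct are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a bond's slave list; not the kernel's struct. */
struct demo_bond {
	int slave_cnt;       /* number of enslaved devices */
	const int *link_up;  /* link_up[i] != 0 if slave i can transmit */
};

/* Hash step: the same arithmetic as the patch; an unmarked frame
 * (mark == 0) maps to slave 0. */
static int demo_mark_hash(uint32_t mark, int slave_cnt)
{
	assert(slave_cnt > 0);
	return mark ? (int)(mark % (uint32_t)slave_cnt) : 0;
}

/* Selection step: use the hashed slave if its link is up, otherwise
 * fall back to the first slave in list order that is up.
 * Returns -1 when no slave is usable. */
static int demo_select_slave(const struct demo_bond *bond, uint32_t mark)
{
	int i, want = demo_mark_hash(mark, bond->slave_cnt);

	if (bond->link_up[want])
		return want;
	for (i = 0; i < bond->slave_cnt; i++)
		if (bond->link_up[i])
			return i;
	return -1;
}
```

	With three slaves and the middle one down, a mark of 5 selects
slave 2, while a mark of 4 (which hashes to the down slave 1) falls back
to slave 0.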

	Or, instead of "mark," create a new mode that's called
"etherchannel" and figure out a framework to have option sets for the
various slave selection gizmos (hash, mark, round-robin, anything else
coming up) apply generically to the new "etherchannel" and existing
"802.3ad" modes.
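
	One way such a framework might look, sketched in plain
userspace C under loud assumptions: every demo_* name, the packet
struct, and the table layout are hypothetical, and nothing here is the
bonding driver's actual interface.  Each slave selection scheme is a
named function in a table, and a mode simply looks up the scheme it was
configured with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Per-packet inputs a selector may consult (all fields hypothetical). */
struct demo_pkt {
	uint32_t mark;     /* stand-in for skb->mark */
	uint32_t l2_hash;  /* stand-in for a precomputed MAC-based hash */
};

typedef int (*demo_select_fn)(const struct demo_pkt *pkt,
			      unsigned int slave_cnt,
			      unsigned int *rr_state);

static int demo_sel_hash(const struct demo_pkt *pkt, unsigned int slave_cnt,
			 unsigned int *rr_state)
{
	(void)rr_state;
	assert(slave_cnt > 0);
	return (int)(pkt->l2_hash % slave_cnt);
}

static int demo_sel_mark(const struct demo_pkt *pkt, unsigned int slave_cnt,
			 unsigned int *rr_state)
{
	(void)rr_state;
	assert(slave_cnt > 0);
	return (int)(pkt->mark % slave_cnt);  /* mark 0 yields slave 0 */
}

static int demo_sel_rr(const struct demo_pkt *pkt, unsigned int slave_cnt,
		       unsigned int *rr_state)
{
	(void)pkt;
	assert(slave_cnt > 0);
	return (int)((*rr_state)++ % slave_cnt);  /* advance per packet */
}

struct demo_selector {
	const char *name;
	demo_select_fn select;
};

static const struct demo_selector demo_selectors[] = {
	{ "hash",        demo_sel_hash },
	{ "mark",        demo_sel_mark },
	{ "round-robin", demo_sel_rr },
};

/* Look a scheme up by name; returns NULL for an unknown name. */
static const struct demo_selector *demo_find_selector(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_selectors) / sizeof(demo_selectors[0]); i++)
		if (strcmp(demo_selectors[i].name, name) == 0)
			return &demo_selectors[i];
	return NULL;
}
```

	Adding another scheme then means adding one function and one
table row, with no new mode number.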

	For 802.3ad, the schemes that are not standards compliant are
either "use at own risk" or prohibited, probably on a case-by-case
basis.  E.g., just about any strict hash is probably ok, but round-robin
is probably never appropriate for 802.3ad.  Or, now that I ruminate a
bit, just print a "whoa: not standards compliant" warning and let 'em go
for it.  If people wanna run round-robin over their 802.3ad aggregations
in the privacy of their own datacenter, who am I to complain?
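
	The "warn and let 'em go for it" approach could be as small as
the following userspace sketch.  The demo_* names and the standards_ok
flag are hypothetical; which schemes deserve the flag is exactly the
case-by-case judgment described above, so the sketch leaves that to the
caller.

```c
#include <assert.h>
#include <stdio.h>

enum demo_mode { DEMO_MODE_ETHERCHANNEL, DEMO_MODE_8023AD };

/* A slave selection scheme and whether it is considered acceptable
 * for 802.3ad (a hypothetical flag set by whoever defines the scheme). */
struct demo_policy {
	const char *name;
	int standards_ok;
};

/* Never rejects a combination.  Writes a warning into warn and
 * returns 1 when a non-compliant scheme is paired with 802.3ad;
 * otherwise leaves warn empty and returns 0. */
static int demo_check_policy(enum demo_mode mode,
			     const struct demo_policy *p,
			     char *warn, size_t len)
{
	assert(p != NULL && warn != NULL && len > 0);
	warn[0] = '\0';
	if (mode == DEMO_MODE_8023AD && !p->standards_ok) {
		snprintf(warn, len,
			 "whoa: %s is not standards compliant in 802.3ad",
			 p->name);
		return 1;
	}
	return 0;
}
```

	The same scheme under the "etherchannel" mode produces no
warning, since no standard is being violated there.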

	In the above, balance-xor, balance-rr and probably broadcast are
aliases to some particular set of options for "etherchannel" mode.  I'm
not really sure what broadcast is really used for (I suspect it's not
really used at all), but the choice is either drop it or treat it
equally, so treat it equally.

	Yah, I know you're not really thinking necessarily in terms of
Etherchannel compatibility, but if all the slaves always have the same
MAC and traffic is accepted equally on any slave, that's Etherchannel
compatible.  Even if you run some weird split-peer arrangement on
discrete switch segments (and, honestly, I'm still not too clear on why
that's better than two bonds, one to each switch).

>Of course I won't deny that I think it would be fun to be 'the guy that
>created that new mode of bonding,' but I'm not sure that would really
>impress anyone anyway. :-)
>
>Let me know what you think,

	I'm kinda liking my random idea (new "etherchannel" mode) on
first thought.  I'm not yet 100% sure it's really less bad than what's
there now, though.  On the up side, the nomenclature is clearer, and,
most importantly, you still get to be famous as having created a new
mode.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com