Message-ID: <CAPc2XZDQxtDKsJ8F=Z5WoPz5qhdfi2GQVP6-ykdWAnw=JUJx3w@mail.gmail.com>
Date: Thu, 23 Aug 2012 13:23:12 +0100
From: Jeremy Brookman <jeremy.brookman@...il.com>
To: John Eaglesham <linux@...2.net>
Cc: netdev@...r.kernel.org
Subject: Re: [PATCH v8] bonding: support for IPv6 transmit hashing
Thanks for getting this in, John. Apologies for my earlier reply,
in which I hadn't spotted this revision of the patch; it looks like the
comments I made have been addressed, and all is well.
Thanks again,
Jeremy
On Wed, Aug 22, 2012 at 7:43 AM, John Eaglesham <linux@...2.net> wrote:
> From: John Eaglesham <linux@...2.net>
>
> Currently the "bonding" driver does not support load balancing outgoing
> traffic in LACP mode for IPv6 traffic. IPv4 (and TCP or UDP over IPv4)
> is already supported; this patch adds transmit hashing for IPv6 (and
> TCP or UDP over IPv6), bringing IPv6 up to par with IPv4 support in the
> bonding driver. In addition, bounds checking has been added to all
> transmit hashing functions.
>
> The algorithm chosen (xor'ing the bottom three quads of the source and
> destination addresses together, then xor'ing each byte of that result into
> the bottom byte, finally xor'ing with the last bytes of the MAC addresses)
> was selected after testing almost 400,000 unique IPv6 addresses harvested
> from server logs. This algorithm had the most even distribution for both
> big- and little-endian architectures while still using few instructions. Its
> behavior also attempts to closely match that of the IPv4 algorithm.
>
> The IPv6 flow label was intentionally not included in the hash as it appears
> to be unset in the vast majority of IPv6 traffic sampled, and the current
> algorithm, which does not use the flow label, already offers a very even
> distribution.
>
> Fragmented IPv6 packets are handled the same way as fragmented IPv4 packets,
> i.e., they are not balanced based on layer 4 information. Additionally,
> IPv6 packets with intermediate headers are not balanced based on layer
> 4 information. In practice these intermediate headers are not common and
> this should not cause any problems, and the alternative (a packet-parsing
> loop and look-up table) seemed slow and complicated for little gain.
>
> Tested-by: John Eaglesham <linux@...2.net>
> Signed-off-by: John Eaglesham <linux@...2.net>
>
> ---
>
> Changes:
> v2)
> * Clarify description
> * Add bounds checking to more functions
> * All functions call bond_xmit_hash_policy_l2 rather than
> re-implement the same logic.
> v3)
> * Patch against net-next.
> * Style corrections.
> v4)
> * Correct indenting.
> v5)
> * Squash documentation and code patches into one.
> v6)
> * Modify IPv6 hash to behave more like the IPv4 hash, update
> documentation with modified algorithm.
> * Clean up formatting.
> * Move all variable declaration to the top of the function.
> * Minor change to IPv6 layer 4 hash to match IPv4 hash behavior
> (mix all hashed address bits together rather than just the
> bottom 24 bits).
> v7)
> * Improve bounds checking code (handle truncated IPv6 header,
> removed goto, fewer if statements).
> * Re-write pseudocode in documentation to match actual code more
> closely.
> * Correct indenting, align parentheses, wrap code at <= 80 columns
> (based on Jay's changes).
> v8)
> * Correct patch submission format.
>
> Documentation/networking/bonding.txt | 30 ++++++++++--
> drivers/net/bonding/bond_main.c | 89 +++++++++++++++++++++++++-----------
> 2 files changed, 88 insertions(+), 31 deletions(-)
>
> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
> index 6b1c711..10a015c 100644
> --- a/Documentation/networking/bonding.txt
> +++ b/Documentation/networking/bonding.txt
> @@ -752,12 +752,22 @@ xmit_hash_policy
> protocol information to generate the hash.
>
> Uses XOR of hardware MAC addresses and IP addresses to
> - generate the hash. The formula is
> + generate the hash. The IPv4 formula is
>
> (((source IP XOR dest IP) AND 0xffff) XOR
> ( source MAC XOR destination MAC ))
> modulo slave count
>
> + The IPv6 formula is
> +
> + hash = (source ip quad 2 XOR dest IP quad 2) XOR
> + (source ip quad 3 XOR dest IP quad 3) XOR
> + (source ip quad 4 XOR dest IP quad 4)
> +
> + (((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
> + XOR (source MAC XOR destination MAC))
> + modulo slave count
> +
> This algorithm will place all traffic to a particular
> network peer on the same slave. For non-IP traffic,
> the formula is the same as for the layer2 transmit
> @@ -778,19 +788,29 @@ xmit_hash_policy
> slaves, although a single connection will not span
> multiple slaves.
>
> - The formula for unfragmented TCP and UDP packets is
> + The formula for unfragmented IPv4 TCP and UDP packets is
>
> ((source port XOR dest port) XOR
> ((source IP XOR dest IP) AND 0xffff)
> modulo slave count
>
> - For fragmented TCP or UDP packets and all other IP
> - protocol traffic, the source and destination port
> + The formula for unfragmented IPv6 TCP and UDP packets is
> +
> + hash = (source port XOR dest port) XOR
> + ((source ip quad 2 XOR dest IP quad 2) XOR
> + (source ip quad 3 XOR dest IP quad 3) XOR
> + (source ip quad 4 XOR dest IP quad 4))
> +
> + ((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
> + modulo slave count
> +
> + For fragmented TCP or UDP packets and all other IPv4 and
> + IPv6 protocol traffic, the source and destination port
> information is omitted. For non-IP traffic, the
> formula is the same as for the layer2 transmit hash
> policy.
>
> - This policy is intended to mimic the behavior of
> + The IPv4 policy is intended to mimic the behavior of
> certain switches, notably Cisco switches with PFC2 as
> well as some Foundry and IBM products.
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index d95fbc3..4221e57 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -3354,56 +3354,93 @@ static struct notifier_block bond_netdev_notifier = {
> /*---------------------------- Hashing Policies -----------------------------*/
>
> /*
> + * Hash for the output device based upon layer 2 data
> + */
> +static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
> +{
> + struct ethhdr *data = (struct ethhdr *)skb->data;
> +
> + if (skb_headlen(skb) >= offsetof(struct ethhdr, h_proto))
> + return (data->h_dest[5] ^ data->h_source[5]) % count;
> +
> + return 0;
> +}
> +
> +/*
> * Hash for the output device based upon layer 2 and layer 3 data. If
> - * the packet is not IP mimic bond_xmit_hash_policy_l2()
> + * the packet is not IP, fall back on bond_xmit_hash_policy_l2()
> */
> static int bond_xmit_hash_policy_l23(struct sk_buff *skb, int count)
> {
> struct ethhdr *data = (struct ethhdr *)skb->data;
> - struct iphdr *iph = ip_hdr(skb);
> -
> - if (skb->protocol == htons(ETH_P_IP)) {
> + struct iphdr *iph;
> + struct ipv6hdr *ipv6h;
> + u32 v6hash;
> + __be32 *s, *d;
> +
> + if (skb->protocol == htons(ETH_P_IP) &&
> + skb_network_header_len(skb) >= sizeof(*iph)) {
> + iph = ip_hdr(skb);
> return ((ntohl(iph->saddr ^ iph->daddr) & 0xffff) ^
> (data->h_dest[5] ^ data->h_source[5])) % count;
> + } else if (skb->protocol == htons(ETH_P_IPV6) &&
> + skb_network_header_len(skb) >= sizeof(*ipv6h)) {
> + ipv6h = ipv6_hdr(skb);
> + s = &ipv6h->saddr.s6_addr32[0];
> + d = &ipv6h->daddr.s6_addr32[0];
> + v6hash = (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
> + v6hash ^= (v6hash >> 24) ^ (v6hash >> 16) ^ (v6hash >> 8);
> + return (v6hash ^ data->h_dest[5] ^ data->h_source[5]) % count;
> }
>
> - return (data->h_dest[5] ^ data->h_source[5]) % count;
> + return bond_xmit_hash_policy_l2(skb, count);
> }
>
> /*
> * Hash for the output device based upon layer 3 and layer 4 data. If
> * the packet is a frag or not TCP or UDP, just use layer 3 data. If it is
> - * altogether not IP, mimic bond_xmit_hash_policy_l2()
> + * altogether not IP, fall back on bond_xmit_hash_policy_l2()
> */
> static int bond_xmit_hash_policy_l34(struct sk_buff *skb, int count)
> {
> - struct ethhdr *data = (struct ethhdr *)skb->data;
> - struct iphdr *iph = ip_hdr(skb);
> - __be16 *layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
> - int layer4_xor = 0;
> -
> - if (skb->protocol == htons(ETH_P_IP)) {
> + u32 layer4_xor = 0;
> + struct iphdr *iph;
> + struct ipv6hdr *ipv6h;
> + __be32 *s, *d;
> + __be16 *layer4hdr;
> +
> + if (skb->protocol == htons(ETH_P_IP) &&
> + skb_network_header_len(skb) >= sizeof(*iph)) {
> + iph = ip_hdr(skb);
> if (!ip_is_fragment(iph) &&
> (iph->protocol == IPPROTO_TCP ||
> - iph->protocol == IPPROTO_UDP)) {
> - layer4_xor = ntohs((*layer4hdr ^ *(layer4hdr + 1)));
> + iph->protocol == IPPROTO_UDP) &&
> + (skb_headlen(skb) - skb_network_offset(skb) >=
> + iph->ihl * sizeof(u32) + sizeof(*layer4hdr) * 2)) {
> + layer4hdr = (__be16 *)((u32 *)iph + iph->ihl);
> + layer4_xor = ntohs(*layer4hdr ^ *(layer4hdr + 1));
> }
> return (layer4_xor ^
> ((ntohl(iph->saddr ^ iph->daddr)) & 0xffff)) % count;
> -
> + } else if (skb->protocol == htons(ETH_P_IPV6) &&
> + skb_network_header_len(skb) >= sizeof(*ipv6h)) {
> + ipv6h = ipv6_hdr(skb);
> + if ((ipv6h->nexthdr == IPPROTO_TCP ||
> + ipv6h->nexthdr == IPPROTO_UDP) &&
> + (skb_headlen(skb) - skb_network_offset(skb) >=
> + sizeof(*ipv6h) + sizeof(*layer4hdr) * 2)) {
> + layer4hdr = (__be16 *)(ipv6h + 1);
> + layer4_xor = ntohs(*layer4hdr ^ *(layer4hdr + 1));
> + }
> + s = &ipv6h->saddr.s6_addr32[0];
> + d = &ipv6h->daddr.s6_addr32[0];
> + layer4_xor ^= (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
> + layer4_xor ^= (layer4_xor >> 24) ^ (layer4_xor >> 16) ^
> + (layer4_xor >> 8);
> + return layer4_xor % count;
> }
>
> - return (data->h_dest[5] ^ data->h_source[5]) % count;
> -}
> -
> -/*
> - * Hash for the output device based upon layer 2 data
> - */
> -static int bond_xmit_hash_policy_l2(struct sk_buff *skb, int count)
> -{
> - struct ethhdr *data = (struct ethhdr *)skb->data;
> -
> - return (data->h_dest[5] ^ data->h_source[5]) % count;
> + return bond_xmit_hash_policy_l2(skb, count);
> }
>
> /*-------------------------- Device entry points ----------------------------*/
> --
> 1.7.11
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
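
For readers who want to try the IPv6 arithmetic outside the kernel, here is a minimal userspace sketch of the two hashes from the quoted patch. It is not the driver code: the helper names (v6_fold, l23_v6_slot, l34_v6_slot) are hypothetical, the address words are assumed to be passed in exactly as they would be loaded from the header (the patch XORs the raw __be32 words without byte swapping), and the ports are assumed to be already in host byte order. None of the patch's skb bounds checks are reproduced.

```c
#include <stdint.h>

/*
 * Hypothetical helper: XOR words 1..3 of the source and destination
 * addresses, then fold the upper bytes down so the low byte depends on
 * every byte of the result. This follows the arithmetic in the IPv6
 * branch of bond_xmit_hash_policy_l23() as quoted above.
 */
static uint32_t v6_fold(const uint32_t s[4], const uint32_t d[4])
{
	uint32_t h = (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);

	h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);
	return h;
}

/* Layer 2+3 style: mix in the last byte of each MAC address. */
static int l23_v6_slot(const uint32_t s[4], const uint32_t d[4],
		       uint8_t src_mac_last, uint8_t dst_mac_last, int count)
{
	uint32_t h = v6_fold(s, d) ^ dst_mac_last ^ src_mac_last;

	return (int)(h % (uint32_t)count);
}

/*
 * Layer 3+4 style: start from the XOR of the ports, mix in the address
 * words, then fold. Pass 0 for both ports to model fragments or non
 * TCP/UDP traffic, where the patch leaves layer4_xor at 0.
 */
static int l34_v6_slot(const uint32_t s[4], const uint32_t d[4],
		       uint16_t sport, uint16_t dport, int count)
{
	uint32_t h = (uint32_t)(sport ^ dport);

	h ^= (s[1] ^ d[1]) ^ (s[2] ^ d[2]) ^ (s[3] ^ d[3]);
	h ^= (h >> 24) ^ (h >> 16) ^ (h >> 8);
	return (int)(h % (uint32_t)count);
}
```

As a worked case, with address words {0x20010db8, 0, 0, 1} and {0x20010db8, 0, 0, 2}, MAC last bytes 0x10 and 0x21, and four slaves, the layer 2+3 sketch picks slave 2; with ports 80 and 0x1234 the layer 3+4 sketch picks slave 1. Note that word 0 of each address never enters either hash, matching the "bottom three quads" wording in the commit message.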