[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <26343.1354064725@death.nxdomain>
Date: Tue, 27 Nov 2012 17:05:25 -0800
From: Jay Vosburgh <fubar@...ibm.com>
To: Jiri Bohac <jbohac@...e.cz>
cc: Andy Gospodarek <andy@...yhouse.net>, netdev@...r.kernel.org
Subject: Re: [PATCH][RESEND] bonding: delete migrated IP addresses from the rlb hash table
Jiri Bohac <jbohac@...e.cz> wrote:
>Hi,
>
>This is another resend of the patch discussed in June. The only
>changes over the previous version are improved comments.
>
>Bonding with balance_rlb keeps poisoning other machines' ARP
>caches and I whink we need to fix this.
>
>On Thu, Jun 21, 2012 at 04:05:19PM -0700, Jay Vosburgh wrote:
>> Jiri Bohac <jbohac@...e.cz> wrote:
>>
>> >Hi, this is a resend of the patch discussed here:
>> > http://thread.gmane.org/gmane.linux.network/228076
>> >It has been updated to apply to the lastest net-next.
>> [...]
>> >The hash table is hashed by ip_dst. To be able to do the above
>> >check efficiently (not walking the whole hash table), we need a
>> >reverse mapping (by ip_src).
>>
>> Just a note that I'm doing some testing with this patch. Seems
>> to be ok for the "direct" case (wherein the IP in question is assigned
>> to the local system); I haven't tried the "bridge" case yet. I've
>> extended some of the debugfs stuff to dump the new information, and I'm
>> trying some of the corner cases (e.g., breaking the linkages in the
>> middle) to see if it all hangs together.
>
>Were there any results of your testing? Good or bad?
I did test it quite a bit (and then neglected to follow up). I
tried various deliberate hash collisions to try and make it fail in a
corner case, but was unable to induce incorrect behavior.
>> I am thinking that the layout of the "hash"-ish table is now
>> sufficiently complicated that there should be a comment block somewhere
>> describing what's going on (because I didn't really quite get it until I
>> dumped the whole thing and looked at it). With this patch, there is one
>> "used" linkage for all of the elements in use, plus some number of "src"
>> linkages, one for each active source hash. The "src" linkages are also
>> notable in that they are separate from the "assigned" state.
>
>I updated the comments in drivers/net/bonding/bond_alb.h to
>describe the structure.
>
>> >+ * have a dirrerent mac_src.
>>
>> Typo here; should be "different."
>
>Fixed.
>Any chance we could finally get this merged?:
The only issue I see is that a number of added lines run past 80
columns, e.g.,
+ if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
+ /* ip_src is going to be updated, fix the src hash list */
+ u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
+ * sending out client updates with this IP address and the old MAC address.
+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
... and so on. I did not compile and test this version, just
applied it and inspected it; presumably it is functionally identical to
the prior version. There's also one typo I noted near the end of the
patch.
-J
>Bonding in balance-alb mode records information from ARP packets
>passing through the bond in a hash table (rx_hashtbl).
>
>At certain situations (e.g. link change of a slave),
>rlb_update_rx_clients() will send out ARP packets to update ARP
>caches of other hosts on the network to achieve RX load
>balancing.
>
>The problem is that once an IP address is recorded in the hash
>table, it stays there indefinitely. If this IP address is
>migrated to a different host in the network, bonding still sends
>out ARP packets that poison other systems' ARP caches with
>invalid information.
>
>This patch solves this by looking at all incoming ARP packets,
>and checking if the source IP address is one of the source
>addresses stored in the rx_hashtbl. If it is, but the MAC
>addresses differ, the corresponding hash table entries are
>removed. Thus, when an IP address is migrated, the first ARP
>broadcast by its new owner will purge the offending entries of
>rx_hashtbl.
>
>The hash table is hashed by ip_dst. To be able to do the above
>check efficiently (not walking the whole hash table), we need a
>reverse mapping (by ip_src).
>
>I added three new members in struct rlb_client_info:
> rx_hashtbl[x].src_first will point to the start of a list of
> entries for which hash(ip_src) == x.
> The list is linked with src_next and src_prev.
>
>When an incoming ARP packet arrives at rlb_arp_recv()
>rlb_purge_src_ip() can quickly walk only the entries on the
>corresponding lists, i.e. the entries that are likely to contain
>the offending IP address.
>
>To avoid confusion, I renamed these existing fields of struct
>rlb_client_info:
> next -> used_next
> prev -> used_prev
> rx_hashtbl_head -> rx_hashtbl_used_head
>
>(The current linked list is _not_ a list of hash table
>entries with colliding ip_dst. It's a list of entries that are
>being used; its purpose is to avoid walking the whole hash table
>when looking for used entries.)
>
>Signed-off-by: Jiri Bohac <jbohac@...e.cz>
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index e15cc11..8505a24 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -84,6 +84,9 @@ static inline struct arp_pkt *arp_pkt(const struct sk_buff *skb)
>
> /* Forward declaration */
> static void alb_send_learning_packets(struct slave *slave, u8 mac_addr[]);
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp);
>+static void rlb_src_unlink(struct bonding *bond, u32 index);
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash);
>
> static inline u8 _simple_hash(const u8 *hash_start, int hash_size)
> {
>@@ -354,6 +357,17 @@ static int rlb_arp_recv(const struct sk_buff *skb, struct bonding *bond,
> if (!arp)
> goto out;
>
>+ /* We received an ARP from arp->ip_src.
>+ * We might have used this IP address previously (on the bonding host
>+ * itself or on a system that is bridged together with the bond).
>+ * However, if arp->mac_src is different than what is stored in
>+ * rx_hashtbl, some other host is now using the IP and we must prevent
>+ * sending out client updates with this IP address and the old MAC address.
>+ * Clean up all hash table entries that have this address as ip_src but
>+ * have a different mac_src.
>+ */
>+ rlb_purge_src_ip(bond, arp);
>+
> if (arp->op_code == htons(ARPOP_REPLY)) {
> /* update rx hash table for this ARP */
> rlb_update_entry_from_arp(bond, arp);
>@@ -432,9 +446,9 @@ static void rlb_clear_slave(struct bonding *bond, struct slave *slave)
> _lock_rx_hashtbl_bh(bond);
>
> rx_hash_table = bond_info->rx_hashtbl;
>- index = bond_info->rx_hashtbl_head;
>+ index = bond_info->rx_hashtbl_used_head;
> for (; index != RLB_NULL_INDEX; index = next_index) {
>- next_index = rx_hash_table[index].next;
>+ next_index = rx_hash_table[index].used_next;
> if (rx_hash_table[index].slave == slave) {
> struct slave *assigned_slave = rlb_next_rx_slave(bond);
>
>@@ -519,8 +533,8 @@ static void rlb_update_rx_clients(struct bonding *bond)
>
> _lock_rx_hashtbl_bh(bond);
>
>- hash_index = bond_info->rx_hashtbl_head;
>- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+ hash_index = bond_info->rx_hashtbl_used_head;
>+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> client_info = &(bond_info->rx_hashtbl[hash_index]);
> if (client_info->ntt) {
> rlb_update_client(client_info);
>@@ -548,8 +562,8 @@ static void rlb_req_update_slave_clients(struct bonding *bond, struct slave *sla
>
> _lock_rx_hashtbl_bh(bond);
>
>- hash_index = bond_info->rx_hashtbl_head;
>- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+ hash_index = bond_info->rx_hashtbl_used_head;
>+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> if ((client_info->slave == slave) &&
>@@ -578,8 +592,8 @@ static void rlb_req_update_subnet_clients(struct bonding *bond, __be32 src_ip)
>
> _lock_rx_hashtbl(bond);
>
>- hash_index = bond_info->rx_hashtbl_head;
>- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+ hash_index = bond_info->rx_hashtbl_used_head;
>+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> client_info = &(bond_info->rx_hashtbl[hash_index]);
>
> if (!client_info->slave) {
>@@ -625,6 +639,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> /* update mac address from arp */
> memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
> }
>+ memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
>
> assigned_slave = client_info->slave;
> if (assigned_slave) {
>@@ -647,6 +662,13 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> assigned_slave = rlb_next_rx_slave(bond);
>
> if (assigned_slave) {
>+ if (!(client_info->assigned && client_info->ip_src == arp->ip_src)) {
>+ /* ip_src is going to be updated, fix the src hash list */
>+ u32 hash_src = _simple_hash((u8 *)&arp->ip_src, sizeof(arp->ip_src));
>+ rlb_src_unlink(bond, hash_index);
>+ rlb_src_link(bond, hash_src, hash_index);
>+ }
>+
> client_info->ip_src = arp->ip_src;
> client_info->ip_dst = arp->ip_dst;
> /* arp->mac_dst is broadcast for arp reqeusts.
>@@ -654,6 +676,7 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> * upon receiving an arp reply.
> */
> memcpy(client_info->mac_dst, arp->mac_dst, ETH_ALEN);
>+ memcpy(client_info->mac_src, arp->mac_src, ETH_ALEN);
> client_info->slave = assigned_slave;
>
> if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast)) {
>@@ -669,11 +692,11 @@ static struct slave *rlb_choose_channel(struct sk_buff *skb, struct bonding *bon
> }
>
> if (!client_info->assigned) {
>- u32 prev_tbl_head = bond_info->rx_hashtbl_head;
>- bond_info->rx_hashtbl_head = hash_index;
>- client_info->next = prev_tbl_head;
>+ u32 prev_tbl_head = bond_info->rx_hashtbl_used_head;
>+ bond_info->rx_hashtbl_used_head = hash_index;
>+ client_info->used_next = prev_tbl_head;
> if (prev_tbl_head != RLB_NULL_INDEX) {
>- bond_info->rx_hashtbl[prev_tbl_head].prev =
>+ bond_info->rx_hashtbl[prev_tbl_head].used_prev =
> hash_index;
> }
> client_info->assigned = 1;
>@@ -740,8 +763,8 @@ static void rlb_rebalance(struct bonding *bond)
> _lock_rx_hashtbl_bh(bond);
>
> ntt = 0;
>- hash_index = bond_info->rx_hashtbl_head;
>- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+ hash_index = bond_info->rx_hashtbl_used_head;
>+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> client_info = &(bond_info->rx_hashtbl[hash_index]);
> assigned_slave = rlb_next_rx_slave(bond);
> if (assigned_slave && (client_info->slave != assigned_slave)) {
>@@ -759,11 +782,113 @@ static void rlb_rebalance(struct bonding *bond)
> }
>
> /* Caller must hold rx_hashtbl lock */
>+static void rlb_init_table_entry_dst(struct rlb_client_info *entry)
>+{
>+ entry->used_next = RLB_NULL_INDEX;
>+ entry->used_prev = RLB_NULL_INDEX;
>+ entry->assigned = 0;
>+ entry->slave = NULL;
>+ entry->tag = 0;
>+}
>+static void rlb_init_table_entry_src(struct rlb_client_info *entry)
>+{
>+ entry->src_first = RLB_NULL_INDEX;
>+ entry->src_prev = RLB_NULL_INDEX;
>+ entry->src_next = RLB_NULL_INDEX;
>+}
>+
> static void rlb_init_table_entry(struct rlb_client_info *entry)
> {
> memset(entry, 0, sizeof(struct rlb_client_info));
>- entry->next = RLB_NULL_INDEX;
>- entry->prev = RLB_NULL_INDEX;
>+ rlb_init_table_entry_dst(entry);
>+ rlb_init_table_entry_src(entry);
>+}
>+
>+static void rlb_delete_table_entry_dst(struct bonding *bond, u32 index)
>+{
>+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+ u32 next_index = bond_info->rx_hashtbl[index].used_next;
>+ u32 prev_index = bond_info->rx_hashtbl[index].used_prev;
>+
>+ if (index == bond_info->rx_hashtbl_used_head)
>+ bond_info->rx_hashtbl_used_head = next_index;
>+ if (prev_index != RLB_NULL_INDEX)
>+ bond_info->rx_hashtbl[prev_index].used_next = next_index;
>+ if (next_index != RLB_NULL_INDEX)
>+ bond_info->rx_hashtbl[next_index].used_prev = prev_index;
>+}
>+
>+/* unlink a rlb hash table entry from the src list */
>+static void rlb_src_unlink(struct bonding *bond, u32 index)
>+{
>+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+ u32 next_index = bond_info->rx_hashtbl[index].src_next;
>+ u32 prev_index = bond_info->rx_hashtbl[index].src_prev;
>+
>+ bond_info->rx_hashtbl[index].src_next = RLB_NULL_INDEX;
>+ bond_info->rx_hashtbl[index].src_prev = RLB_NULL_INDEX;
>+
>+ if (next_index != RLB_NULL_INDEX)
>+ bond_info->rx_hashtbl[next_index].src_prev = prev_index;
>+
>+ if (prev_index == RLB_NULL_INDEX)
>+ return;
>+
>+ /* is prev_index pointing to the head of this list? */
>+ if (bond_info->rx_hashtbl[prev_index].src_first == index)
>+ bond_info->rx_hashtbl[prev_index].src_first = next_index;
>+ else
>+ bond_info->rx_hashtbl[prev_index].src_next = next_index;
>+
>+}
>+
>+static void rlb_delete_table_entry(struct bonding *bond, u32 index)
>+{
>+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+ struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+
>+ rlb_delete_table_entry_dst(bond, index);
>+ rlb_init_table_entry_dst(entry);
>+
>+ rlb_src_unlink(bond, index);
>+}
>+
>+/* add the rx_hashtbl[ip_dst_hash] entry to the list
>+ * of entries with identical ip_src_hash
>+ */
>+static void rlb_src_link(struct bonding *bond, u32 ip_src_hash, u32 ip_dst_hash)
>+{
>+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+ u32 next;
>+
>+ bond_info->rx_hashtbl[ip_dst_hash].src_prev = ip_src_hash;
>+ next = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+ bond_info->rx_hashtbl[ip_dst_hash].src_next = next;
>+ if (next != RLB_NULL_INDEX)
>+ bond_info->rx_hashtbl[next].src_prev = ip_dst_hash;
>+ bond_info->rx_hashtbl[ip_src_hash].src_first = ip_dst_hash;
>+}
>+
>+/* deletes all rx_hashtbl entries with arp->ip_src if their mac_src does
>+ * not match arp->mac_src */
>+static void rlb_purge_src_ip(struct bonding *bond, struct arp_pkt *arp)
>+{
>+ struct alb_bond_info *bond_info = &(BOND_ALB_INFO(bond));
>+ u32 ip_src_hash = _simple_hash((u8*)&(arp->ip_src), sizeof(arp->ip_src));
>+ u32 index;
>+
>+ _lock_rx_hashtbl_bh(bond);
>+
>+ index = bond_info->rx_hashtbl[ip_src_hash].src_first;
>+ while (index != RLB_NULL_INDEX) {
>+ struct rlb_client_info *entry = &(bond_info->rx_hashtbl[index]);
>+ u32 next_index = entry->src_next;
>+ if (entry->ip_src == arp->ip_src &&
>+ !ether_addr_equal_64bits(arp->mac_src, entry->mac_src))
>+ rlb_delete_table_entry(bond, index);
>+ index = next_index;
>+ }
>+ _unlock_rx_hashtbl_bh(bond);
> }
>
> static int rlb_initialize(struct bonding *bond)
>@@ -781,7 +906,7 @@ static int rlb_initialize(struct bonding *bond)
>
> bond_info->rx_hashtbl = new_hashtbl;
>
>- bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+ bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> for (i = 0; i < RLB_HASH_TABLE_SIZE; i++) {
> rlb_init_table_entry(bond_info->rx_hashtbl + i);
>@@ -803,7 +928,7 @@ static void rlb_deinitialize(struct bonding *bond)
>
> kfree(bond_info->rx_hashtbl);
> bond_info->rx_hashtbl = NULL;
>- bond_info->rx_hashtbl_head = RLB_NULL_INDEX;
>+ bond_info->rx_hashtbl_used_head = RLB_NULL_INDEX;
>
> _unlock_rx_hashtbl_bh(bond);
> }
>@@ -815,25 +940,13 @@ static void rlb_clear_vlan(struct bonding *bond, unsigned short vlan_id)
>
> _lock_rx_hashtbl_bh(bond);
>
>- curr_index = bond_info->rx_hashtbl_head;
>+ curr_index = bond_info->rx_hashtbl_used_head;
> while (curr_index != RLB_NULL_INDEX) {
> struct rlb_client_info *curr = &(bond_info->rx_hashtbl[curr_index]);
>- u32 next_index = bond_info->rx_hashtbl[curr_index].next;
>- u32 prev_index = bond_info->rx_hashtbl[curr_index].prev;
>-
>- if (curr->tag && (curr->vlan_id == vlan_id)) {
>- if (curr_index == bond_info->rx_hashtbl_head) {
>- bond_info->rx_hashtbl_head = next_index;
>- }
>- if (prev_index != RLB_NULL_INDEX) {
>- bond_info->rx_hashtbl[prev_index].next = next_index;
>- }
>- if (next_index != RLB_NULL_INDEX) {
>- bond_info->rx_hashtbl[next_index].prev = prev_index;
>- }
>+ u32 next_index = bond_info->rx_hashtbl[curr_index].used_next;
>
>- rlb_init_table_entry(curr);
>- }
>+ if (curr->tag && (curr->vlan_id == vlan_id))
>+ rlb_delete_table_entry(bond, curr_index);
>
> curr_index = next_index;
> }
>diff --git a/drivers/net/bonding/bond_alb.h b/drivers/net/bonding/bond_alb.h
>index 90f140a..de831ba 100644
>--- a/drivers/net/bonding/bond_alb.h
>+++ b/drivers/net/bonding/bond_alb.h
>@@ -94,15 +94,35 @@ struct tlb_client_info {
>
> /* -------------------------------------------------------------------------
> * struct rlb_client_info contains all info related to a specific rx client
>- * connection. This is the Clients Hash Table entry struct
>+ * connection. This is the Clients Hash Table entry struct.
>+ * Note that this is not a proper hash table; if a new client's IP address
>+ * hash collides with an existing client entry, the old entry is replaced.
>+ *
>+ * There is a linked list (linked by the used_next and used_prev members)
>+ * linking all the used entries of the hash table. This allows updating
>+ * all the clients without walking over all the unused elements of the table.
>+ *
>+ * There are also linked lists of entries with identical hash(ip_src). These
>+ * allow cleaning up the table from ip_src<->mac_src associatins that have
Typo here, "associations"
>+ * become outdated and would cause sending out invalid ARP updates to the
>+ * network. These are linked by the (src_next and src_prev members).
> * -------------------------------------------------------------------------
> */
> struct rlb_client_info {
> __be32 ip_src; /* the server IP address */
> __be32 ip_dst; /* the client IP address */
>+ u8 mac_src[ETH_ALEN]; /* the server MAC address */
> u8 mac_dst[ETH_ALEN]; /* the client MAC address */
>- u32 next; /* The next Hash table entry index */
>- u32 prev; /* The previous Hash table entry index */
>+
>+ /* list of used hash table entries, starting at rx_hashtbl_used_head */
>+ u32 used_next;
>+ u32 used_prev;
>+
>+ /* ip_src based hashing */
>+ u32 src_next; /* next entry with same hash(ip_src) */
>+ u32 src_prev; /* prev entry with same hash(ip_src) */
>+ u32 src_first; /* first entry with hash(ip_src) == this entry's index */
>+
> u8 assigned; /* checking whether this entry is assigned */
> u8 ntt; /* flag - need to transmit client info */
> struct slave *slave; /* the slave assigned to this client */
>@@ -131,7 +151,7 @@ struct alb_bond_info {
> int rlb_enabled;
> struct rlb_client_info *rx_hashtbl; /* Receive hash table */
> spinlock_t rx_hashtbl_lock;
>- u32 rx_hashtbl_head;
>+ u32 rx_hashtbl_used_head;
> u8 rx_ntt; /* flag - need to transmit
> * to all rx clients
> */
>diff --git a/drivers/net/bonding/bond_debugfs.c b/drivers/net/bonding/bond_debugfs.c
>index 2cf084e..6ac855f 100644
>--- a/drivers/net/bonding/bond_debugfs.c
>+++ b/drivers/net/bonding/bond_debugfs.c
>@@ -31,8 +31,8 @@ static int bond_debug_rlb_hash_show(struct seq_file *m, void *v)
>
> spin_lock_bh(&(BOND_ALB_INFO(bond).rx_hashtbl_lock));
>
>- hash_index = bond_info->rx_hashtbl_head;
>- for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->next) {
>+ hash_index = bond_info->rx_hashtbl_used_head;
>+ for (; hash_index != RLB_NULL_INDEX; hash_index = client_info->used_next) {
> client_info = &(bond_info->rx_hashtbl[hash_index]);
> seq_printf(m, "%-15pI4 %-15pI4 %-17pM %s\n",
> &client_info->ip_src,
>
>--
>Jiri Bohac <jbohac@...e.cz>
>SUSE Labs, SUSE CZ
>
---
-Jay Vosburgh, IBM Linux Technology Center, fubar@...ibm.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists