[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <CE4DB782-91EB-4DBD-9C26-CA4C4612D58C@bamaicloud.com>
Date: Sun, 11 May 2025 22:07:46 +0800
From: Tonghao Zhang <tonghao@...aicloud.com>
To: Jay Vosburgh <jv@...sburgh.net>
Cc: netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>
Subject: Re: [PATCH net-next 1/4] net: bonding: add broadcast_neighbor option
for 802.3ad
> 2025年5月10日 下午8:44,Jay Vosburgh <jv@...sburgh.net> 写道:
>
> tonghao@...aicloud.com wrote:
>
>> From: Tonghao Zhang <tonghao@...aicloud.com>
>>
>> Stacking technology provides technical benefits but has inherent drawbacks.
>> For instance, switch software or system upgrades require simultaneous reboots
>> of all stacked switches. Additionally, stacking link failures may cause
>> stack splitting.
>>
>> To improve network stability, non-stacking solutions have been increasingly
>> adopted, particularly by public cloud providers and technology companies
>> like Alibaba, Tencent, and Didi. The server still uses dual network cards and
>> dual uplinks to two switches, and the network card mode is set to
>> bond mode 4 (IEEE 802.3ad). As aggregation ports transmit ARP/ND data
>> exclusively through one physical port, both switches in non-stacking
>> deployments must receive server ARP/ND requests. This requires bonding driver
>> modifications to broadcast ARP/ND packets through all active slave links.
>>
>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>
> I didn't really follow the explanation here without reading the
> linked article. I think it would be better to explain the basics of
> what this "non-stacking" architecture is, then describe the change to
> bonding necessary to support it. This description need not go into
> great detail; assuming I understand correctly, perhaps something along
> the lines of:
>
> "non-stacking" is a method of mimicing switch stacking that
> convinces a LACP peer, bonding in this case, connected to a set of
> "non-stacked" switches that all of its ports are connected to a single
> switch (i.e., LACP aggregator), as if those switches were stacked. This
> enables the LACP peer's ports to aggregate together, and requires (a)
> special switch configuration, described in the linked article, and (b)
> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
> packets across all ports of the active aggregator.
>
> Is that a fair summary?
Yes, pretty good. I will add more info of “no-stacking” arch and your summary.
> Regardless, the commit message should stand on its own, even if
> the linked article is gone at some point in the future.
Yes
>
>> Cc: Jay Vosburgh <jv@...sburgh.net>
>> Cc: "David S. Miller" <davem@...emloft.net>
>> Cc: Eric Dumazet <edumazet@...gle.com>
>> Cc: Jakub Kicinski <kuba@...nel.org>
>> Cc: Paolo Abeni <pabeni@...hat.com>
>> Cc: Simon Horman <horms@...nel.org>
>> Cc: Jonathan Corbet <corbet@....net>
>> Cc: Andrew Lunn <andrew+netdev@...n.ch>
>> Signed-off-by: Tonghao Zhang <tonghao@...aicloud.com>
>> ---
>> Documentation/networking/bonding.rst | 5 +++
>> drivers/net/bonding/bond_main.c | 58 +++++++++++++++++++++-------
>> drivers/net/bonding/bond_options.c | 25 ++++++++++++
>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++
>> include/net/bond_options.h | 1 +
>> include/net/bonding.h | 1 +
>> 6 files changed, 94 insertions(+), 14 deletions(-)
>>
>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>> index a4c1291d2561..0aca6e7599db 100644
>> --- a/Documentation/networking/bonding.rst
>> +++ b/Documentation/networking/bonding.rst
>> @@ -562,6 +562,11 @@ lacp_rate
>>
>> The default is slow.
>>
>> +broadcast_neighbor
>> +
>> + Option specifying whether to broadcast ARP/ND packets to all
>> + active slaves. The default is off (0).
>> +
>
> I'm happy to see documentation updates; however, please add the
> caveat that this option has no effect in modes other than 802.3ad mode.
> Also, the text is not formatted consistently with the options around it
> (block indented one tab).
Ok, update it in next patch version.
>> max_bonds
>>
>> Specifies the number of bonding devices to create for this
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index d05226484c64..c54bfba10688 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -5316,23 +5316,31 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>> return slaves->arr[hash % count];
>> }
>>
>> -/* Use this Xmit function for 3AD as well as XOR modes. The current
>> - * usable slave array is formed in the control path. The xmit function
>> - * just calculates hash and sends the packet out.
>> - */
>> -static netdev_tx_t bond_3ad_xor_xmit(struct sk_buff *skb,
>> - struct net_device *dev)
>> +static bool bond_should_broadcast_neighbor(struct bonding *bond,
>> + struct sk_buff *skb)
>> {
>> - struct bonding *bond = netdev_priv(dev);
>> - struct bond_up_slave *slaves;
>> - struct slave *slave;
>> + if (BOND_MODE(bond) != BOND_MODE_8023AD)
>> + return false;
>>
>> - slaves = rcu_dereference(bond->usable_slaves);
>> - slave = bond_xmit_3ad_xor_slave_get(bond, skb, slaves);
>> - if (likely(slave))
>> - return bond_dev_queue_xmit(bond, skb, slave->dev);
>> + if (!bond->params.broadcast_neighbor)
>> + return false;
>>
>> - return bond_tx_drop(dev, skb);
>> + if (skb->protocol == htons(ETH_P_ARP))
>> + return true;
>> +
>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>> + pskb_may_pull(skb,
>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>> +
>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>> + return true;
>> + }
>> + }
>> +
>> + return false;
>> }
>>
>> /* in broadcast mode, we send everything to all usable interfaces. */
>> @@ -5377,6 +5385,28 @@ static netdev_tx_t bond_xmit_broadcast(struct sk_buff *skb,
>> return NET_XMIT_DROP;
>> }
>>
>> +/* Use this Xmit function for 3AD as well as XOR modes. The current
>> + * usable slave array is formed in the control path. The xmit function
>> + * just calculates hash and sends the packet out.
>> + */
>> +static netdev_tx_t bond_3ad_xor_xmit(struct sk_buff *skb,
>> + struct net_device *dev)
>> +{
>> + struct bonding *bond = netdev_priv(dev);
>> + struct bond_up_slave *slaves;
>> + struct slave *slave;
>> +
>> + if (bond_should_broadcast_neighbor(bond, skb))
>> + return bond_xmit_broadcast(skb, dev);
>
> I feel like there has to be a way to implement this that won't
> add two or three branches to the transmit fast path for every packet
> when this option is not enabled (which will be the vast majority of
> cases). I'm not sure what that would look like, needs some thought.
The patch should check the .broadcast_neighbor and bond mode firstly, and ipv4 packets quickly, and then ipv6. Parsing network packets is unavoidable, especially for IPv6 packets.
static inline bool bond_should_broadcast_neighbor(struct bonding *bond,
struct sk_buff *skb)
{
if (!bond->params.broadcast_neighbor ||
BOND_MODE(bond) != BOND_MODE_8023AD)
return false;
if (skb->protocol == htons(ETH_P_ARP))
return true;
if (skb->protocol == htons(ETH_P_IPV6) &&
pskb_may_pull(skb,
sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
struct icmp6hdr *icmph = icmp6_hdr(skb);
if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
(icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
return true;
}
}
return false;
}
>
>> max_bonds
>>
>> Specifies the number of bonding devices to create for this
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index d05226484c64..c54bfba10688 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -5316,23 +5316,31 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>> return slaves->arr[hash % count];
>> }
>>
>> -/* Use this Xmit function for 3AD as well as XOR modes. The current
>> - * usable slave array is formed in the control path. The xmit function
>> - * just calculates hash and sends the packet out.
>> - */
>> -static netdev_tx_t bond_3ad_xor_xmit(struct sk_buff *skb,
>> - struct net_device *dev)
>> +static bool bond_should_broadcast_neighbor(struct bonding *bond,
>> + struct sk_buff *skb)
>> {
>> - struct bonding *bond = netdev_priv(dev);
>> - struct bond_up_slave *slaves;
>> - struct slave *slave;
>> + if (BOND_MODE(bond) != BOND_MODE_8023AD)
>> + return false;
>>
>> - slaves = rcu_dereference(bond->usable_slaves);
>> - slave = bond_xmit_3ad_xor_slave_get(bond, skb, slaves);
>> - if (likely(slave))
>> - return bond_dev_queue_xmit(bond, skb, slave->dev);
>> + if (!bond->params.broadcast_neighbor)
>> + return false;
>>
>> - return bond_tx_drop(dev, skb);
>> + if (skb->protocol == htons(ETH_P_ARP))
>> + return true;
>> +
>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>> + pskb_may_pull(skb,
>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>> +
>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>> + return true;
>> + }
>> + }
>> +
>> + return false;
>> }
>>
>> /* in broadcast mode, we send everything to all usable interfaces. */
>> @@ -5377,6 +5385,28 @@ static netdev_tx_t bond_xmit_broadcast(struct sk_buff *skb,
>> return NET_XMIT_DROP;
>> }
>>
>> +/* Use this Xmit function for 3AD as well as XOR modes. The current
>> + * usable slave array is formed in the control path. The xmit function
>> + * just calculates hash and sends the packet out.
>> + */
>> +static netdev_tx_t bond_3ad_xor_xmit(struct sk_buff *skb,
>> + struct net_device *dev)
>> +{
>> + struct bonding *bond = netdev_priv(dev);
>> + struct bond_up_slave *slaves;
>> + struct slave *slave;
>> +
>> + if (bond_should_broadcast_neighbor(bond, skb))
>> + return bond_xmit_broadcast(skb, dev);
>
> I feel like there has to be a way to implement this that won't
> add two or three branches to the transmit fast path for every packet
> when this option is not enabled (which will be the vast majority of
> cases). I'm not sure what that would look like, needs some thought.
>
> Is the call to bond_should_broadcast_neighbor inlined at compile
> time?
>
> -J
>
>> +
>> + slaves = rcu_dereference(bond->usable_slaves);
>> + slave = bond_xmit_3ad_xor_slave_get(bond, skb, slaves);
>> + if (likely(slave))
>> + return bond_dev_queue_xmit(bond, skb, slave->dev);
>> +
>> + return bond_tx_drop(dev, skb);
>> +}
>> +
>> /*------------------------- Device initialization ---------------------------*/
>>
>> /* Lookup the slave that corresponds to a qid */
>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>> index 91893c29b899..38e8f03d1707 100644
>> --- a/drivers/net/bonding/bond_options.c
>> +++ b/drivers/net/bonding/bond_options.c
>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>> const struct bond_opt_value *newval);
>> static int bond_option_coupled_control_set(struct bonding *bond,
>> const struct bond_opt_value *newval);
>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>> + const struct bond_opt_value *newval);
>>
>> static const struct bond_opt_value bond_mode_tbl[] = {
>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>> { NULL, -1, 0},
>> };
>>
>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>> + { "on", 1, 0},
>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>> + { NULL, -1, 0}
>> +};
>> +
>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>> [BOND_OPT_MODE] = {
>> .id = BOND_OPT_MODE,
>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>> .flags = BOND_OPTFLAG_IFDOWN,
>> .values = bond_coupled_control_tbl,
>> .set = bond_option_coupled_control_set,
>> + },
>> + [BOND_OPT_BROADCAST_NEIGH] = {
>> + .id = BOND_OPT_BROADCAST_NEIGH,
>> + .name = "broadcast_neighbor",
>> + .desc = "Broadcast neighbor packets to all slaves",
>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>> + .values = bond_broadcast_neigh_tbl,
>> + .set = bond_option_broadcast_neigh_set,
>> }
>> };
>>
>> @@ -1840,3 +1856,12 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>> bond->params.coupled_control = newval->value;
>> return 0;
>> }
>> +
>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>> + const struct bond_opt_value *newval)
>> +{
>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>> + newval->value);
>> + bond->params.broadcast_neighbor = newval->value;
>> + return 0;
>> +}
>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>> index 1e13bb170515..76f2a1bf57c2 100644
>> --- a/drivers/net/bonding/bond_sysfs.c
>> +++ b/drivers/net/bonding/bond_sysfs.c
>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>> static DEVICE_ATTR(ad_user_port_key, 0644,
>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>
>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>> + struct device_attribute *attr,
>> + char *buf)
>> +{
>> + struct bonding *bond = to_bond(d);
>> + const struct bond_opt_value *val;
>> +
>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>> + bond->params.broadcast_neighbor);
>> +
>> + return sysfs_emit(buf, "%s %d\n", val->string,
>> + bond->params.broadcast_neighbor);
>> +}
>> +
>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>> +
>> static struct attribute *per_bond_attrs[] = {
>> &dev_attr_slaves.attr,
>> &dev_attr_mode.attr,
>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>> &dev_attr_ad_actor_system.attr,
>> &dev_attr_ad_user_port_key.attr,
>> &dev_attr_arp_missed_max.attr,
>> + &dev_attr_broadcast_neighbor.attr,
>> NULL,
>> };
>>
>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>> index 18687ccf0638..022b122a9fb6 100644
>> --- a/include/net/bond_options.h
>> +++ b/include/net/bond_options.h
>> @@ -77,6 +77,7 @@ enum {
>> BOND_OPT_NS_TARGETS,
>> BOND_OPT_PRIO,
>> BOND_OPT_COUPLED_CONTROL,
>> + BOND_OPT_BROADCAST_NEIGH,
>> BOND_OPT_LAST
>> };
>>
>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>> index 95f67b308c19..1eafd15eaad9 100644
>> --- a/include/net/bonding.h
>> +++ b/include/net/bonding.h
>> @@ -149,6 +149,7 @@ struct bond_params {
>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>> #endif
>> int coupled_control;
>> + int broadcast_neighbor;
>>
>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>> u8 ad_actor_system[ETH_ALEN + 2];
>> --
>> 2.34.1
>>
>
> ---
> -Jay Vosburgh, jv@...sburgh.net
Powered by blists - more mailing lists