[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-Id: <84F74587-5EEB-4D02-B49E-CDE65D963E3F@bamaicloud.com>
Date: Wed, 14 May 2025 14:07:11 +0800
From: Tonghao Zhang <tonghao@...aicloud.com>
To: Jay Vosburgh <jv@...sburgh.net>
Cc: Nikolay Aleksandrov <razor@...ckwall.org>,
netdev@...r.kernel.org,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>,
Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>,
Zengbing Tu <tuzengbing@...iglobal.com>
Subject: Re: [PATCH net-next v2 1/4] net: bonding: add broadcast_neighbor
option for 802.3ad
> 2025年5月13日 21:30,Jay Vosburgh <jv@...sburgh.net> 写道:
>
> Nikolay Aleksandrov <razor@...ckwall.org> wrote:
>
>> On 5/13/25 15:13, Tonghao Zhang wrote:
>>>
>>>
>>>> 2025年5月13日 18:18,Nikolay Aleksandrov <razor@...ckwall.org> 写道:
>>>>
>>>> On 5/13/25 12:47, Tonghao Zhang wrote:
>>>>> Stacking technology is a type of technology used to expand ports on
>>>>> Ethernet switches. It is widely used as a common access method in
>>>>> large-scale Internet data center architectures. Years of practice
>>>>> have proved that stacking technology has advantages and disadvantages
>>>>> in high-reliability network architecture scenarios. For instance,
>>>>> in stacking networking arch, conventional switch system upgrades
>>>>> require multiple stacked devices to restart at the same time.
>>>>> Therefore, it is inevitable that the business will be interrupted
>>>>> for a while. It is for this reason that "no-stacking" in data centers
>>>>> has become a trend. Additionally, when the stacking link connecting
>>>>> the switches fails or is abnormal, the stack will split. Although it is
>>>>> not common, it still happens in actual operation. The problem is that
>>>>> after the split, it is equivalent to two switches with the same configuration
>>>>> appearing in the network, causing network configuration conflicts and
>>>>> ultimately interrupting the services carried by the stacking system.
>>>>>
>>>>> To improve network stability, "non-stacking" solutions have been increasingly
>>>>> adopted, particularly by public cloud providers and tech companies
>>>>> like Alibaba, Tencent, and Didi. "non-stacking" is a method of mimicing switch
>>>>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>>>>> "non-stacked" switches that all of its ports are connected to a single
>>>>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>>>>> enables the LACP peer's ports to aggregate together, and requires (a)
>>>>> special switch configuration, described in the linked article, and (b)
>>>>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>>>>> packets across all ports of the active aggregator.
>>>>>
>>>>> ----------- -----------
>>>>> | switch1 | | switch2 |
>>>>> ----------- -----------
>>>>> ^ ^
>>>>> | |
>>>>> ---------------
>>>>> | bond4 lacp |
>>>>> ---------------
>>>>> | NIC1 | NIC2
>>>>> ---------------------
>>>>> | server |
>>>>> ---------------------
>>>>>
>>>>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>>>>
>>>>> Cc: Jay Vosburgh <jv@...sburgh.net>
>>>>> Cc: "David S. Miller" <davem@...emloft.net>
>>>>> Cc: Eric Dumazet <edumazet@...gle.com>
>>>>> Cc: Jakub Kicinski <kuba@...nel.org>
>>>>> Cc: Paolo Abeni <pabeni@...hat.com>
>>>>> Cc: Simon Horman <horms@...nel.org>
>>>>> Cc: Jonathan Corbet <corbet@....net>
>>>>> Cc: Andrew Lunn <andrew+netdev@...n.ch>
>>>>> Signed-off-by: Tonghao Zhang <tonghao@...aicloud.com>
>>>>> Signed-off-by: Zengbing Tu <tuzengbing@...iglobal.com>
>>>>> ---
>>>>> Documentation/networking/bonding.rst | 6 +++++
>>>>> drivers/net/bonding/bond_main.c | 39 ++++++++++++++++++++++++++++
>>>>> drivers/net/bonding/bond_options.c | 34 ++++++++++++++++++++++++
>>>>> drivers/net/bonding/bond_sysfs.c | 18 +++++++++++++
>>>>> include/net/bond_options.h | 1 +
>>>>> include/net/bonding.h | 3 +++
>>>>> 6 files changed, 101 insertions(+)
>>>>>
>>>>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>>>>> index a4c1291d2561..14f7593d888d 100644
>>>>> --- a/Documentation/networking/bonding.rst
>>>>> +++ b/Documentation/networking/bonding.rst
>>>>> @@ -562,6 +562,12 @@ lacp_rate
>>>>>
>>>>> The default is slow.
>>>>>
>>>>> +broadcast_neighbor
>>>>> +
>>>>> + Option specifying whether to broadcast ARP/ND packets to all
>>>>> + active slaves. This option has no effect in modes other than
>>>>> + 802.3ad mode. The default is off (0).
>>>>> +
>>>>> max_bonds
>>>>>
>>>>> Specifies the number of bonding devices to create for this
>>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>>> index d05226484c64..8ee26ddddbc8 100644
>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>> @@ -212,6 +212,9 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>>>>
>>>>> unsigned int bond_net_id __read_mostly;
>>>>>
>>>>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>>> +EXPORT_SYMBOL_GPL(bond_bcast_neigh_enabled);
>>>>
>>>> No need to export the symbol, you can add bond helpers to inc/dec it.
>
> Agreed.
>
>>>>> +
>>>>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>>>>> {
>>>>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>>>>> @@ -4480,6 +4483,9 @@ static int bond_close(struct net_device *bond_dev)
>>>>> bond_alb_deinitialize(bond);
>>>>> bond->recv_probe = NULL;
>>>>>
>>>>> + if (bond->params.broadcast_neighbor)
>>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>>> +
>>>>
>>>> This branch doesn't get re-enabled if the bond is brought up afterwards.
>>> This is not right place to dec bond_bcast_neigh_enabled, I should dec this value in bond_uninit(). Because we can destroy a bond net device which broadcast_neighbor enabled.
>>> If we don’t check broadcast_neighbor in destroy path. bond_bcast_neigh_enabled always is enabled. For example:
>>> ip link add bondx type bond mode 802.3ad ... broadcast_neighbor on
>>> ip link add bondy type bond mode 802.3ad ... broadcast_neighbor off
>>>
>>> ip li del dev bondx
>>> In this case, bond_bcast_neigh_enabled is enabled for bondy while broadcast_neighbor is off.
>
> So it sounds like the flow should be:
>
> - increment the static key when:
> - option enabled on IFF_UP bond (transition from off to on)
> - bond set to up with option enabled (bond_open)
>
> - decrement the static key when:
> - option disabled on IFF_UP bond (transition from on to off)
> - bond with option enabled is set to down (bond_close)
>
> The unregister path in unregister_netdevice_many_notify that
> calls bond_uninit will call dev_close before the ndo_uninit, so I don't
> think we need to handle the static key in bond_uninit separately.
Thanks Jay, updated in v3
>
>
>>>>>>> if (bond_uses_primary(bond)) {
>>>>> rcu_read_lock();
>>>>> slave = rcu_dereference(bond->curr_active_slave);
>>>>> @@ -5316,6 +5322,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>>>>> return slaves->arr[hash % count];
>>>>> }
>>>>>
>>>>> +static inline bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>>>>
>>>> don't use inline in .c files
>>> As suggested by Jay, inline the codes for performance. I think it is better to keep inline. By the way, there are many inline function in bond_main.c and other *.c
>>
>> This is a general rule, the compiler knows what to do. The inlines in bond_main are old
>> and can be removed. Do not add new ones.
>
> To clarify, the intent of my question was to ask whether or not
> the compiler was inlining the existing code in Tonghao's earlier patch.
> I did not intend to suggest that an explicit inline should be added.
>
> That said, I agree with Nikolay in that inline should not be
> added explicitly.
Ok
>
>>>>> + struct net_device *dev)
>>>>> +{
>>>>> + struct bonding *bond = netdev_priv(dev);
>>>>> +
>>>>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>>>>> + return false;
>>>>> +
>>>>> + if (!bond->params.broadcast_neighbor)
>>>>> + return false;
>>>>> +
>>>>> + if (skb->protocol == htons(ETH_P_ARP))
>>>>> + return true;
>>>>> +
>>>>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>>>>> + pskb_may_pull(skb,
>>>>> + sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>>>>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>>>>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>>>>> +
>>>>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>>>>> + (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>>>>> + return true;
>>>>> + }
>>>>> + }
>>>>> +
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>>>>> * usable slave array is formed in the control path. The xmit function
>>>>> * just calculates hash and sends the packet out.
>>>>> @@ -5583,6 +5618,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>>>>> case BOND_MODE_ACTIVEBACKUP:
>>>>> return bond_xmit_activebackup(skb, dev);
>>>>> case BOND_MODE_8023AD:
>>>>> + if (bond_should_broadcast_neighbor(skb, dev))
>>>>> + return bond_xmit_broadcast(skb, dev);
>>>>> + fallthrough;
>>>>> case BOND_MODE_XOR:
>>>>> return bond_3ad_xor_xmit(skb, dev);
>>>>> case BOND_MODE_BROADCAST:
>>>>> @@ -6462,6 +6500,7 @@ static int __init bond_check_params(struct bond_params *params)
>>>>> eth_zero_addr(params->ad_actor_system);
>>>>> params->ad_user_port_key = ad_user_port_key;
>>>>> params->coupled_control = 1;
>>>>> + params->broadcast_neighbor = 0;
>>>>> if (packets_per_slave > 0) {
>>>>> params->reciprocal_packets_per_slave =
>>>>> reciprocal_value(packets_per_slave);
>>>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>>>> index 91893c29b899..dca52d93f513 100644
>>>>> --- a/drivers/net/bonding/bond_options.c
>>>>> +++ b/drivers/net/bonding/bond_options.c
>>>>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>>>>> const struct bond_opt_value *newval);
>>>>> static int bond_option_coupled_control_set(struct bonding *bond,
>>>>> const struct bond_opt_value *newval);
>>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>>> + const struct bond_opt_value *newval);
>>>>>
>>>>> static const struct bond_opt_value bond_mode_tbl[] = {
>>>>> { "balance-rr", BOND_MODE_ROUNDROBIN, BOND_VALFLAG_DEFAULT},
>>>>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>>>>> { NULL, -1, 0},
>>>>> };
>>>>>
>>>>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>>>>> + { "on", 1, 0},
>>>>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>>>>
>>>> I know the option above is using this order, but it is a bit counter-intuitive to
>>>> have their places reversed wrt their values, could you please re-order these as
>>>> the other bond on/off options? This is a small nit, I don't have a strong preference
>>>> but it is more intuitive to have them in their value order. :)
>>> Ok
>>>>
>>>>> + { NULL, -1, 0}> +};
>>>>> +
>>>>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>>> [BOND_OPT_MODE] = {
>>>>> .id = BOND_OPT_MODE,
>>>>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>>>> .flags = BOND_OPTFLAG_IFDOWN,
>>>>> .values = bond_coupled_control_tbl,
>>>>> .set = bond_option_coupled_control_set,
>>>>> + },
>>>>> + [BOND_OPT_BROADCAST_NEIGH] = {
>>>>> + .id = BOND_OPT_BROADCAST_NEIGH,
>>>>> + .name = "broadcast_neighbor",
>>>>> + .desc = "Broadcast neighbor packets to all slaves",
>>>>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>>>>> + .values = bond_broadcast_neigh_tbl,
>>>>> + .set = bond_option_broadcast_neigh_set,
>>>>> }
>>>>> };
>>>>>
>>>>> @@ -1840,3 +1856,21 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>>>>> bond->params.coupled_control = newval->value;
>>>>> return 0;
>>>>> }
>>>>> +
>>>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>>>> + const struct bond_opt_value *newval)
>>>>> +{
>>>>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %llu\n",
>>>>> + newval->value);
>>>>> +
>>>>> + if (bond->params.broadcast_neighbor == newval->value)
>>>>> + return 0;
>>>>> +
>>>>> + bond->params.broadcast_neighbor = newval->value;
>>>>> + if (bond->params.broadcast_neighbor)
>>>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>>>> + else
>>>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>>>
>>>> If the bond has been brought down then the branch has been already decremented.
>>>> You'll have to synchronize this with bond open/close or alternatively mark the option
>>>> as being able to be changed only when the bond is up (there is an option flag for that).
>>> I will check bond_bcast_neigh_enabled in bond_unint() instead of in bond_close().
>>>>>>
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
>>>>> index 1e13bb170515..4a53850b2c68 100644
>>>>> --- a/drivers/net/bonding/bond_sysfs.c
>>>>> +++ b/drivers/net/bonding/bond_sysfs.c
>>>>> @@ -752,6 +752,23 @@ static ssize_t bonding_show_ad_user_port_key(struct device *d,
>>>>> static DEVICE_ATTR(ad_user_port_key, 0644,
>>>>> bonding_show_ad_user_port_key, bonding_sysfs_store_option);
>>>>>
>>>>> +static ssize_t bonding_show_broadcast_neighbor(struct device *d,
>>>>> + struct device_attribute *attr,
>>>>> + char *buf)
>>>>> +{
>>>>> + struct bonding *bond = to_bond(d);
>>>>> + const struct bond_opt_value *val;
>>>>> +
>>>>> + val = bond_opt_get_val(BOND_OPT_BROADCAST_NEIGH,
>>>>> + bond->params.broadcast_neighbor);
>>>>> +
>>>>> + return sysfs_emit(buf, "%s %d\n", val->string,
>>>>> + bond->params.broadcast_neighbor);
>>>>> +}
>>>>> +
>>>>> +static DEVICE_ATTR(broadcast_neighbor, 0644,
>>>>> + bonding_show_broadcast_neighbor, bonding_sysfs_store_option);
>>>>> +
>>>>
>>>> sysfs options are deprecated, please don't extend sysfs
>>>> netlink is the preferred way for new options
>>> I think it is still useful to config option via sysfs, and I find other new option still use the sysfs.
>>
>> This is wrong, there are no new options that have been added to sysfs recently,
>> the latest option being "coupled_control". As I already said - sysfs has been deprecated
>> for quite some time, don't add new options to it.
>
> Agreed, no new options in bonding's sysfs. We are intentionally
> encouraging the use of netlink / iproute for bonding configuration
> instead of sysfs.
sysfs option of this patch, will be removed in v3.
>
> -J
>
>>>>> static struct attribute *per_bond_attrs[] = {
>>>>> &dev_attr_slaves.attr,
>>>>> &dev_attr_mode.attr,
>>>>> @@ -791,6 +808,7 @@ static struct attribute *per_bond_attrs[] = {
>>>>> &dev_attr_ad_actor_system.attr,
>>>>> &dev_attr_ad_user_port_key.attr,
>>>>> &dev_attr_arp_missed_max.attr,
>>>>> + &dev_attr_broadcast_neighbor.attr,
>>>>> NULL,
>>>>> };
>>>>>
>>>>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>>>>> index 18687ccf0638..022b122a9fb6 100644
>>>>> --- a/include/net/bond_options.h
>>>>> +++ b/include/net/bond_options.h
>>>>> @@ -77,6 +77,7 @@ enum {
>>>>> BOND_OPT_NS_TARGETS,
>>>>> BOND_OPT_PRIO,
>>>>> BOND_OPT_COUPLED_CONTROL,
>>>>> + BOND_OPT_BROADCAST_NEIGH,
>>>>> BOND_OPT_LAST
>>>>> };
>>>>>
>>>>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>>>>> index 95f67b308c19..e06f0d63b2c1 100644
>>>>> --- a/include/net/bonding.h
>>>>> +++ b/include/net/bonding.h
>>>>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>>>>> #define is_netpoll_tx_blocked(dev) (0)
>>>>> #endif
>>>>>
>>>>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>>>> +
>>>>> struct bond_params {
>>>>> int mode;
>>>>> int xmit_policy;
>>>>> @@ -149,6 +151,7 @@ struct bond_params {
>>>>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>>>>> #endif
>>>>> int coupled_control;
>>>>> + int broadcast_neighbor;
>>>>>
>>>>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>>>>> u8 ad_actor_system[ETH_ALEN + 2];
>>>>
>>>>
>>>>
>>>
>>
>
> ---
> -Jay Vosburgh, jv@...sburgh.net
Powered by blists - more mailing lists