[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <0D2BAA86-E641-4582-A4EE-8E30CA71C660@bamaicloud.com>
Date: Tue, 27 May 2025 23:35:59 +0800
From: Tonghao Zhang <tonghao@...aicloud.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org,
Jay Vosburgh <jv@...sburgh.net>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>,
Simon Horman <horms@...nel.org>,
Jonathan Corbet <corbet@....net>,
Andrew Lunn <andrew+netdev@...n.ch>,
Steven Rostedt <rostedt@...dmis.org>,
Masami Hiramatsu <mhiramat@...nel.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Nikolay Aleksandrov <razor@...ckwall.org>,
Zengbing Tu <tuzengbing@...iglobal.com>
Subject: Re: [PATCH RESEND net-next v5 1/4] net: bonding: add
broadcast_neighbor option for 802.3ad
> 2025年5月27日 22:00,Paolo Abeni <pabeni@...hat.com> 写道:
>
> On 5/22/25 10:55 AM, Tonghao Zhang wrote:
>> Stacking technology is a type of technology used to expand ports on
>> Ethernet switches. It is widely used as a common access method in
>> large-scale Internet data center architectures. Years of practice
>> have proved that stacking technology has advantages and disadvantages
>> in high-reliability network architecture scenarios. For instance,
>> in stacking networking arch, conventional switch system upgrades
>> require multiple stacked devices to restart at the same time.
>> Therefore, it is inevitable that the business will be interrupted
>> for a while. It is for this reason that "no-stacking" in data centers
>> has become a trend. Additionally, when the stacking link connecting
>> the switches fails or is abnormal, the stack will split. Although it is
>> not common, it still happens in actual operation. The problem is that
>> after the split, it is equivalent to two switches with the same
>> configuration appearing in the network, causing network configuration
>> conflicts and ultimately interrupting the services carried by the
>> stacking system.
>>
>> To improve network stability, "non-stacking" solutions have been
>> increasingly adopted, particularly by public cloud providers and
>> tech companies like Alibaba, Tencent, and Didi. "non-stacking" is
>> a method of mimicing switch stacking that convinces a LACP peer,
>> bonding in this case, connected to a set of "non-stacked" switches
>> that all of its ports are connected to a single switch
>> (i.e., LACP aggregator), as if those switches were stacked. This
>> enables the LACP peer's ports to aggregate together, and requires
>> (a) special switch configuration, described in the linked article,
>> and (b) modifications to the bonding 802.3ad (LACP) mode to send
>> all ARP/ND packets across all ports of the active aggregator.
>>
>> Note that, with multiple aggregators, the current broadcast mode
>> logic will send only packets to the selected aggregator(s).
>>
>> +-----------+ +-----------+
>> | switch1 | | switch2 |
>> +-----------+ +-----------+
>> ^ ^
>> | |
>> +-----------------+
>> | bond4 lacp |
>> +-----------------+
>> | |
>> | NIC1 | NIC2
>> +-----------------+
>> | server |
>> +-----------------+
>
>
> IHMO all the above wording should be placed in the cover letter. This
> commit message should instead be more related to the actual code.
I add more commit message in this patch. In the future, everyone will have a clearer understanding of why this feature is needed.
This is the first version of the discussion:
https://patchwork.kernel.org/project/netdevbpf/patch/20250510044504.52618-2-tonghao@bamaicloud.com/
>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index d05226484c64..b5c34d7f126c 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -212,6 +212,8 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>
>> unsigned int bond_net_id __read_mostly;
>>
>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>> +
>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>> {
>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>> @@ -4461,6 +4463,9 @@ static int bond_open(struct net_device *bond_dev)
>>
>> bond_for_each_slave(bond, slave, iter)
>> dev_mc_add(slave->dev, lacpdu_mcast_addr);
>> +
>> + if (bond->params.broadcast_neighbor)
>> + static_branch_inc(&bond_bcast_neigh_enabled);
>> }
>>
>> if (bond_mode_can_use_xmit_hash(bond))
>> @@ -4480,6 +4485,10 @@ static int bond_close(struct net_device *bond_dev)
>> bond_alb_deinitialize(bond);
>> bond->recv_probe = NULL;
>>
>> + if (BOND_MODE(bond) == BOND_MODE_8023AD &&
>> + bond->params.broadcast_neighbor)
>> + static_branch_dec(&bond_bcast_neigh_enabled);
>
> What if the user enables BOND_OPT_BROADCAST_NEIGH and later switch the
> bond mode? it looks like there will be bad accounting for static branch.
If you want to change the mode, then you must turn off the device first. So no bad accounting
This is v2 of the discussion:
https://patchwork.kernel.org/project/netdevbpf/patch/4270C010E5516F3B+20250513094750.23387-2-tonghao@bamaicloud.com/
>
>> +
>> if (bond_uses_primary(bond)) {
>> rcu_read_lock();
>> slave = rcu_dereference(bond->curr_active_slave);
>> @@ -5316,6 +5325,37 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>> return slaves->arr[hash % count];
>> }
>>
>> +static bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>> + struct net_device *dev)
>> +{
>> + struct bonding *bond = netdev_priv(dev);
>> + struct {
>> + struct ipv6hdr ip6;
>> + struct icmp6hdr icmp6;
>> + } *combined, _combined;
>> +
>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>> + return false;
>> +
>> + if (!bond->params.broadcast_neighbor)
>> + return false;
>> +
>> + if (skb->protocol == htons(ETH_P_ARP))
>> + return true;
>> +
>> + if (skb->protocol == htons(ETH_P_IPV6)) {
>> + combined = skb_header_pointer(skb, skb_mac_header_len(skb),
>> + sizeof(_combined),
>> + &_combined);
>> + if (combined && combined->ip6.nexthdr == NEXTHDR_ICMP &&
>> + (combined->icmp6.icmp6_type == NDISC_NEIGHBOUR_SOLICITATION ||
>> + combined->icmp6.icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>> +
>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>> * usable slave array is formed in the control path. The xmit function
>> * just calculates hash and sends the packet out.
>> @@ -5337,15 +5377,25 @@ static netdev_tx_t bond_3ad_xor_xmit(struct sk_buff *skb,
>>
>> /* in broadcast mode, we send everything to all usable interfaces. */
>> static netdev_tx_t bond_xmit_broadcast(struct sk_buff *skb,
>> - struct net_device *bond_dev)
>> + struct net_device *bond_dev,
>> + bool all_slaves)
>> {
>> struct bonding *bond = netdev_priv(bond_dev);
>> - struct slave *slave = NULL;
>> - struct list_head *iter;
>> + struct bond_up_slave *slaves;
>> bool xmit_suc = false;
>> bool skb_used = false;
>> + int slaves_count, i;
>>
>> - bond_for_each_slave_rcu(bond, slave, iter) {
>> + rcu_read_lock();
>
> Why this the RCU lock is now needed? AFAICS the caller scope does not
Using this to protect read side to access slave in all_slaves or usable_slaves. bond_start_xmit use the rcu_read_lock, so I can remove the rcu_read_lock in bond_xmit_broadcast.
> change. This is either not needed, or deserves a separate patch, with a
> suitable fixes tag.
>
> /P
>
>
>
Powered by blists - more mailing lists