Message-ID: <52315.1747374801@vermin>
Date: Fri, 16 May 2025 07:53:21 +0200
From: Jay Vosburgh <jv@...sburgh.net>
To: Tonghao Zhang <tonghao@...aicloud.com>
cc: netdev@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
    Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
    Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
    Jonathan Corbet <corbet@....net>,
    Andrew Lunn <andrew+netdev@...n.ch>,
    Zengbing Tu <tuzengbing@...iglobal.com>
Subject: Re: [PATCH net-next v3 1/4] net: bonding: add broadcast_neighbor option for 802.3ad


Tonghao Zhang <tonghao@...aicloud.com> wrote:
>> On May 14, 2025, at 17:38, Jay Vosburgh <jv@...sburgh.net> wrote:
>> 
>> Tonghao Zhang <tonghao@...aicloud.com> wrote:
>> 
>>> Stacking technology is a type of technology used to expand ports on
>>> Ethernet switches. It is widely used as a common access method in
>>> large-scale Internet data center architectures. Years of practice
>>> have proved that stacking technology has advantages and disadvantages
>>> in high-reliability network architecture scenarios. For instance,
>>> in stacking networking arch, conventional switch system upgrades
>>> require multiple stacked devices to restart at the same time.
>>> Therefore, it is inevitable that the business will be interrupted
>>> for a while. It is for this reason that "no-stacking" in data centers
>>> has become a trend. Additionally, when the stacking link connecting
>>> the switches fails or is abnormal, the stack will split. Although it is
>>> not common, it still happens in actual operation. The problem is that
>>> after the split, it is equivalent to two switches with the same configuration
>>> appearing in the network, causing network configuration conflicts and
>>> ultimately interrupting the services carried by the stacking system.
>>> 
>>> To improve network stability, "non-stacking" solutions have been increasingly
>>> adopted, particularly by public cloud providers and tech companies
>>> like Alibaba, Tencent, and Didi. "Non-stacking" is a method of mimicking switch
>>> stacking that convinces a LACP peer, bonding in this case, connected to a set of
>>> "non-stacked" switches that all of its ports are connected to a single
>>> switch (i.e., LACP aggregator), as if those switches were stacked. This
>>> enables the LACP peer's ports to aggregate together, and requires (a)
>>> special switch configuration, described in the linked article, and (b)
>>> modifications to the bonding 802.3ad (LACP) mode to send all ARP / ND
>>> packets across all ports of the active aggregator.
>> 
>> Please reformat the above to wrap at 75 columns; otherwise the
>> text in "git log" will wrap at 80 columns (because git log indents the
>> log message).
>OK, I will reformat the commit log of every patch.
>> 
>>> -----------     -----------
>>> |  switch1  |   |  switch2  |
>>> -----------     -----------
>>>        ^           ^
>>>        |           |
>>>      -----------------
>>>     |   bond4 lacp    |
>>>      -----------------
>>>        |           |
>>>        | NIC1      | NIC2
>>>      -----------------
>>>     |     server      |
>>>      -----------------
>>> 
>>> - https://www.ruijie.com/fr-fr/support/tech-gallery/de-stack-data-center-network-architecture/
>>> 
>>> Cc: Jay Vosburgh <jv@...sburgh.net>
>>> Cc: "David S. Miller" <davem@...emloft.net>
>>> Cc: Eric Dumazet <edumazet@...gle.com>
>>> Cc: Jakub Kicinski <kuba@...nel.org>
>>> Cc: Paolo Abeni <pabeni@...hat.com>
>>> Cc: Simon Horman <horms@...nel.org>
>>> Cc: Jonathan Corbet <corbet@....net>
>>> Cc: Andrew Lunn <andrew+netdev@...n.ch>
>>> Signed-off-by: Tonghao Zhang <tonghao@...aicloud.com>
>>> Signed-off-by: Zengbing Tu <tuzengbing@...iglobal.com>
>>> ---
>>> Documentation/networking/bonding.rst |  6 ++++
>>> drivers/net/bonding/bond_main.c      | 41 ++++++++++++++++++++++++++++
>>> drivers/net/bonding/bond_options.c   | 35 ++++++++++++++++++++++++
>>> include/net/bond_options.h           |  1 +
>>> include/net/bonding.h                |  3 ++
>>> 5 files changed, 86 insertions(+)
>>> 
>>> diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
>>> index a4c1291d2561..14f7593d888d 100644
>>> --- a/Documentation/networking/bonding.rst
>>> +++ b/Documentation/networking/bonding.rst
>>> @@ -562,6 +562,12 @@ lacp_rate
>>> 
>>> The default is slow.
>>> 
>>> +broadcast_neighbor
>>> +
>>> + Option specifying whether to broadcast ARP/ND packets to all
>>> + active slaves.  This option has no effect in modes other than
>>> + 802.3ad mode.  The default is off (0).
>>> +
>>> max_bonds
>>> 
>>> Specifies the number of bonding devices to create for this
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index d05226484c64..db6a9c8db47c 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -212,6 +212,8 @@ atomic_t netpoll_block_tx = ATOMIC_INIT(0);
>>> 
>>> unsigned int bond_net_id __read_mostly;
>>> 
>>> +DEFINE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>> +
>>> static const struct flow_dissector_key flow_keys_bonding_keys[] = {
>>> {
>>> .key_id = FLOW_DISSECTOR_KEY_CONTROL,
>>> @@ -4461,6 +4463,9 @@ static int bond_open(struct net_device *bond_dev)
>>> 
>>> bond_for_each_slave(bond, slave, iter)
>>> dev_mc_add(slave->dev, lacpdu_mcast_addr);
>>> +
>>> + if (bond->params.broadcast_neighbor)
>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>> }
>>> 
>>> if (bond_mode_can_use_xmit_hash(bond))
>>> @@ -4480,6 +4485,9 @@ static int bond_close(struct net_device *bond_dev)
>>> bond_alb_deinitialize(bond);
>>> bond->recv_probe = NULL;
>>> 
>>> + if (bond->params.broadcast_neighbor)
>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>> +
>>> if (bond_uses_primary(bond)) {
>>> rcu_read_lock();
>>> slave = rcu_dereference(bond->curr_active_slave);
>>> @@ -5316,6 +5324,35 @@ static struct slave *bond_xdp_xmit_3ad_xor_slave_get(struct bonding *bond,
>>> return slaves->arr[hash % count];
>>> }
>>> 
>>> +static bool bond_should_broadcast_neighbor(struct sk_buff *skb,
>>> +    struct net_device *dev)
>>> +{
>>> + struct bonding *bond = netdev_priv(dev);
>>> +
>>> + if (!static_branch_unlikely(&bond_bcast_neigh_enabled))
>>> + return false;
>>> +
>>> + if (!bond->params.broadcast_neighbor)
>>> + return false;
>>> +
>>> + if (skb->protocol == htons(ETH_P_ARP))
>>> + return true;
>>> +
>>> + if (skb->protocol == htons(ETH_P_IPV6) &&
>>> +     pskb_may_pull(skb,
>>> +   sizeof(struct ipv6hdr) + sizeof(struct icmp6hdr))) {
>>> + if (ipv6_hdr(skb)->nexthdr == IPPROTO_ICMPV6) {
>>> + struct icmp6hdr *icmph = icmp6_hdr(skb);
>>> +
>>> + if ((icmph->icmp6_type == NDISC_NEIGHBOUR_SOLICITATION) ||
>>> +     (icmph->icmp6_type == NDISC_NEIGHBOUR_ADVERTISEMENT))
>>> + return true;
>>> + }
>>> + }
>> 
>> I'd recommend you look at bond_na_rcv() and mimic its logic for
>> inspecting the ipv6 and icmp6 headers by using skb_header_pointer()
>> instead of pskb_may_pull().  The skb_header_pointer() function does not
>> alter the skb (it will not move data into the skb linear area), and
>> would have lower cost for this use case.  With broadcast_neighbor
>> enabled, every IPv6 packet will pass through this test, so minimizing
>> its cost should be of benefit.
>Ok
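
	To illustrate the skb_header_pointer() point above, here is a
rough userspace model, not kernel code.  header_pointer() and
is_neighbor_packet() are hypothetical names made up for this sketch; the
helper only imitates the read-only "copy the header into a caller-supplied
buffer" behavior of skb_header_pointer() on a flat byte array.  Like the
check in the patch, it assumes ICMPv6 directly follows the fixed IPv6
header and does not walk extension headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETHERTYPE_ARP   0x0806
#define ETHERTYPE_IPV6  0x86DD
#define NEXTHDR_ICMPV6  58
#define ND_NS           135	/* neighbour solicitation */
#define ND_NA           136	/* neighbour advertisement */
#define IPV6_HDR_LEN    40

/* Hypothetical stand-in for skb_header_pointer(): copy len bytes at
 * offset into tmp and return tmp, or NULL if the packet is too short.
 * The packet itself is never modified.
 */
static const uint8_t *header_pointer(const uint8_t *pkt, size_t pkt_len,
				     size_t offset, size_t len, uint8_t *tmp)
{
	assert(pkt && tmp);
	if (offset > pkt_len || len > pkt_len - offset)
		return NULL;
	memcpy(tmp, pkt + offset, len);
	return tmp;
}

/* pkt points at the network header; ethertype is in host byte order. */
static bool is_neighbor_packet(uint16_t ethertype, const uint8_t *pkt,
			       size_t pkt_len)
{
	uint8_t buf[IPV6_HDR_LEN + 1];	/* IPv6 header + ICMPv6 type byte */
	const uint8_t *h;

	if (ethertype == ETHERTYPE_ARP)
		return true;
	if (ethertype != ETHERTYPE_IPV6)
		return false;

	h = header_pointer(pkt, pkt_len, 0, sizeof(buf), buf);
	if (!h)
		return false;
	if (h[6] != NEXTHDR_ICMPV6)	/* byte 6 is the Next Header field */
		return false;

	return h[IPV6_HDR_LEN] == ND_NS || h[IPV6_HDR_LEN] == ND_NA;
}
```

	The only point of the sketch is that the classification needs
read access to a few header bytes and never has to rearrange the packet.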
>> 
>>> +
>>> + return false;
>>> +}
>>> +
>>> /* Use this Xmit function for 3AD as well as XOR modes. The current
>>> * usable slave array is formed in the control path. The xmit function
>>> * just calculates hash and sends the packet out.
>>> @@ -5583,6 +5620,9 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev
>>> case BOND_MODE_ACTIVEBACKUP:
>>> return bond_xmit_activebackup(skb, dev);
>>> case BOND_MODE_8023AD:
>>> + if (bond_should_broadcast_neighbor(skb, dev))
>>> + return bond_xmit_broadcast(skb, dev);
>> 
>> One question that just occurred to me... is bond_xmit_broadcast()
>> actually the correct logic?  It will send the skb to every interface
>> that is a member of the bond, but for the "non stacked" case, I believe
>> the correct logic is to send only to interfaces that are part of the
>> active aggregator.
>> 
>> Stated another way, with broadcast_neighbor enabled, if the bond
>> has interfaces not part of the active aggregator (perhaps a switch port
>> or entire switch not correctly configured for "non stacked" operation),
>> the ARP / NS / NA packet has the potential to update neighbor tables
>> incorrectly.
>If packets are sent to inactive ports, the switch drops them in our
>production environment, because the LACP state of those ports is down.
>I agree with you: with broadcast_neighbor enabled, the bond should send
>these packets only to active ports.

	The situation I'm concerned with is probably outside the scope
of your particular deployment, but would nevertheless be a legal
configuration.

	Bonding can negotiate LACP with multiple separate LACP peers;
traditionally this would be separate switches or separate channel groups
on a single switch.  Either way, bonding will have multiple aggregators
(sets of ports aggregated together via LACP), and select one to be the
active aggregator.  The other aggregators will not be used to pass
traffic, but LACP protocol exchanges still take place.  If the active
aggregator were to fail, then bonding would switch over to another
already established aggregator.

	In this situation, with multiple aggregators, the current
broadcast mode logic would send packets to the non-selected
aggregator(s), which would likely process the traffic.
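
	As a rough illustration of the selection I have in mind, here is
a small userspace model, not bonding code.  struct port,
select_broadcast_ports() and NO_AGGREGATOR are hypothetical names for
this sketch only.  It assumes each port records which aggregator LACP
placed it in, and that the caller knows the id of the active aggregator.

```c
#include <assert.h>
#include <stddef.h>

#define NO_AGGREGATOR (-1)	/* hypothetical: no active aggregator */

/* Hypothetical model of a bond member port. */
struct port {
	int id;
	int aggregator_id;	/* aggregator this port belongs to */
};

/* Write the ids of the ports in the active aggregator to out[] and
 * return how many were written.  A return of 0 means the caller should
 * use the usual 802.3ad transmit path instead of broadcasting.
 * out[] must have room for n entries.
 */
static size_t select_broadcast_ports(const struct port *ports, size_t n,
				     int active_agg, int *out)
{
	size_t count = 0;
	size_t i;

	assert(ports && out);
	if (active_agg == NO_AGGREGATOR)
		return 0;

	for (i = 0; i < n; i++) {
		/* Skip ports that belong to a non-selected aggregator. */
		if (ports[i].aggregator_id == active_agg)
			out[count++] = ports[i].id;
	}
	return count;
}
```

	With two ports in aggregator 1 and two in aggregator 2, and
aggregator 1 active, only the first two ports would receive the ARP / NS /
NA copy; the ports of the standby aggregator would be left alone.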

	-J

>> 
>> Assuming that's correct (that it should exclude interfaces not
>> in the active aggregator), that raises the question of what is the
>> correct behavior if there is no active aggregator, but only ports in
>> individual mode.  I would expect that in that case it should utilize the
>> usual LACP transmit logic (sending on just the individual port that
>> bonding has selected as active).
>> 
>> -J
>> 
>>> + fallthrough;
>>> case BOND_MODE_XOR:
>>> return bond_3ad_xor_xmit(skb, dev);
>>> case BOND_MODE_BROADCAST:
>>> @@ -6462,6 +6502,7 @@ static int __init bond_check_params(struct bond_params *params)
>>> eth_zero_addr(params->ad_actor_system);
>>> params->ad_user_port_key = ad_user_port_key;
>>> params->coupled_control = 1;
>>> + params->broadcast_neighbor = 0;
>>> if (packets_per_slave > 0) {
>>> params->reciprocal_packets_per_slave =
>>> reciprocal_value(packets_per_slave);
>>> diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
>>> index 91893c29b899..7f0939337231 100644
>>> --- a/drivers/net/bonding/bond_options.c
>>> +++ b/drivers/net/bonding/bond_options.c
>>> @@ -87,6 +87,8 @@ static int bond_option_missed_max_set(struct bonding *bond,
>>>       const struct bond_opt_value *newval);
>>> static int bond_option_coupled_control_set(struct bonding *bond,
>>>    const struct bond_opt_value *newval);
>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>> +    const struct bond_opt_value *newval);
>>> 
>>> static const struct bond_opt_value bond_mode_tbl[] = {
>>> { "balance-rr",    BOND_MODE_ROUNDROBIN,   BOND_VALFLAG_DEFAULT},
>>> @@ -240,6 +242,12 @@ static const struct bond_opt_value bond_coupled_control_tbl[] = {
>>> { NULL,  -1, 0},
>>> };
>>> 
>>> +static const struct bond_opt_value bond_broadcast_neigh_tbl[] = {
>>> + { "off", 0, BOND_VALFLAG_DEFAULT},
>>> + { "on",  1, 0},
>>> + { NULL,  -1, 0}
>>> +};
>>> +
>>> static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>> [BOND_OPT_MODE] = {
>>> .id = BOND_OPT_MODE,
>>> @@ -513,6 +521,14 @@ static const struct bond_option bond_opts[BOND_OPT_LAST] = {
>>> .flags = BOND_OPTFLAG_IFDOWN,
>>> .values = bond_coupled_control_tbl,
>>> .set = bond_option_coupled_control_set,
>>> + },
>>> + [BOND_OPT_BROADCAST_NEIGH] = {
>>> + .id = BOND_OPT_BROADCAST_NEIGH,
>>> + .name = "broadcast_neighbor",
>>> + .desc = "Broadcast neighbor packets to all slaves",
>>> + .unsuppmodes = BOND_MODE_ALL_EX(BIT(BOND_MODE_8023AD)),
>>> + .values = bond_broadcast_neigh_tbl,
>>> + .set = bond_option_broadcast_neigh_set,
>>> }
>>> };
>>> 
>>> @@ -1840,3 +1856,22 @@ static int bond_option_coupled_control_set(struct bonding *bond,
>>> bond->params.coupled_control = newval->value;
>>> return 0;
>>> }
>>> +
>>> +static int bond_option_broadcast_neigh_set(struct bonding *bond,
>>> +    const struct bond_opt_value *newval)
>>> +{
>>> + if (bond->params.broadcast_neighbor == newval->value)
>>> + return 0;
>>> +
>>> + bond->params.broadcast_neighbor = newval->value;
>>> + if (bond->dev->flags & IFF_UP) {
>>> + if (bond->params.broadcast_neighbor)
>>> + static_branch_inc(&bond_bcast_neigh_enabled);
>>> + else
>>> + static_branch_dec(&bond_bcast_neigh_enabled);
>>> + }
>>> +
>>> + netdev_dbg(bond->dev, "Setting broadcast_neighbor to %s (%llu)\n",
>>> +    newval->string, newval->value);
>>> + return 0;
>>> +}
>>> diff --git a/include/net/bond_options.h b/include/net/bond_options.h
>>> index 18687ccf0638..022b122a9fb6 100644
>>> --- a/include/net/bond_options.h
>>> +++ b/include/net/bond_options.h
>>> @@ -77,6 +77,7 @@ enum {
>>> BOND_OPT_NS_TARGETS,
>>> BOND_OPT_PRIO,
>>> BOND_OPT_COUPLED_CONTROL,
>>> + BOND_OPT_BROADCAST_NEIGH,
>>> BOND_OPT_LAST
>>> };
>>> 
>>> diff --git a/include/net/bonding.h b/include/net/bonding.h
>>> index 95f67b308c19..e06f0d63b2c1 100644
>>> --- a/include/net/bonding.h
>>> +++ b/include/net/bonding.h
>>> @@ -115,6 +115,8 @@ static inline int is_netpoll_tx_blocked(struct net_device *dev)
>>> #define is_netpoll_tx_blocked(dev) (0)
>>> #endif
>>> 
>>> +DECLARE_STATIC_KEY_FALSE(bond_bcast_neigh_enabled);
>>> +
>>> struct bond_params {
>>> int mode;
>>> int xmit_policy;
>>> @@ -149,6 +151,7 @@ struct bond_params {
>>> struct in6_addr ns_targets[BOND_MAX_NS_TARGETS];
>>> #endif
>>> int coupled_control;
>>> + int broadcast_neighbor;
>>> 
>>> /* 2 bytes of padding : see ether_addr_equal_64bits() */
>>> u8 ad_actor_system[ETH_ALEN + 2];
>>> -- 
>>> 2.34.1

---
	-Jay Vosburgh, jv@...sburgh.net
