Message-ID: <CALOAHbDOMD9+FUw6aog=AKa-6+a_9tuqaQhq3LKXLzFgzBreSg@mail.gmail.com>
Date: Wed, 26 Nov 2025 20:29:57 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Saeed Mahameed <saeedm@...dia.com>, tariqt@...dia.com, mbloch@...dia.com, 
	Nikolay Aleksandrov <razor@...ckwall.org>, idosch@...dia.com, jv@...sburgh.net, 
	andrew+netdev@...n.ch, David Miller <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Cc: netdev <netdev@...r.kernel.org>, bridge@...ts.linux.dev
Subject: [BUG] mlx5e: ARP broadcast over offloaded bridge fails when using a
 bonded LACP interface

Hello mlx5e/bridge/bonding maintainers,

Technical Context
------------------------

In our Kubernetes production environment, we have enabled SR-IOV on
ConnectX-6 DX NICs to allow multiple containers to share a single
Physical Function (PF). In other words, each container is assigned its
own Virtual Function (VF).

We have two CX6-DX NICs on a single host, configured in a bonded
interface. The VFs and the bonding device are connected via a Linux
bridge. Both CX6-DX NICs are operating in switchdev mode, and the
Linux bridge is offloaded. The topology is as follows:

          Container0
               |
         VF representor
               |
    linux bridge (offloaded)
               |
             bond
               |
     +---------+---------+
     |                   |
 eth0 (PF)           eth1 (PF)


Both eth0 and eth1 are in switchdev mode.

  $ cat /sys/class/net/eth0/compat/devlink/mode
   switchdev
  $ cat /sys/class/net/eth1/compat/devlink/mode
   switchdev

This setup follows the guidance provided in the official NVIDIA
documentation [0].

The bond is configured in 802.3ad (LACP) mode. We enabled the ARP
broadcast feature on the bonding device, which is functionally similar
to the patchset referenced here:

   https://lore.kernel.org/all/cover.1750642572.git.tonghao@bamaicloud.com/
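For context, the topology above can be reproduced with roughly the
following commands. This is a sketch only: the VF count, the second PCI
address (0000:21:00.1), and the eth0_5 representor name are
illustrative, and our production setup uses the MLNX_OFED compat
interface rather than the upstream devlink commands shown here.

```shell
# Sketch only -- VF count, 0000:21:00.1, and eth0_5 are illustrative.

# Create VFs, then move both PFs to switchdev mode
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
echo 4 > /sys/class/net/eth1/device/sriov_numvfs
devlink dev eswitch set pci/0000:21:00.0 mode switchdev
devlink dev eswitch set pci/0000:21:00.1 mode switchdev

# Bond the two PFs in 802.3ad (LACP) mode
ip link add bond0 type bond mode 802.3ad
ip link set eth0 down && ip link set eth0 master bond0
ip link set eth1 down && ip link set eth1 master bond0
ip link set eth0 up && ip link set eth1 up && ip link set bond0 up

# Bridge the bond together with the VF representor
ip link add bridge0 type bridge
ip link set bond0 master bridge0
ip link set eth0_5 master bridge0   # representor of Container0's VF
ip link set bridge0 up
```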

Issue Description
------------------------

When pinging another host from Container0, only one ARP request is
actually sent on the wire, even though both eth0 and eth1 appear to
generate ARP requests at the PF level and the doorbell is rung on both.

tcpdump output:

  $ tcpdump -i any -n arp host 10.247.209.128
  17:50:16.309758 eth1_5  B   ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309777 eth0_5  Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309784 bond0   Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309786 eth0    Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309788 eth1    Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
  17:50:16.309758 bridge0 B   ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38

No ARP reply is captured by tcpdump.

However, the ARP broadcast path appears to be working in software: the
ARP request rings the doorbell on both PFs, as traced by the following
bpftrace script:

kprobe:mlx5e_sq_xmit_wqe
{
        $skb = (struct sk_buff *)arg1;

        // skb->protocol is in network byte order:
        // htons(ETH_P_ARP) == 0x0608 on a little-endian host
        if ($skb->protocol != 0x608) {
                return;
        }

        $arph = (struct arphdr *)($skb->head + $skb->network_header);
        // ARP payload after the header:
        // sender MAC (6) + sender IP (4) + target MAC (6) + target IP (4)
        $arp_data = $skb->head + $skb->network_header + sizeof(struct arphdr);
        $smac = $arp_data;
        $sip = $arp_data + 6;
        $tip = $arp_data + 16;

        // only trace requests targeting 10.247.209.128
        if (!($tip[0] == 10 && $tip[1] == 247 && $tip[2] == 209 && $tip[3] == 128)) {
                return;
        }

        $dev = $skb->dev;
        $dev_name = $skb->dev->name;

        printf("Device:%s(%02x:%02x:%02x:%02x:%02x:%02x) [%d] Sender IP:%d.%d.%d.%d (%02x:%02x:%02x:%02x:%02x:%02x), ",
                $dev_name, $dev->dev_addr[0], $dev->dev_addr[1],
                $dev->dev_addr[2], $dev->dev_addr[3],
                $dev->dev_addr[4], $dev->dev_addr[5], $dev->ifindex,
                $sip[0], $sip[1], $sip[2], $sip[3],
                $smac[0], $smac[1], $smac[2], $smac[3], $smac[4], $smac[5]);

        // swap ar_op from network to host byte order
        printf("Target IP:%d.%d.%d.%d, OP:%d\n",
                $tip[0], $tip[1], $tip[2], $tip[3],
                (($arph->ar_op & 0xFF00) >> 8) | (($arph->ar_op & 0x00FF) << 8));
}
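The two byte-order details in the script (matching 0x608 and swapping
ar_op) can be sanity-checked in isolation. The following Python sketch
is not part of the trace; it just replays the same arithmetic:

```python
import struct

ETH_P_ARP = 0x0806     # EtherType for ARP, host byte order
ARPOP_REQUEST = 1      # ARP opcode 1 = request, host byte order

def swap16(v: int) -> int:
    """Swap the two bytes of a 16-bit value, as the bpftrace printf does."""
    return ((v & 0xFF00) >> 8) | ((v & 0x00FF) << 8)

# On a little-endian host, skb->protocol holds htons(ETH_P_ARP) == 0x0608,
# which is why the script compares against 0x608.
print(hex(swap16(ETH_P_ARP)))    # 0x608

# ar_op arrives in network byte order; reading it as a little-endian u16
# gives 0x0100, and swapping yields OP:1 as printed in the output below.
ar_op_wire = struct.unpack("<H", struct.pack(">H", ARPOP_REQUEST))[0]
print(swap16(ar_op_wire))        # 1
```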

bpftrace output:

  Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
  Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
  Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

And the detailed callstack with bpftrace:

Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        maybe_deliver+188
        br_flood+118
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        bond_dev_queue_xmit+43
        __bond_start_xmit+590
        bond_start_xmit+70
        dev_hard_start_xmit+142
        __dev_queue_xmit+1260
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        br_flood+266
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1

        mlx5e_sq_xmit_wqe+1
        dev_hard_start_xmit+142
        sch_direct_xmit+161
        __dev_xmit_skb+482
        __dev_queue_xmit+637
        bond_dev_queue_xmit+43
        __bond_start_xmit+590
        bond_start_xmit+70
        dev_hard_start_xmit+142
        __dev_queue_xmit+1260
        br_dev_queue_push_xmit+194
        br_forward_finish+83
        br_nf_hook_thresh+220
        br_nf_forward_finish+381
        br_nf_forward_arp+647
        nf_hook_slow+65
        __br_forward+214
        br_flood+266
        br_handle_frame_finish+421
        br_handle_frame+781
        __netif_receive_skb_core.constprop.0+651
        __netif_receive_skb_list_core+291
        netif_receive_skb_list_internal+459
        napi_complete_done+122
        mlx5e_napi_poll+358
        __napi_poll.constprop.0+46
        net_rx_action+680
        __do_softirq+271
        irq_exit_rcu+82
        common_interrupt+142
        asm_common_interrupt+39
        cpuidle_enter_state+237
        cpuidle_enter+52
        cpuidle_idle_call+261
        do_idle+124
        cpu_startup_entry+32
        start_secondary+296
        secondary_startup_64_no_verify+229

Additionally, traffic captured on the uplink switch confirms that ARP
requests are only being sent via one NIC.

This suggests a potential issue with bridge offloading. However, we
lack the capability to trace hardware-level behavior directly.
Notably, `ethtool -S ethX` shows no packet drops or errors.
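The software-visible state we know how to query is sketched below.
These are starting points rather than verified commands for our exact
setup: counter names differ across mlx5 driver/firmware versions, and
bridge0/eth0_5 are the names from the topology above.

```shell
# Hedged diagnostics sketch -- counter names vary by mlx5 driver/firmware.

# Per-vport and uplink counters exposed by the driver
ethtool -S eth0 | grep -Ei 'vport|uplink|drop|discard'
ethtool -S eth1 | grep -Ei 'vport|uplink|drop|discard'

# eswitch mode and bridge offload state
devlink dev eswitch show pci/0000:21:00.0
bridge -d link show

# FDB entries the offloaded bridge has learned (offloaded ones are flagged)
bridge fdb show br bridge0 | grep -i offload
```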

Questions
--------------

1. How can we further trace hardware behavior to diagnose this issue?
2. Is this a known limitation of bridge offloading in this configuration?
3. Are there any recommended solutions or workarounds?

This issue is reliably reproducible. We are willing to recompile the
mlx5 driver with additional instrumentation if more information is needed.

Current driver version:

  $ ethtool -i eth0
  driver: mlx5_core
  version: 24.10-1.1.4
  firmware-version: 22.43.2026 (MT_0000000359)
  expansion-rom-version:
  bus-info: 0000:21:00.0
  supports-statistics: yes
  supports-test: yes
  supports-eeprom-access: no
  supports-register-dump: no
  supports-priv-flags: yes

[0] https://docs.nvidia.com/networking/display/public/sol/technology+preview+of+kubernetes+cluster+deployment+with+accelerated+bridge+cni+and+nvidia+ethernet+networking#src-119742530_TechnologyPreviewofKubernetesClusterDeploymentwithAcceleratedBridgeCNIandNVIDIAEthernetNetworking-Application

-- 
Regards
Yafang
