Message-ID: <CALOAHbDOMD9+FUw6aog=AKa-6+a_9tuqaQhq3LKXLzFgzBreSg@mail.gmail.com>
Date: Wed, 26 Nov 2025 20:29:57 +0800
From: Yafang Shao <laoar.shao@...il.com>
To: Saeed Mahameed <saeedm@...dia.com>, tariqt@...dia.com, mbloch@...dia.com,
Nikolay Aleksandrov <razor@...ckwall.org>, idosch@...dia.com, jv@...sburgh.net,
andrew+netdev@...n.ch, David Miller <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Cc: netdev <netdev@...r.kernel.org>, bridge@...ts.linux.dev
Subject: [BUG] mlx5e: ARP broadcast over offloaded bridge fails when using a
bonded LACP interface
Hello mlx5e/bridge/bonding maintainers,
Technical Context
------------------------
In our Kubernetes production environment, we have enabled SR-IOV on
ConnectX-6 DX NICs to allow multiple containers to share a single
Physical Function (PF). In other words, each container is assigned its
own Virtual Function (VF).
We have two CX6-DX NICs on a single host, configured in a bonded
interface. The VFs and the bonding device are connected via a Linux
bridge. Both CX6-DX NICs are operating in switchdev mode, and the
Linux bridge is offloaded. The topology is as follows:
Container0
|
VF representor
|
linux bridge (offloaded)
|
bond
|
+----------------------+
| |
eth0 (PF) eth1 (PF)
Both eth0 and eth1 are in switchdev mode.
$ cat /sys/class/net/eth0/compat/devlink/mode
switchdev
$ cat /sys/class/net/eth1/compat/devlink/mode
switchdev
This setup follows the guidance provided in the official NVIDIA
documentation [0].
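For reference, a minimal sketch of the way such a topology is typically
brought up (the PCI addresses and VF count below are illustrative, not
our exact values):
# Create one VF per PF, then switch both PFs to switchdev mode
$ echo 1 > /sys/class/net/eth0/device/sriov_numvfs
$ echo 1 > /sys/class/net/eth1/device/sriov_numvfs
$ devlink dev eswitch set pci/0000:21:00.0 mode switchdev
$ devlink dev eswitch set pci/0000:21:00.1 mode switchdev
# Bond the two PFs in 802.3ad (LACP) mode
$ ip link add bond0 type bond mode 802.3ad
$ ip link set eth0 down && ip link set eth0 master bond0
$ ip link set eth1 down && ip link set eth1 master bond0
$ ip link set bond0 up
# Create the bridge and enslave the bond and the VF representor
$ ip link add bridge0 type bridge
$ ip link set bond0 master bridge0
$ ip link set eth0_5 master bridge0
$ ip link set bridge0 up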
The bond is configured in 802.3ad (LACP) mode. We enabled the ARP
broadcast feature on the bonding device, which is functionally similar
to the patchset referenced here:
https://lore.kernel.org/all/cover.1750642572.git.tonghao@bamaicloud.com/
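For completeness, the feature is enabled on the bond with a single
knob; a sketch, assuming the option is exposed as `broadcast_neighbor`
the way the patchset above exposes it (our internal build may name it
differently):
# Broadcast ARP/ND on all active 802.3ad slaves instead of hashing to one
$ echo 1 > /sys/class/net/bond0/bonding/broadcast_neighbor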
Issue Description
------------------------
When pinging another host from Container0, we observe that only one
ARP request actually leaves the host, even though both eth0 and eth1
appear to generate ARP requests at the PF level and the doorbell is
rung for each.
tcpdump output:
$ tcpdump -i any -n arp host 10.247.209.128
17:50:16.309758 eth1_5 B ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
17:50:16.309777 eth0_5 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
17:50:16.309784 bond0 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
17:50:16.309786 eth0 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
17:50:16.309788 eth1 Out ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
17:50:16.309758 bridge0 B ARP, Request who-has 10.247.209.128 tell 10.247.209.140, length 38
No ARP reply is captured by tcpdump.
However, the ARP broadcast itself appears to be functional: the ARP
requests on both eth0 and eth1 reach the driver's transmit path and
ring the doorbell, as traced by the following bpftrace script:
// Trace ARP packets entering the mlx5e transmit path
// (struct layouts are resolved from kernel BTF)
kprobe:mlx5e_sq_xmit_wqe
{
	$skb = (struct sk_buff *)arg1;
	// skb->protocol is big-endian; 0x0608 is htons(ETH_P_ARP) on x86
	if ($skb->protocol != 0x608) {
		return;
	}

	$arph = (struct arphdr *)($skb->head + $skb->network_header);
	// For Ethernet/IPv4 ARP the payload follows the 8-byte header:
	// sender MAC (6), sender IP (4), target MAC (6), target IP (4)
	$arp_data = $skb->head + $skb->network_header + sizeof(struct arphdr);
	$smac = $arp_data;
	$sip = $arp_data + 6;
	$tip = $arp_data + 16;

	// Only trace requests for the address we are pinging: 10.247.209.128
	if (!($tip[0] == 10 && $tip[1] == 247 && $tip[2] == 209 && $tip[3] == 128)) {
		return;
	}

	$dev = $skb->dev;
	$dev_name = $skb->dev->name;
	printf("Device:%s(%02x:%02x:%02x:%02x:%02x:%02x) [%d] ",
	    $dev_name, $dev->dev_addr[0], $dev->dev_addr[1],
	    $dev->dev_addr[2], $dev->dev_addr[3],
	    $dev->dev_addr[4], $dev->dev_addr[5], $dev->ifindex);
	printf("Sender IP:%d.%d.%d.%d (%02x:%02x:%02x:%02x:%02x:%02x), ",
	    $sip[0], $sip[1], $sip[2], $sip[3],
	    $smac[0], $smac[1], $smac[2], $smac[3], $smac[4], $smac[5]);
	// ar_op is big-endian; byte-swap it by hand (ntohs)
	printf("Target IP:%d.%d.%d.%d, OP:%d\n",
	    $tip[0], $tip[1], $tip[2], $tip[3],
	    (($arph->ar_op & 0xFF00) >> 8) | (($arph->ar_op & 0x00FF) << 8));
}
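We run the script with plain bpftrace while issuing the ping from
Container0 (trace_arp.bt is an illustrative file name; root privileges
and a BTF-enabled kernel are assumed):
$ bpftrace trace_arp.bt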
bpftrace output:
Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
And the detailed call stacks captured with bpftrace:
Device:eth0_5(ba:d2:2d:ff:80:82) [68] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
mlx5e_sq_xmit_wqe+1
dev_hard_start_xmit+142
sch_direct_xmit+161
__dev_xmit_skb+482
__dev_queue_xmit+637
br_dev_queue_push_xmit+194
br_forward_finish+83
br_nf_hook_thresh+220
br_nf_forward_finish+381
br_nf_forward_arp+647
nf_hook_slow+65
__br_forward+214
maybe_deliver+188
br_flood+118
br_handle_frame_finish+421
br_handle_frame+781
__netif_receive_skb_core.constprop.0+651
__netif_receive_skb_list_core+291
netif_receive_skb_list_internal+459
napi_complete_done+122
mlx5e_napi_poll+358
__napi_poll.constprop.0+46
net_rx_action+680
__do_softirq+271
irq_exit_rcu+82
common_interrupt+142
asm_common_interrupt+39
cpuidle_enter_state+237
cpuidle_enter+52
cpuidle_idle_call+261
do_idle+124
cpu_startup_entry+32
start_secondary+296
secondary_startup_64_no_verify+229
Device:eth0(e0:9d:73:c3:d2:3e) [2] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
mlx5e_sq_xmit_wqe+1
dev_hard_start_xmit+142
sch_direct_xmit+161
__dev_xmit_skb+482
__dev_queue_xmit+637
bond_dev_queue_xmit+43
__bond_start_xmit+590
bond_start_xmit+70
dev_hard_start_xmit+142
__dev_queue_xmit+1260
br_dev_queue_push_xmit+194
br_forward_finish+83
br_nf_hook_thresh+220
br_nf_forward_finish+381
br_nf_forward_arp+647
nf_hook_slow+65
__br_forward+214
br_flood+266
br_handle_frame_finish+421
br_handle_frame+781
__netif_receive_skb_core.constprop.0+651
__netif_receive_skb_list_core+291
netif_receive_skb_list_internal+459
napi_complete_done+122
mlx5e_napi_poll+358
__napi_poll.constprop.0+46
net_rx_action+680
__do_softirq+271
irq_exit_rcu+82
common_interrupt+142
asm_common_interrupt+39
cpuidle_enter_state+237
cpuidle_enter+52
cpuidle_idle_call+261
do_idle+124
cpu_startup_entry+32
start_secondary+296
secondary_startup_64_no_verify+229
Device:eth1(e0:9d:73:c3:d2:3e) [3] Sender IP:10.247.209.140 (06:99:0f:34:b1:15), Target IP:10.247.209.128, OP:1
mlx5e_sq_xmit_wqe+1
dev_hard_start_xmit+142
sch_direct_xmit+161
__dev_xmit_skb+482
__dev_queue_xmit+637
bond_dev_queue_xmit+43
__bond_start_xmit+590
bond_start_xmit+70
dev_hard_start_xmit+142
__dev_queue_xmit+1260
br_dev_queue_push_xmit+194
br_forward_finish+83
br_nf_hook_thresh+220
br_nf_forward_finish+381
br_nf_forward_arp+647
nf_hook_slow+65
__br_forward+214
br_flood+266
br_handle_frame_finish+421
br_handle_frame+781
__netif_receive_skb_core.constprop.0+651
__netif_receive_skb_list_core+291
netif_receive_skb_list_internal+459
napi_complete_done+122
mlx5e_napi_poll+358
__napi_poll.constprop.0+46
net_rx_action+680
__do_softirq+271
irq_exit_rcu+82
common_interrupt+142
asm_common_interrupt+39
cpuidle_enter_state+237
cpuidle_enter+52
cpuidle_idle_call+261
do_idle+124
cpu_startup_entry+32
start_secondary+296
secondary_startup_64_no_verify+229
Additionally, traffic captured on the uplink switch confirms that the
ARP requests reach the wire through only one of the two NICs.
This suggests a problem in the bridge offload path. However, we lack
the tooling to trace hardware-level behavior directly.
Notably, `ethtool -S ethX` shows no packet drops or errors.
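From the software side, the checks available to us stop at the
bridge/TC layer; for reference, this is the kind of state we can
inspect (interface names as in the topology above):
# Confirm the bridge FDB entries learned via the bond are marked offloaded
$ bridge -d fdb show br bridge0
# Check per-port flags (flood, learning, ...) on the bridge ports
$ bridge -d link show
# Inspect hardware-offloaded TC rules and their packet counters on the PFs
$ tc -s filter show dev eth0 ingress
$ tc -s filter show dev eth1 ingress
# Compare PF software counters with eswitch/vport and PHY counters
$ ethtool -S eth0 | grep -iE 'vport|tx_packets_phy'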
Questions
--------------
1. How can we further trace hardware behavior to diagnose this issue?
2. Is this a known limitation of bridge offloading in this configuration?
3. Are there any recommended solutions or workarounds?
This issue is reliably reproducible. We are willing to recompile the
mlx5 driver if additional instrumentation is needed.
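On our side, we can also enable dynamic debug in the driver's
bridge-offload code without recompiling; a sketch, assuming the
relevant file is drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c
as in the upstream tree (our OFED build may lay the sources out
differently):
# Enable pr_debug()/dev_dbg() in the mlx5 eswitch bridge offload code
$ echo 'file drivers/net/ethernet/mellanox/mlx5/core/esw/bridge.c +p' \
      > /sys/kernel/debug/dynamic_debug/control
# Follow the kernel log while reproducing the ping from Container0
$ dmesg -wT | grep -i mlx5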
Current driver version:
$ ethtool -i eth0
driver: mlx5_core
version: 24.10-1.1.4
firmware-version: 22.43.2026 (MT_0000000359)
expansion-rom-version:
bus-info: 0000:21:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
[0] https://docs.nvidia.com/networking/display/public/sol/technology+preview+of+kubernetes+cluster+deployment+with+accelerated+bridge+cni+and+nvidia+ethernet+networking#src-119742530_TechnologyPreviewofKubernetesClusterDeploymentwithAcceleratedBridgeCNIandNVIDIAEthernetNetworking-Application
--
Regards
Yafang