Message-ID: <CAA85sZvfgZinRUbsXWJeS1kHojb3eZK_T-oanQTvtmLCgd98Lg@mail.gmail.com>
Date: Fri, 28 Feb 2025 13:55:38 +0100
From: Ian Kumlien <ian.kumlien@...il.com>
To: Nikolay Aleksandrov <razor@...ckwall.org>
Cc: netdev@...r.kernel.org, Ajit Khaparde <ajit.khaparde@...adcom.com>,
Sriharsha Basavapatna <sriharsha.basavapatna@...adcom.com>,
Somnath Kotur <somnath.kotur@...adcom.com>, Andrew Lunn <andrew+netdev@...n.ch>, davem@...emloft.net,
edumazet@...gle.com, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>
Subject: Re: [PATCH net] be2net: fix sleeping while atomic bugs in be_ndo_bridge_getlink
On Fri, Feb 28, 2025 at 1:49 PM Nikolay Aleksandrov <razor@...ckwall.org> wrote:
>
> On 2/28/25 14:46, Ian Kumlien wrote:
> > Actually, while you might already have realized this, I didn't quite
> > understand how important this fix seems to be....
> >
>
> You mean the be2net would send broken packets to this other machine with mlx5 card?
> Or did I misunderstand you?
That is correct. The packets are UDP, so I assume it's the WireGuard tunnel...
> > From another machine i found this:
> > [lör feb 22 23:46:32 2025] mlx5_core 0000:02:00.1 enp2s0f1np1: hw csum failure
> > [lör feb 22 23:46:32 2025] skb len=2488 headroom=78 headlen=1480 tailroom=0
> > mac=(64,14) mac_len=14 net=(78,20) trans=98
> > shinfo(txflags=0 nr_frags=0 gso(size=1452
> > type=393216 segs=2))
> > csum(0x2baef95d start=63837 offset=11182
> > ip_summed=2 complete_sw=0 valid=0 level=0)
> > hash(0xb9a84019 sw=0 l4=1) proto=0x0800
> > pkttype=0 iif=8
> > priority=0x0 mark=0x0 alloc_cpu=1 vlan_all=0x0
> > encapsulation=0 inner(proto=0x0000, mac=0,
> > net=0, trans=0)
> > [lör feb 22 23:46:32 2025] dev name=enp2s0f1np1 feat=0x0e12a1c21cd14ba9
> >
> > And:
> > [lör feb 22 23:46:33 2025] skb fraglist:
> > [lör feb 22 23:46:33 2025] skb len=1008 headroom=106 headlen=1008 tailroom=38
> > mac=(64,14) mac_len=14 net=(78,20) trans=98
> > shinfo(txflags=0 nr_frags=0 gso(size=0
> > type=0 segs=0))
> > csum(0x86f9 start=34553 offset=0
> > ip_summed=2 complete_sw=0 valid=0 level=0)
> > hash(0xb9a84019 sw=0 l4=1) proto=0x0800
> > pkttype=0 iif=0
> > priority=0x0 mark=0x0 alloc_cpu=1 vlan_all=0x0
> > encapsulation=0 inner(proto=0x0000, mac=0,
> > net=0, trans=0)
> > [lör feb 22 23:46:33 2025] dev name=enp2s0f1np1 feat=0x0e12a1c21cd14ba9
> >
> > Including:
> > [lör feb 22 23:46:34 2025] CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not
> > tainted 6.13.4 #449
> > [lör feb 22 23:46:34 2025] Hardware name: Supermicro Super
> > Server/A2SDi-12C-HLN4F, BIOS 1.9a 12/25/2023
> > [lör feb 22 23:46:34 2025] Call Trace:
> > [lör feb 22 23:46:34 2025] <IRQ>
> > [lör feb 22 23:46:34 2025] dump_stack_lvl+0x47/0x70
> > [lör feb 22 23:46:34 2025] __skb_checksum_complete+0xda/0xf0
> > [lör feb 22 23:46:34 2025] ? __pfx_csum_partial_ext+0x10/0x10
> > [lör feb 22 23:46:34 2025] ? __pfx_csum_block_add_ext+0x10/0x10
> > [lör feb 22 23:46:34 2025] nf_conntrack_udp_packet+0x171/0x260
> > [lör feb 22 23:46:34 2025] nf_conntrack_in+0x391/0x590
> > [lör feb 22 23:46:34 2025] nf_hook_slow+0x3c/0xf0
> > [lör feb 22 23:46:34 2025] nf_hook_slow_list+0x70/0xf0
> > [lör feb 22 23:46:34 2025] ip_sublist_rcv+0x1ee/0x200
> > [lör feb 22 23:46:34 2025] ? __pfx_ip_rcv_finish+0x10/0x10
> > [lör feb 22 23:46:34 2025] ip_list_rcv+0xf8/0x130
> > [lör feb 22 23:46:34 2025] __netif_receive_skb_list_core+0x24c/0x270
> > [lör feb 22 23:46:34 2025] netif_receive_skb_list_internal+0x18f/0x2b0
> > [lör feb 22 23:46:34 2025] ? mlx5e_handle_rx_cqe_mpwrq+0x116/0x210
> > [lör feb 22 23:46:34 2025] napi_complete_done+0x65/0x260
> > [lör feb 22 23:46:34 2025] mlx5e_napi_poll+0x172/0x760
> > [lör feb 22 23:46:34 2025] __napi_poll+0x26/0x160
> > [lör feb 22 23:46:34 2025] net_rx_action+0x173/0x300
> > [lör feb 22 23:46:34 2025] ? notifier_call_chain+0x54/0xc0
> > [lör feb 22 23:46:34 2025] ? atomic_notifier_call_chain+0x30/0x40
> > [lör feb 22 23:46:34 2025] handle_softirqs+0xcd/0x270
> > [lör feb 22 23:46:34 2025] irq_exit_rcu+0x85/0xa0
> > [lör feb 22 23:46:34 2025] common_interrupt+0x81/0xa0
> > [lör feb 22 23:46:34 2025] </IRQ>
> > [lör feb 22 23:46:34 2025] <TASK>
> > [lör feb 22 23:46:34 2025] asm_common_interrupt+0x22/0x40
> > [lör feb 22 23:46:34 2025] RIP: 0010:cpuidle_enter_state+0xbc/0x430
> > [lör feb 22 23:46:34 2025] Code: 77 02 00 00 e8 65 31 ec fe e8 60 f8
> > ff ff 49 89 c5 0f 1f 44 00 00 31 ff e8 a1 68 eb fe 45 84 ff 0f 85 49
> > 02 00 00 fb 45 85 f6 <0f> 88 8d 01 00 00 49 63 ce 4c 8b 14 24 48 8d 04
> > 49 48 8d 14 81 48
> > [lör feb 22 23:46:34 2025] RSP: 0018:ffffb504000b7e88 EFLAGS: 00000202
> > [lör feb 22 23:46:34 2025] RAX: ffff9c0a2fa40000 RBX: ffff9c0a2fa76e60
> > RCX: 0000000000000000
> > [lör feb 22 23:46:34 2025] RDX: 0000252e1dcfee30 RSI: fffffff3c1a65ecc
> > RDI: 0000000000000000
> > [lör feb 22 23:46:34 2025] RBP: 0000000000000002 R08: 0000000000000000
> > R09: 00000000000001f6
> > [lör feb 22 23:46:34 2025] R10: 0000000000000018 R11: ffff9c0a2fa6c3ac
> > R12: ffffffffaac2de60
> > [lör feb 22 23:46:34 2025] R13: 0000252e1dcfee30 R14: 0000000000000002
> > R15: 0000000000000000
> > [lör feb 22 23:46:34 2025] ? cpuidle_enter_state+0xaf/0x430
> > [lör feb 22 23:46:34 2025] cpuidle_enter+0x24/0x40
> > [lör feb 22 23:46:34 2025] do_idle+0x16e/0x1b0
> > [lör feb 22 23:46:34 2025] cpu_startup_entry+0x20/0x30
> > [lör feb 22 23:46:34 2025] start_secondary+0xf3/0x100
> > [lör feb 22 23:46:34 2025] common_startup_64+0x13e/0x148
> > [lör feb 22 23:46:34 2025] </TASK>
> > ---
> >
> > Asking gemini for help identified the machine in the basement as the
> > culprit - so it seems like it could send corrupt data - i haven't had
> > a closer look though
> >
>
> Interesting. :)
Yeah... While I don't trust AI as such, taking this at face value
it's quite interesting:
---
What we can tell:
* The packet is an IPv4 UDP packet.
* The hardware checksum failure indicates a potential data corruption
issue. This could be caused by:
** A faulty network cable.
** A problem with the network card itself.
** Issues with network switches or routers along the path.
** Software bugs.
* The packet was large and was handled using GSO.
* The hex dump allows for a very deep packet inspection, in order to
diagnose the problem.
* The IP source and destination addresses can be obtained from the hex
dump, and then the traffic can be analysed.
---
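Some of the fields in the skb dump quoted above can also be decoded by hand. The following is only a minimal sketch: the helper names (`decode_gso_type`, `decode_skb_dump`) are made up for illustration, and the `SKB_GSO_*` bit positions are an assumption based on include/linux/skbuff.h around 6.13, so they should be checked against the tree actually in use.

```python
import re

# ip_summed values as defined in include/linux/skbuff.h
IP_SUMMED = {
    0: "CHECKSUM_NONE",
    1: "CHECKSUM_UNNECESSARY",
    2: "CHECKSUM_COMPLETE",
    3: "CHECKSUM_PARTIAL",
}

# SKB_GSO_* flags by bit position (assumed layout, verify against your kernel)
GSO_BITS = {
    0: "SKB_GSO_TCPV4",
    1: "SKB_GSO_DODGY",
    2: "SKB_GSO_TCP_ECN",
    3: "SKB_GSO_TCP_FIXEDID",
    4: "SKB_GSO_TCPV6",
    5: "SKB_GSO_FCOE",
    6: "SKB_GSO_GRE",
    7: "SKB_GSO_GRE_CSUM",
    8: "SKB_GSO_IPXIP4",
    9: "SKB_GSO_IPXIP6",
    10: "SKB_GSO_UDP_TUNNEL",
    11: "SKB_GSO_UDP_TUNNEL_CSUM",
    12: "SKB_GSO_PARTIAL",
    13: "SKB_GSO_TUNNEL_REMCSUM",
    14: "SKB_GSO_SCTP",
    15: "SKB_GSO_ESP",
    16: "SKB_GSO_UDP",
    17: "SKB_GSO_UDP_L4",
    18: "SKB_GSO_FRAGLIST",
}


def decode_gso_type(value):
    """Return the names of all GSO flags set in value, lowest bit first."""
    return [name for bit, name in sorted(GSO_BITS.items()) if value & (1 << bit)]


def decode_skb_dump(text):
    """Pull ip_summed and the gso type out of skb dump text and name them."""
    result = {}
    summed = re.search(r"\bip_summed=(\d+)", text)
    if summed:
        code = int(summed.group(1))
        result["ip_summed"] = IP_SUMMED.get(code, "unknown (%d)" % code)
    # \b keeps this from matching the unrelated "pkttype=" field
    gso = re.search(r"\btype=(\d+)", text)
    if gso:
        result["gso_type"] = decode_gso_type(int(gso.group(1)))
    return result


if __name__ == "__main__":
    sample = (
        "shinfo(txflags=0 nr_frags=0 gso(size=1452\n"
        "type=393216 segs=2))\n"
        "csum(0x2baef95d start=63837 offset=11182\n"
        "ip_summed=2 complete_sw=0 valid=0 level=0)\n"
        "pkttype=0 iif=8\n"
    )
    print(decode_skb_dump(sample))
```

Under those assumptions, type=393216 (0x60000) in the first dump decodes to SKB_GSO_UDP_L4 | SKB_GSO_FRAGLIST, and ip_summed=2 is CHECKSUM_COMPLETE.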
> > On Thu, Feb 27, 2025 at 5:41 PM Nikolay Aleksandrov <razor@...ckwall.org> wrote:
> >>