Date: Thu, 11 Apr 2024 14:23:21 +0800
From: dracoding <dracodingfly@...il.com>
To: eric.dumazet@...il.com
Cc: edumazet@...gle.com,
	herbert@...dor.apana.org.au,
	jpiotrowski@...ux.microsoft.com,
	linux-kernel@...r.kernel.org,
	netdev@...r.kernel.org,
	seh@...ix.com
Subject: Re: kernel BUG at net/core/skbuff.c:4219

From: Jeremi Piotrowski <jpiotrowski@...ux.microsoft.com>

> On Tue, Oct 11, 2022 at 10:57:05AM -0700, Eric Dumazet wrote:
> > 
> > On 10/11/22 09:56, Jeremi Piotrowski wrote:
> > >Hi,
> > >
> > >One of our Flatcar users has been hitting the kernel BUG in the subject line
> > >for the past year (https://github.com/flatcar/Flatcar/issues/378). This was
> > >first reported when on 5.10.25, but has been happening across kernel updates,
> > >most recently with 5.15.63. The nodes where this happens are AWS EC2 instances,
> > >using ENA and calico networking in eBPF mode with VXLAN encapsulation. When
> > >GRO/GSO is enabled, the host hits this bug and prints the following stacktrace:
> > 
> > 
> > I suspect eBPF code lowers gso_size ?
> > 
> > gso stack is not able to arbitrarily segment a GRO packet after
> > gso_size being changed.
> > 
> > 
> 
> This was a good hint, see Tomas' response for some more observations.
> 
> This appears to still be happening with Calico v3.23 which started passing
> BPF_F_ADJ_ROOM_FIXED_GSO to bpf_skb_adjust_room() on the decap (rx) path.
> BPF_F_ADJ_ROOM_FIXED_GSO is not passed on the encap (tx) path. It is enough to
> disable GRO to stop the BUG from being hit though, so there must be more going
> on here ? (since the rx path does not change gso_size any longer).
>

Hi,

I encountered a similar error with Calico v3.24.5.
The kernel crashed at BUG_ON(skb_headlen(list_skb) > len) with the following stack trace,
but I don't know how to reproduce it.

    [exception RIP: skb_segment+3016]
    RIP: ffffffffb97df2a8  RSP: ffffa3f2cce08728  RFLAGS: 00010293
    RAX: 000000000000007d  RBX: 00000000fffff7b3  RCX: 0000000000000011
    RDX: 0000000000000000  RSI: ffff895ea32c76c0  RDI: 00000000000008c1
    RBP: ffffa3f2cce087f8   R8: 000000000000088f   R9: 0000000000000011
    R10: 000000000000090c  R11: ffff895e47e68000  R12: ffff895eb2022f00
    R13: 000000000000004b  R14: ffff895ecdaf2000  R15: ffff895eb2023f00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffffa3f2cce08720] skb_segment at ffffffffb97ded63
#10 [ffffa3f2cce08800] tcp_gso_segment at ffffffffb98d0320
#11 [ffffa3f2cce08860] tcp4_gso_segment at ffffffffb98d07a3
#12 [ffffa3f2cce08880] inet_gso_segment at ffffffffb98e6de0
#13 [ffffa3f2cce088e0] skb_mac_gso_segment at ffffffffb97f3741
#14 [ffffa3f2cce08918] skb_udp_tunnel_segment at ffffffffb98daa59
#15 [ffffa3f2cce08980] udp4_ufo_fragment at ffffffffb98db471
#16 [ffffa3f2cce089b0] inet_gso_segment at ffffffffb98e6de0
#17 [ffffa3f2cce08a10] skb_mac_gso_segment at ffffffffb97f3741
#18 [ffffa3f2cce08a48] __skb_gso_segment at ffffffffb97f388e
#19 [ffffa3f2cce08a78] validate_xmit_skb at ffffffffb97f3d6e
#20 [ffffa3f2cce08ab8] __dev_queue_xmit at ffffffffb97f4614
#21 [ffffa3f2cce08b50] dev_queue_xmit at ffffffffb97f5030
#22 [ffffa3f2cce08b60] __bpf_redirect at ffffffffb98199a8
#23 [ffffa3f2cce08b88] skb_do_redirect at ffffffffb98205cd
#24 [ffffa3f2cce08bb8] __netif_receive_skb_core at ffffffffb97f6585
#25 [ffffa3f2cce08c68] __netif_receive_skb_list_core at ffffffffb97f6c0a
#26 [ffffa3f2cce08ce8] netif_receive_skb_list_internal at ffffffffb97f6f6a
#27 [ffffa3f2cce08d60] gro_normal_list at ffffffffb97f717e
#28 [ffffa3f2cce08d80] gro_normal_one at ffffffffb97f721c
#29 [ffffa3f2cce08db8] napi_gro_complete at ffffffffb97f72ac
#30 [ffffa3f2cce08de0] napi_gro_flush at ffffffffb97f73c1
#31 [ffffa3f2cce08e30] napi_complete_done at ffffffffb97f7d1e
#32 [ffffa3f2cce08e60] ice_napi_poll at ffffffffc0477dd6 [ice]
#33 [ffffa3f2cce08ec0] __napi_poll at ffffffffb97f823e
#34 [ffffa3f2cce08ef0] net_rx_action at ffffffffb97f86f1
#35 [ffffa3f2cce08f70] __softirqentry_text_start at ffffffffb9e000dd
#36 [ffffa3f2cce08fd8] irq_exit_rcu at ffffffffb9096074
#37 [ffffa3f2cce08ff0] common_interrupt at ffffffffb9a3272a

The gso_size is 75, which may be the result of bpf_skb_adjust_room() subtracting 50 (the VXLAN header length)?
The frag_list has one element, whose head_frag is 1. The skb_shared_info struct is as follows.

struct skb_shared_info {
    nr_frags = 17 '\021',
    gso_size = 75,
    gso_segs = 0,
    frag_list = 0xffff895eb2022f00,
    gso_type = 1035,
    destructor_arg = 0x2d656c6261747372,
    frags = {{
        bv_page = 0xfffff80e86d4d180,
        bv_len = 125,
        bv_offset = 2306
      },
    ....
    }
}

Does anyone have any suggestions other than disabling GRO/GSO? Can the BPF_F_ADJ_ROOM_FIXED_GSO
flag be enabled on the encap path? I'd be happy to provide more information if needed.

fred
