lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Date:   Tue, 21 Jun 2022 17:58:52 +0100
From:   Frank Hofmann <fhofmann@...udflare.com>
To:     netdev@...r.kernel.org
Cc:     kernel-team <kernel-team@...udflare.com>
Subject: [Q]: "kernel BUG at net/core/skbuff.c:2185!" - known issue ?

Hi,

we're seeing BUG splats with stacks like:

[6514198.051700][   C75] ------------[ cut here ]------------
[6514198.066833][   C75] kernel BUG at net/core/skbuff.c:2194!
[6514198.081919][   C75] invalid opcode: 0000 [#1] SMP NOPTI
[6514198.096676][   C75] CPU: 75 PID: 0 Comm: swapper/75 Tainted: G
       O      5.15.32-cloudflare-2022.3.17 #1 [6514198.125512][   C75]
Hardware name: HYVE EDGE-METAL-GEN10/HS-1811DLite1, BIOS V2.80-sig
03/21/2022 [6514198.152869][   C75] RIP:
0010:__pskb_pull_tail+0x3b6/0x3d0 [6514198.167637][   C75] Code: 34 3a
e8 6d fc ff ff 48 8b 7c 24 08 48 85 c0 75 b9 48 89 df e8 cb cf ff ff
48 83 c4 10 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 4a 8d 14 06
44 0f b6 7a 02 e9 94 fe ff ff 4c 89 f7 31 db e9 [6514198.214396][
C75] RSP: 0018:ffff93accd5fc9a8 EFLAGS: 00010282 [6514198.229430][
C75] RAX: 00000000fffffff2 RBX: 00000000000005e8 RCX: 00000000000005b4
[6514198.246318][   C75] RDX: ffff91ac797d2100 RSI: ffff91ac797d2000
RDI: 0000000000000ec0 [6514198.263075][   C75] RBP: ffff93accd5fc9e0
R08: 0000000000001000 R09: 0000000000000001 [6514198.263079][   C75]
R10: ffff91ac797d2000 R11: 0000000000000002 R12: ffff9190c0b08ae0
[6514198.263081][   C75] R13: 00000000000005b4 R14: ffffffffc05673b8
R15: ffff93accd5fcb60 [6514198.263082][   C75] FS:
0000000000000000(0000) GS:ffff91b03fcc0000(0000)
knlGS:0000000000000000 [6514198.263084][   C75] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033 [6514198.263086][   C75] CR2:
000000c002d8d000 CR3: 0000002c8aaa8000 CR4: 0000000000350ee0
[6514198.263088][   C75] Call Trace: [6514198.263091][   C75]  <IRQ>
[6514198.390262][   C75]  skb_ensure_writable+0x85/0xa0
[6514198.402710][   C75]  tcpmss_mangle_packet+0x77/0x4d0 [xt_TCPMSS]
[6514198.402719][   C75]  ? ip_set_test+0xaa/0x170 [ip_set]
[6514198.428794][   C75]  ? set_match_v4+0xa0/0xd0 [xt_set]
[6514198.441327][   C75]  tcpmss_tg4+0x31/0x9b [xt_TCPMSS]
[6514198.441335][   C75]  ipt_do_table+0x300/0x650 [ip_tables]
[6514198.465724][   C75]  nf_hook_slow+0x41/0xb0 [6514198.476603][
C75]  ip_output+0xdb/0x120 [6514198.476611][   C75]  ?
__ip_finish_output+0x1a0/0x1a0 [6514198.476615][   C75]
__ip_queue_xmit+0x172/0x400 [6514198.509393][   C75]  ?
sk_stream_alloc_skb+0x63/0x2b0 [6514198.520494][   C75]
__tcp_transmit_skb+0xa38/0xbd0 [6514198.531274][   C75]
__tcp_retransmit_skb+0x181/0x890 [6514198.542055][   C75]  ?
enqueue_task_fair+0xf5/0x680 [6514198.552537][   C75]  ?
bbr_set_state+0x75/0x80 [tcp_bbr] [6514198.563453][   C75]
tcp_retransmit_skb+0x12/0x80 [6514198.573731][   C75]
tcp_retransmit_timer+0x392/0x950 [6514198.584262][   C75]
tcp_write_timer_handler+0x16c/0x250 [6514198.594954][   C75]
tcp_write_timer+0x8d/0xc0 [6514198.604714][   C75]  ?
tcp_write_timer_handler+0x250/0x250 [6514198.615264][   C75]
call_timer_fn+0x26/0xf0 [6514198.624515][   C75]
__run_timers.part.0+0x1b3/0x220 [6514198.634459][   C75]  ?
__hrtimer_run_queues+0x152/0x270 [6514198.644429][   C75]  ?
recalibrate_cpu_khz+0x10/0x10 [6514198.653992][   C75]  ?
ktime_get+0x38/0xa0 [6514198.662585][   C75]
run_timer_softirq+0x56/0xd0 [6514198.671645][   C75]
__do_softirq+0xbf/0x25c [6514198.680310][   C75]
irq_exit_rcu+0x7f/0xa0 [6514198.688834][   C75]
sysvec_apic_timer_interrupt+0x72/0x90

it's not a "frequent" occurrance; about once per month across our
fleet, different systems / different locations.

The codepath is always the same, TCP retransmit -> mangle hook tcp_mss
-> __pksb_pull_tail, and hits the splat at
https://elixir.bootlin.com/linux/v5.15.32/source/net/core/skbuff.c#L2194

Also context: Our kernel is "almost-stock" 5.15.32 - we carry less
than ten feature and driver patches that aren't mainlined or
backported to linux-stable. We touch net/core/filter.c for BPF socket
lookup enhancements not in linux-stable yet, but no other changes to
net/core vs. mainline.

Is this known ?

I've stumbled over this report from last year, where the same BUG()
line was hit but via a different codepath,
https://www.spinics.net/lists/netdev/msg768712.html - not noticed a
follow up there though.

Thanks in advance,
Frank Hofmann

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ