Message-ID: <0b3ab776-2b8b-1725-d36e-70af66c138da@akamai.com>
Date: Wed, 11 Dec 2019 14:52:41 -0800
From: Josh Hunt <johunt@...mai.com>
To: steffen.klassert@...unet.com, herbert@...dor.apana.org.au,
David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>
Subject: crash in __xfrm_state_lookup on 4.19 LTS
We've recently hit the following crash on a handful of machines running
4.19.55 LTS and strongSwan. The kernels on these machines carry some
patches on top of 4.19 LTS, but nothing in the area of xfrm/ipsec:
[54284.354997] general protection fault: 0000 [#1] SMP PTI
[54284.355504] CPU: 6 PID: 11937 Comm: charon Tainted: G O L 4.19.55-4.19.2.4-amd64-2b86b5ea31726254 #1
[54284.356382] Hardware name: Ciara Technologies 1x8-X6 SSD 32G 10GE/CangJie, BIOS CC1F110D 08/12/2014
[54284.357322] RIP: 0010:__xfrm_state_lookup+0x7f/0x110
[54284.357856] Code: d0 4a 8d 04 c0 48 8b 00 48 85 c0 74 68 41 89 cf 49 89 d6 41 89 f5 eb 09 48 8b 43 28 48 85 c0 74 54 48 83 e8 28 48 89 c3 74 4b <66> 3b a8 d2 00 00 00 75 e5 44 3b 78 50 75 df 44 3a 60 54 75 d9 66
[54284.359190] RSP: 0018:ffffab5043d93ad0 EFLAGS: 00010212
[54284.359748] RAX: 6174735f79636e3d RBX: 6174735f79636e3d RCX: 0000000064959bc7
[54284.360219] RDX: ffff9bb0593c3380 RSI: 0000000000000000 RDI: ffffffff951071c0
[54284.360713] RBP: 0000000000000002 R08: 0000000000000010 R09: 00000000001b950d
[54284.361209] R10: 000000000000003f R11: 0000000096001849 R12: 0000000000000032
[54284.361755] R13: 0000000000000000 R14: ffff9bb0593c3380 R15: 0000000064959bc7
[54284.362255] FS: 00007facd7b01700(0000) GS:ffff9bb07fb80000(0000) knlGS:00000000000000000
[54284.363198] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54284.363687] CR2: 00007f99250e89e0 CR3: 00000007e1078006 CR4: 00000000001606e0
[54284.364156] Call Trace:
[54284.364642] xfrm_state_add+0x108/0x290
[54284.365113] xfrm_add_sa+0x9e6/0xb28 [xfrm_user]
[54284.365580] ? xfrm_user_rcv_msg+0x183/0x1a0 [xfrm_user]
[54284.366077] xfrm_user_rcv_msg+0x183/0x1a0 [xfrm_user]
[54284.366543] ? xfrm_dump_sa_done+0x30/0x30 [xfrm_user]
[54284.367040] netlink_rcv_skb+0xde/0x110
[54284.367504] xfrm_netlink_rcv+0x30/0x40 [xfrm_user]
[54284.368000] netlink_unicast+0x191/0x230
[54284.368463] netlink_sendmsg+0x2c4/0x390
[54284.368958] sock_sendmsg+0x36/0x40
[54284.369449] __sys_sendto+0xd8/0x150
[54284.369940] ? kern_select+0xb9/0xe0
[54284.370405] __x64_sys_sendto+0x24/0x30
[54284.370946] do_syscall_64+0x4e/0x110
[54284.383941] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54284.384497] RIP: 0033:0x7face4679ad3
(gdb) list *(__xfrm_state_lookup+0x7f)
0xffffffff8271beaf is in __xfrm_state_lookup (net/xfrm/xfrm_state.c:841).
warning: Source file is more recent than executable.
836 {
837 unsigned int h = xfrm_spi_hash(net, daddr, spi, proto, family);
838 struct xfrm_state *x;
839
840 hlist_for_each_entry_rcu(x, net->xfrm.state_byspi + h, byspi) {
841 if (x->props.family != family ||
842 x->id.spi != spi ||
843 x->id.proto != proto ||
844 !xfrm_addr_equal(&x->id.daddr, daddr, family))
845 continue;
The above looks similar to these very old reports:
https://wiki.strongswan.org/issues/2147
https://bugzilla.kernel.org/show_bug.cgi?id=84961
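One detail in the register dump may be worth a closer look: RAX and RBX
both hold 6174735f79636e3d, which does not look like a kernel pointer.
The sketch below is a minimal check, not part of the original report. It
renders that value as the bytes it would occupy in memory and tests
whether it is a canonical x86-64 address. The 0x28 offset is an
assumption read off the "48 83 e8 28" bytes in the Code: line (which
appear to subtract 0x28 to get from the list node back to the entry), and
48-bit virtual addressing is assumed. The helper names are hypothetical.

```python
# Values copied from the oops above.
RAX = 0x6174735F79636E3D          # faulting entry pointer (also in RBX)
RDX = 0xFFFF9BB0593C3380          # an ordinary-looking kernel pointer, for contrast

# Assumption: offset of the list node inside the entry, inferred from the
# "48 83 e8 28" bytes in the Code: line. Not confirmed against the struct layout.
NODE_OFFSET = 0x28


def as_ascii(value: int) -> str:
    """Show a 64-bit value as the 8 bytes x86-64 stores in memory (little-endian)."""
    raw = value.to_bytes(8, "little")
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = addr >> 47
    return top == 0 or top == 0x1FFFF


print(as_ascii(RAX))                # bytes of the entry pointer
print(as_ascii(RAX + NODE_OFFSET))  # bytes of the node pointer it was derived from
print(is_canonical(RAX), is_canonical(RDX))
```

If the offset assumption holds, the "next" pointer that was loaded from
the hash chain reads as printable ASCII ("ency_sta"), which would suggest
the node was overwritten with string data rather than pointing at a
freed-but-intact xfrm_state. A non-canonical address is also consistent
with the oops being a general protection fault rather than a page fault.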
Prior to the crash we see soft lockups and RCU stalls (see the attached
netconsole log file). The RIP in those stalls/lockups appears to be in
the same area as the crash reported above: lines 840 and 841 of
net/xfrm/xfrm_state.c.
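Purely as an illustration of why a lockup and a fault can share the same
RIP, here is a toy model of a singly linked chain walk. It is not kernel
code and makes no claim about the actual root cause; Node and walk are
hypothetical names. A chain whose last node points back into the chain
never terminates, while an intact chain ends at None.

```python
from typing import Optional


class Node:
    """Toy stand-in for one entry on a hash chain."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.next: Optional["Node"] = None


def walk(head: Optional[Node], limit: int = 1000) -> str:
    """Follow next pointers until None; give up after `limit` steps."""
    steps = 0
    node = head
    while node is not None:
        steps += 1
        if steps > limit:
            # In a real kernel loop there is no such bound; the CPU would
            # simply keep spinning inside the traversal.
            return "no termination"
        node = node.next
    return f"ended after {steps} nodes"


# Intact chain: a -> b -> c -> None
a, b, c = Node("a"), Node("b"), Node("c")
a.next, b.next = b, c
print(walk(a))

# Corrupted chain: c points back to a, so the walk never reaches None.
c.next = a
print(walk(a))
```

In this model the same loop either spins forever (cycle) or, in the real
kernel, dereferences whatever garbage sits in the next pointer, so both
symptoms would be reported at the loop header and the first field access.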
I've tried reproducing the problem in our lab, but have been
unsuccessful so far. Running the latest upstream kernel in production
to see if that resolves the issue is not possible at the moment. It's
very possible this crash was already happening on earlier kernel
versions in our network; I just don't have any data to confirm that.
Here's some possibly relevant kernel config info; I can provide more
if requested:
# zgrep -E '(RCU|XFRM)' /proc/config.gz
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
CONFIG_RCU_NOCB_CPU=y
CONFIG_HAVE_RCU_TABLE_FREE=y
CONFIG_HAVE_RCU_TABLE_INVALIDATE=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
# CONFIG_XFRM_INTERFACE is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_SECURITY_NETWORK_XFRM is not set
# RCU Debugging
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=30
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
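For comparing this config against other machines, the zgrep output can
be turned into a dictionary. This is a minimal sketch that assumes the
output has been pasted in as text; parse_kconfig is a hypothetical helper
name, and CONFIG_TEXT holds only a short excerpt of the lines above.

```python
# Short excerpt of the config lines quoted above.
CONFIG_TEXT = """\
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_RCU_NOCB_CPU=y
CONFIG_XFRM_USER=m
# CONFIG_XFRM_MIGRATE is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=30
# CONFIG_RCU_TRACE is not set
"""

NOT_SET_SUFFIX = " is not set"


def parse_kconfig(text: str) -> dict:
    """Map each CONFIG_ option to its value; options that are not set map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            name = line[2:-len(NOT_SET_SUFFIX)]
            options[name] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
        # Anything else (section comments, blank lines) is ignored.
    return options


config = parse_kconfig(CONFIG_TEXT)
disabled = sorted(name for name, value in config.items() if value == "n")
print(disabled)
```

Diffing two such dictionaries is then a one-liner, which makes it easy to
see which RCU or XFRM options differ between hosts.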
Please let me know what other information would be useful to root-cause
this problem.
Thanks!
Josh
View attachment "xfrm-debug.log" of type "text/x-log" (723027 bytes)