Message-ID: <0b3ab776-2b8b-1725-d36e-70af66c138da@akamai.com>
Date:   Wed, 11 Dec 2019 14:52:41 -0800
From:   Josh Hunt <johunt@...mai.com>
To:     steffen.klassert@...unet.com, herbert@...dor.apana.org.au,
        David Miller <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>
Subject: crash in __xfrm_state_lookup on 4.19 LTS

We've recently hit the following crash on a handful of machines running
4.19.55 LTS and strongSwan. The kernels on these machines carry some
patches on top of 4.19 LTS, but nothing in the xfrm/IPsec area:

[54284.354997] general protection fault: 0000 [#1] SMP PTI
[54284.355504] CPU: 6 PID: 11937 Comm: charon Tainted: G           O L   4.19.55-4.19.2.4-amd64-2b86b5ea31726254 #1
[54284.356382] Hardware name: Ciara Technologies 1x8-X6 SSD 32G 10GE/CangJie, BIOS CC1F110D 08/12/2014
[54284.357322] RIP: 0010:__xfrm_state_lookup+0x7f/0x110
[54284.357856] Code: d0 4a 8d 04 c0 48 8b 00 48 85 c0 74 68 41 89 cf 49 89 d6 41 89 f5 eb 09 48 8b 43 28 48 85 c0 74 54 48 83 e8 28 48 89 c3 74 4b <66> 3b a8 d2 00 00 00 75 e5 44 3b 78 50 75 df 44 3a 60 54 75 d9 66
[54284.359190] RSP: 0018:ffffab5043d93ad0 EFLAGS: 00010212
[54284.359748] RAX: 6174735f79636e3d RBX: 6174735f79636e3d RCX: 0000000064959bc7
[54284.360219] RDX: ffff9bb0593c3380 RSI: 0000000000000000 RDI: ffffffff951071c0
[54284.360713] RBP: 0000000000000002 R08: 0000000000000010 R09: 00000000001b950d
[54284.361209] R10: 000000000000003f R11: 0000000096001849 R12: 0000000000000032
[54284.361755] R13: 0000000000000000 R14: ffff9bb0593c3380 R15: 0000000064959bc7
[54284.362255] FS:  00007facd7b01700(0000) GS:ffff9bb07fb80000(0000) knlGS:00000000000000000
[54284.363198] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54284.363687] CR2: 00007f99250e89e0 CR3: 00000007e1078006 CR4: 00000000001606e0
[54284.364156] Call Trace:
[54284.364642]  xfrm_state_add+0x108/0x290
[54284.365113]  xfrm_add_sa+0x9e6/0xb28 [xfrm_user]
[54284.365580]  ? xfrm_user_rcv_msg+0x183/0x1a0 [xfrm_user]
[54284.366077]  xfrm_user_rcv_msg+0x183/0x1a0 [xfrm_user]
[54284.366543]  ? xfrm_dump_sa_done+0x30/0x30 [xfrm_user]
[54284.367040]  netlink_rcv_skb+0xde/0x110
[54284.367504]  xfrm_netlink_rcv+0x30/0x40 [xfrm_user]
[54284.368000]  netlink_unicast+0x191/0x230
[54284.368463]  netlink_sendmsg+0x2c4/0x390
[54284.368958]  sock_sendmsg+0x36/0x40
[54284.369449]  __sys_sendto+0xd8/0x150
[54284.369940]  ? kern_select+0xb9/0xe0
[54284.370405]  __x64_sys_sendto+0x24/0x30
[54284.370946]  do_syscall_64+0x4e/0x110
[54284.383941]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[54284.384497] RIP: 0033:0x7face4679ad3

(gdb) list *(__xfrm_state_lookup+0x7f)
0xffffffff8271beaf is in __xfrm_state_lookup (net/xfrm/xfrm_state.c:841).
warning: Source file is more recent than executable.
836	{
837		unsigned int h = xfrm_spi_hash(net, daddr, spi, proto, family);
838		struct xfrm_state *x;
839	
840		hlist_for_each_entry_rcu(x, net->xfrm.state_byspi + h, byspi) {
841			if (x->props.family != family ||
842			    x->id.spi       != spi ||
843			    x->id.proto     != proto ||
844			    !xfrm_addr_equal(&x->id.daddr, daddr, family))
845				continue;

This crash looks similar to these very old reports:
https://wiki.strongswan.org/issues/2147
https://bugzilla.kernel.org/show_bug.cgi?id=84961

Prior to the crash we see soft lockups and RCU stalls (see the
attached netconsole log file). The RIP in those stalls and lockups
appears to be in the same area as the crash reported above, at lines
840 and 841.

I've tried reproducing the problem in our lab but have been
unsuccessful so far, and running the latest upstream kernel in
production to see whether that resolves the issue is not possible at
the moment. It's quite possible this crash was already happening on
earlier kernel versions in our network; I just don't have any data to
confirm that.

Here's some possibly relevant kernel config info; I can provide more
if requested:

# zgrep -E '(RCU|XFRM)' /proc/config.gz
# RCU Subsystem
CONFIG_TREE_RCU=y
CONFIG_RCU_EXPERT=y
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
CONFIG_RCU_NOCB_CPU=y
CONFIG_HAVE_RCU_TABLE_FREE=y
CONFIG_HAVE_RCU_TABLE_INVALIDATE=y
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=m
CONFIG_XFRM_USER=m
# CONFIG_XFRM_INTERFACE is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
CONFIG_XFRM_STATISTICS=y
CONFIG_XFRM_IPCOMP=m
CONFIG_INET_XFRM_TUNNEL=m
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=m
CONFIG_INET6_XFRM_TUNNEL=m
CONFIG_INET6_XFRM_MODE_TRANSPORT=m
CONFIG_INET6_XFRM_MODE_TUNNEL=m
CONFIG_INET6_XFRM_MODE_BEET=m
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_SECURITY_NETWORK_XFRM is not set
# RCU Debugging
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=30
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set

Please let me know what other information would be useful to
root-cause this problem.

Thanks!
Josh

View attachment "xfrm-debug.log" of type "text/x-log" (723027 bytes)