Message-ID: <CALrw=nGoSW=M-SApcvkP4cfYwWRj=z7WonKi6fEksWjMZTs81A@mail.gmail.com>
Date: Mon, 23 Sep 2024 19:23:14 +0100
From: Ignat Korchagin <ignat@...udflare.com>
To: Jason@...c4.com, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	wireguard@...ts.zx2c4.com, netdev <netdev@...r.kernel.org>, 
	linux-kernel <linux-kernel@...r.kernel.org>, jiri@...nulli.us, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Lorenzo Bianconi <lorenzo@...nel.org>
Cc: kernel-team <kernel-team@...udflare.com>
Subject: wireguard/napi stuck in napi_disable

Hello,

We run Calico on our Kubernetes cluster, which uses Wireguard to
encrypt in-cluster traffic [1]. Recently we tried to improve the
throughput of the cluster and eliminate some packet drops we were
seeing by switching on threaded NAPI [2] on these managed Wireguard
interfaces. However, our Kubernetes hosts then started to lock up
once in a while.

Analyzing one stuck host with drgn, we were able to confirm that the
code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
be cleared for the Wireguard peer napi instance, but that never
happens for some reason. For context, the full state of the stuck napi
instance is 0b100110111. To make things worse, this happens when
calico removes a Wireguard peer, which is done while holding the
global rtnl_mutex, so all the other tasks requiring that mutex get
stuck as well.
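
For readers who want to decode the state value above, here is a minimal
Python sketch. It assumes the bit order of the NAPI_STATE_* enum in
include/linux/netdevice.h as of v6.6 (written from memory, so please check
it against your tree), and that the loop in [3] keeps sleeping while
either SCHED or NPSVC is set. The helper names are hypothetical.

```python
# Bit positions of napi_struct.state. ASSUMPTION: this mirrors the
# NAPI_STATE_* enum in include/linux/netdevice.h as of v6.6; verify it
# against the kernel tree you are debugging.
NAPI_STATE_BITS = [
    "SCHED",             # bit 0: poll is scheduled
    "MISSED",            # bit 1: reschedule requested while scheduled
    "DISABLE",           # bit 2: disable pending
    "NPSVC",             # bit 3: netpoll is servicing this napi
    "LISTED",            # bit 4: napi added to system lists
    "NO_BUSY_POLL",      # bit 5: not added to napi_hash
    "IN_BUSY_POLL",      # bit 6: sk_busy_loop() owns this napi
    "PREFER_BUSY_POLL",  # bit 7: prefer busy polling over softirq
    "THREADED",          # bit 8: poll runs in its own kthread
    "SCHED_THREADED",    # bit 9: scheduled in threaded mode
]


def decode_napi_state(state):
    """Return the names of the bits set in state, lowest bit first."""
    return [name for bit, name in enumerate(NAPI_STATE_BITS)
            if state & (1 << bit)]


def napi_disable_would_wait(state):
    """True if the wait loop would keep sleeping for this state.

    ASSUMPTION: the loop condition is (SCHED | NPSVC), as in v6.6.
    """
    flags = decode_napi_state(state)
    return "SCHED" in flags or "NPSVC" in flags


stuck = 0b100110111  # value reported for the stuck napi instance
print(decode_napi_state(stuck))
print(napi_disable_would_wait(stuck))
```

Under those assumptions the value decodes to SCHED, MISSED, DISABLE,
LISTED, NO_BUSY_POLL and THREADED, with SCHED_THREADED clear.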

Full stacktrace of the “looping” task:

#0  context_switch (linux/kernel/sched/core.c:5380:2)
#1  __schedule (linux/kernel/sched/core.c:6698:8)
#2  schedule (linux/kernel/sched/core.c:6772:3)
#3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
#4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
#5  usleep_range (linux/include/linux/delay.h:68:2)
#6  napi_disable (linux/net/core/dev.c:6477:4)
#7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
#8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
#9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
#10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
#11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
#12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
#13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
#14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
#15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
#16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
#17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
#18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
#19 __sock_sendmsg (linux/net/socket.c:745:16)
#20 ____sys_sendmsg (linux/net/socket.c:2560:8)
#21 ___sys_sendmsg (linux/net/socket.c:2614:8)
#22 __sys_sendmsg (linux/net/socket.c:2643:8)
#23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
#24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
#25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)

We have also noticed a similar issue when we switch Wireguard
threaded NAPI back off: the task removing a Wireguard peer may still
spend a considerable amount of time in the above loop (and hold
rtnl_mutex), but the host eventually recovers from this state.

So the questions are:
1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
threaded NAPI case in Wireguard?
2. Is it generally a good idea for Wireguard to loop for an
indeterminate amount of time, while holding the rtnl_mutex? Or can it
be refactored?

We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
downgrade the kernel a couple of patch revisions, but it did not help,
and our logs indicate that at least the non-threaded prolonged holding
of the rtnl_mutex has been happening for a while now.

[1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
[2]: https://docs.kernel.org/networking/napi.html#threaded
[3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476