netdev - Re: wireguard/napi stuck in napi

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CA+wXwBTT74RErDGAnj98PqS=wvdh8eM1pi4q6tTdExtjnokKqA@mail.gmail.com>
Date: Wed, 25 Sep 2024 16:06:24 +0100
From: Daniel Dao <dqminh@...udflare.com>
To: Ignat Korchagin <ignat@...udflare.com>
Cc: Jason@...c4.com, "David S. Miller" <davem@...emloft.net>, 
	Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, 
	wireguard@...ts.zx2c4.com, netdev <netdev@...r.kernel.org>, 
	linux-kernel <linux-kernel@...r.kernel.org>, jiri@...nulli.us, 
	Sebastian Andrzej Siewior <bigeasy@...utronix.de>, Lorenzo Bianconi <lorenzo@...nel.org>, 
	kernel-team <kernel-team@...udflare.com>, Tejun Heo <tj@...nel.org>
Subject: Re: wireguard/napi stuck in napi_disable

On Mon, Sep 23, 2024 at 10:33 PM Ignat Korchagin <ignat@...udflare.com> wrote:
>
> On Mon, Sep 23, 2024 at 7:23 PM Ignat Korchagin <ignat@...udflare.com> wrote:
> >
> > Hello,
> >
> > We run calico on our Kubernetes cluster, which uses Wireguard to
> > encrypt in-cluster traffic [1]. Recently we tried to improve the
> > throughput of the cluster and eliminate some packet drops we’re seeing
> > by switching on threaded NAPI [2] on these managed Wireguard
> > interfaces. However, our Kubernetes hosts started to lock up once in a
> > while.
> >
> > Analyzing one stuck host with drgn we were able to confirm that the
> > code is just waiting in this loop [3] for the NAPI_STATE_SCHED bit to
> > be cleared for the Wireguard peer napi instance, but that never
> > happens for some reason. For context the full state of the stuck napi
> > instance is 0b100110111. What makes things worse - this happens when
> > calico removes a Wireguard peer, which happens while holding the
> > global rtnl_mutex, so all the other tasks requiring that mutex get
> > stuck as well.
> >
> > Full stacktrace of the “looping” task:
> >
> > #0  context_switch (linux/kernel/sched/core.c:5380:2)
> > #1  __schedule (linux/kernel/sched/core.c:6698:8)
> > #2  schedule (linux/kernel/sched/core.c:6772:3)
> > #3  schedule_hrtimeout_range_clock (linux/kernel/time/hrtimer.c:2311:3)
> > #4  usleep_range_state (linux/kernel/time/timer.c:2363:8)
> > #5  usleep_range (linux/include/linux/delay.h:68:2)
> > #6  napi_disable (linux/net/core/dev.c:6477:4)
> > #7  peer_remove_after_dead (linux/drivers/net/wireguard/peer.c:120:2)
> > #8  set_peer (linux/drivers/net/wireguard/netlink.c:425:3)
> > #9  wg_set_device (linux/drivers/net/wireguard/netlink.c:592:10)
> > #10 genl_family_rcv_msg_doit (linux/net/netlink/genetlink.c:971:8)
> > #11 genl_family_rcv_msg (linux/net/netlink/genetlink.c:1051:10)
> > #12 genl_rcv_msg (linux/net/netlink/genetlink.c:1066:8)
> > #13 netlink_rcv_skb (linux/net/netlink/af_netlink.c:2545:9)
> > #14 genl_rcv (linux/net/netlink/genetlink.c:1075:2)
> > #15 netlink_unicast_kernel (linux/net/netlink/af_netlink.c:1342:3)
> > #16 netlink_unicast (linux/net/netlink/af_netlink.c:1368:10)
> > #17 netlink_sendmsg (linux/net/netlink/af_netlink.c:1910:8)
> > #18 sock_sendmsg_nosec (linux/net/socket.c:730:12)
> > #19 __sock_sendmsg (linux/net/socket.c:745:16)
> > #20 ____sys_sendmsg (linux/net/socket.c:2560:8)
> > #21 ___sys_sendmsg (linux/net/socket.c:2614:8)
> > #22 __sys_sendmsg (linux/net/socket.c:2643:8)
> > #23 do_syscall_x64 (linux/arch/x86/entry/common.c:51:14)
> > #24 do_syscall_64 (linux/arch/x86/entry/common.c:81:7)
> > #25 entry_SYSCALL_64+0x9c/0x184 (linux/arch/x86/entry/entry_64.S:121)
> >

Looking at this further to understand why the napi state would be 0b100110111
which decoded to: SCHED | MISSED | DISABLE | LISTED | NO_BUSY_POLL | THREADED,
I think this is a problem with MISSED wakeup in napi threaded mode.

napi_complete_done calls __napi_schedule when state has `NAPIF_STATE_MISSED`,
but in 6.6, it does not set NAPI_STATE_SCHED_THREADED when the napi thread is
running, and since the thread is running, wakeup does not do anything. Therefore
we missed the chance to do another poll, and if we race with napi_disable then
the state will be SCHED | MISSED | DISABLE, and we stuck.

However it looks like the following commit resolves that situation for us

commit 56364c910691f6d10ba88c964c9041b9ab777bd6
Author: Sebastian Andrzej Siewior <bigeasy@...utronix.de>
Date:   Mon Mar 25 08:40:28 2024 +0100

    net: Remove conditional threaded-NAPI wakeup based on task state.

As long as we always set SCHED_THREADED, the next loop in napi thread will do
the right thing and clear SCHED, letting napi_disable complete.

We are testing 6.6 with this patch, and haven't seen any similar lockups so far.

> > We have also noticed that a similar issue is observed, when we switch
> > Wireguard threaded NAPI back to off: removing a Wireguard peer task
> > may still spend a considerable amount of time in the above loop (and
> > hold rtnl_mutex), however the host eventually recovers from this
> > state.

However, these lockups become much more prominent now. Here is the
stack trace of
peer_remove_after_dead

#0  context_switch (/cfsetup_build/build/linux/kernel/sched/core.c:5380:2)
#1  __schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6699:8)
#2  schedule (/cfsetup_build/build/linux/kernel/sched/core.c:6773:3)
#3  schedule_timeout (/cfsetup_build/build/linux/kernel/time/timer.c:2143:3)
#4  do_wait_for_common
(/cfsetup_build/build/linux/kernel/sched/completion.c:95:14)
#5  __wait_for_common
(/cfsetup_build/build/linux/kernel/sched/completion.c:116:12)
#6  wait_for_common (/cfsetup_build/build/linux/kernel/sched/completion.c:127:9)
#7  wait_for_completion
(/cfsetup_build/build/linux/kernel/sched/completion.c:148:2)
#8  __flush_workqueue (/cfsetup_build/build/linux/kernel/workqueue.c:3196:2)
#9  peer_remove_after_dead
(/cfsetup_build/build/linux/drivers/net/wireguard/peer.c:116:2)
#10 set_peer (/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:425:3)
#11 wg_set_device
(/cfsetup_build/build/linux/drivers/net/wireguard/netlink.c:592:10)
#12 genl_family_rcv_msg_doit
(/cfsetup_build/build/linux/net/netlink/genetlink.c:971:8)
#13 genl_family_rcv_msg
(/cfsetup_build/build/linux/net/netlink/genetlink.c:1051:10)
#14 genl_rcv_msg (/cfsetup_build/build/linux/net/netlink/genetlink.c:1066:8)
#15 netlink_rcv_skb (/cfsetup_build/build/linux/net/netlink/af_netlink.c:2544:9)
#16 genl_rcv (/cfsetup_build/build/linux/net/netlink/genetlink.c:1075:2)
#17 netlink_unicast_kernel
(/cfsetup_build/build/linux/net/netlink/af_netlink.c:1342:3)
#18 netlink_unicast
(/cfsetup_build/build/linux/net/netlink/af_netlink.c:1368:10)
#19 netlink_sendmsg (/cfsetup_build/build/linux/net/netlink/af_netlink.c:1910:8)
#20 sock_sendmsg_nosec (/cfsetup_build/build/linux/net/socket.c:730:12)
#21 __sock_sendmsg (/cfsetup_build/build/linux/net/socket.c:745:16)
#22 ____sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2590:8)
#23 ___sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2644:8)
#24 __sys_sendmsg (/cfsetup_build/build/linux/net/socket.c:2673:8)
#25 do_syscall_x64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:51:14)
#26 do_syscall_64 (/cfsetup_build/build/linux/arch/x86/entry/common.c:81:7)
#27 entry_SYSCALL_64+0x9c/0x184
(/cfsetup_build/build/linux/arch/x86/entry/entry_64.S:121)
#28 0x41262e

drgn shows that we are waiting to for completion of work for
wg_packet_tx_worker,
which is destined for a completely different peer than the peer we
want to remove.

*(struct worker *)0xffff888107f640c0 = {
        .entry = (struct list_head){
                .next = (struct list_head *)0x0,
                .prev = (struct list_head *)0xffff8897e0cb1f50,
        },
        .hentry = (struct hlist_node){
                .next = (struct hlist_node *)0x0,
                .pprev = (struct hlist_node **)0xffff8897e0cb1f50,
        },
        .current_work = (struct work_struct *)0xffff8881a32638d0,
        .current_func = (work_func_t)wg_packet_tx_worker+0x0 =
0xffffffffc0f6ca40,
        .current_pwq = (struct pool_workqueue *)0xffff88812bca6400,
        .current_at = (u64)3491257913,
        .current_color = (unsigned int)4,
        .sleeping = (int)0,
        .last_func = (work_func_t)wg_packet_tx_worker+0x0 = 0xffffffffc0f6ca40,
        .scheduled = (struct list_head){
                .next = (struct list_head *)0xffff8881a32638d8,
                .prev = (struct list_head *)0xffff8881a32638d8,
        },
        .task = (struct task_struct *)0xffff888472a08000,
        .pool = (struct worker_pool *)0xffff8897e0cb1cc0,
        .node = (struct list_head){
                .next = (struct list_head *)0xffff888107f64360,
                .prev = (struct list_head *)0xffff888107f645a0,
        },
        .last_active = (unsigned long)4409382646,
        .flags = (unsigned int)64,
        .id = (int)1,
        .desc = (char [32])"wg-crypt-wireguard.cali",
        .rescue_wq = (struct workqueue_struct *)0x0,
}

This can take a very long time especially if the peers produce/receive
as fast as it
can, as in our test setup. We setup some metrics with bpftrace to measure the
duration of peer_remove_after_dead and wg_packet_tx_worker and got the following
measurements after a long wait time.

@duration_ms[peer_remove_after_dead]:
[512K, 1M)             1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
@duration_ms[wg_packet_tx_worker]:
[0]               744612 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1]                   20 |                                                    |
[2, 4)                 2 |                                                    |
[4, 8)                 2 |                                                    |
[16, 32)               2 |                                                    |
[256K, 512K)           1 |                                                    |
[2M, 4M)               1 |                                                    |

We can see that peer_remove_after_dead took between 512s to 1000s to complete,
while wg_packet_tx_worker can take up to [2000s, 4000s) to complete, which is an
awfully long time.

Daniel

> >
> > So the questions are:
> > 1. Any ideas why NAPI_STATE_SCHED bit never gets cleared for the
> > threaded NAPI case in Wireguard?
> > 2. Is it generally a good idea for Wireguard to loop for an
> > indeterminate amount of time, while holding the rtnl_mutex? Or can it
> > be refactored?
>
> I've been also trying to reproduce this issue with a script [1]. While
> I could not reproduce the complete lockup I've been able to confirm
> that peer_remove_after_dead() may take multiple seconds to execute -
> all while holding the rtnl_mutex. Below is bcc-tools funclatency
> output from a freshly compiled mainline (6.11):
>
> # /usr/share/bcc/tools/funclatency peer_remove_after_dead
> Tracing 1 functions for "peer_remove_after_dead"... Hit Ctrl-C to end.
> ^C
>                nsecs                         : count     distribution
>                    0 -> 1                    : 0        |                    |
>                    2 -> 3                    : 0        |                    |
>                    4 -> 7                    : 0        |                    |
>                    8 -> 15                   : 0        |                    |
>                   16 -> 31                   : 0        |                    |
>                   32 -> 63                   : 0        |                    |
>                   64 -> 127                  : 0        |                    |
>                  128 -> 255                  : 0        |                    |
>                  256 -> 511                  : 0        |                    |
>                  512 -> 1023                 : 0        |                    |
>                 1024 -> 2047                 : 0        |                    |
>                 2048 -> 4095                 : 0        |                    |
>                 4096 -> 8191                 : 0        |                    |
>                 8192 -> 16383                : 0        |                    |
>                16384 -> 32767                : 0        |                    |
>                32768 -> 65535                : 0        |                    |
>                65536 -> 131071               : 0        |                    |
>               131072 -> 262143               : 0        |                    |
>               262144 -> 524287               : 68       |**                  |
>               524288 -> 1048575              : 658      |********************|
>              1048576 -> 2097151              : 267      |********            |
>              2097152 -> 4194303              : 68       |**                  |
>              4194304 -> 8388607              : 124      |***                 |
>              8388608 -> 16777215             : 182      |*****               |
>             16777216 -> 33554431             : 72       |**                  |
>             33554432 -> 67108863             : 34       |*                   |
>             67108864 -> 134217727            : 22       |                    |
>            134217728 -> 268435455            : 11       |                    |
>            268435456 -> 536870911            : 2        |                    |
>            536870912 -> 1073741823           : 2        |                    |
>           1073741824 -> 2147483647           : 1        |                    |
>           2147483648 -> 4294967295           : 0        |                    |
>           4294967296 -> 8589934591           : 1        |                    |
>
> avg = 14251705 nsecs, total: 21548578415 nsecs, count: 1512
>
> Detaching...
>
> So we have cases where it takes 2 or even 8 seconds to remove a single
> peer, which is definitely not great considering we're holding a global
> lock.
>
> > We have observed the problem on Linux 6.6.47 and 6.6.48. We did try to
> > downgrade the kernel a couple of patch revisions, but it did not help
> > and our logs indicate that at least the non-threaded prolonged holding
> > of the rtnl_mutex is happening for a while now.
> >
> > [1]: https://docs.tigera.io/calico/latest/network-policy/encrypt-cluster-pod-traffic
> > [2]: https://docs.kernel.org/networking/napi.html#threaded
> > [3]: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/net/core/dev.c?h=v6.6.48#n6476
>
> Ignat
>
> [1]: https://gist.githubusercontent.com/ignatk/4505d96e02815de3aa5649c4aa7c3fca/raw/177e4eab9f491024db6488cd0ea1cbba2d5579b4/wg.sh