Message-ID: <d60c4cf1-8812-01e1-a51d-f1ada3c1b3fa@candelatech.com>
Date: Fri, 22 Apr 2022 14:09:02 -0700
From: Ben Greear <greearb@...delatech.com>
To: Florian Westphal <fw@...len.de>
Cc: netdev <netdev@...r.kernel.org>
Subject: Re: 5.10.4+ hang with 'rmmod nf_conntrack'
On 4/22/22 9:32 AM, Ben Greear wrote:
> On 1/8/21 5:07 AM, Ben Greear wrote:
>> On 1/7/21 10:16 PM, Florian Westphal wrote:
>>> Ben Greear <greearb@...delatech.com> wrote:
>>>> I noticed my system has a hung process trying to 'rmmod nf_conntrack'.
>>>>
>>>> I've been running the script that calls rmmod for a long time,
>>>> but have only tested it extensively on 5.4 and earlier kernels.
>>>>
>>>> If anyone has any ideas, please let me know. This is from 'sysrq t'. I don't see
>>>> any hung-task splats in dmesg.
>>>
>>> rmmod on conntrack loops forever until the active conntrack object count reaches 0.
>>> (plus a walk of the conntrack table to evict/put all entries).
>
> Hello Florian,
>
> I keep hitting this bug in a particular test case in 5.17.4+, so I added some debug to
> try to learn more.
>
> My debugging patch looks like this:
>
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index 7552e1e9fd62..29724114caef 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -2543,6 +2543,7 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
>  {
>  	int busy;
>  	struct net *net;
> +	unsigned long loops = 0;
>
>  	/*
>  	 * This makes sure all current packets have passed through
> @@ -2556,12 +2557,30 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
>  		struct nf_conntrack_net *cnet = nf_ct_pernet(net);
>
>  		nf_ct_iterate_cleanup(kill_all, net, 0, 0);
> -		if (atomic_read(&cnet->count) != 0)
> +		if (atomic_read(&cnet->count) != 0) {
> +			if (loops > 50010)
> +				pr_err("nf-conntrack-cleanup-net-list, loops: %ld cnet-count: %d, expect-count: %d users4: %d users6: %d users_bridge: %d\n",
> +				       loops, atomic_read(&cnet->count), cnet->expect_count,
> +				       cnet->users4, cnet->users6, cnet->users_bridge);
>  			busy = 1;
> +		}
>  	}
>  	if (busy) {
> +		loops++;
> +		if (loops > 50000) {
> +			msleep(500);
> +		}
>  		schedule();
> -		goto i_see_dead_people;
> +		if (loops > 50020) {
> +			/* This thing is wedged, going to require a reboot to recover, so attempt
> +			 * to just ignore the bad count and see if system works OK.
> +			 */
> +			WARN_ON_ONCE(1);
> +			pr_err("ERROR: nf_conntrack_cleanup_net cannot make progress. Ignoring stale reference count and will continue.\n");
> +		}
> +		else {
> +			goto i_see_dead_people;
> +		}
>  	}
>
>  	list_for_each_entry(net, net_exit_list, exit_list) {
>
>
> Do you (or anyone else) have ideas for how to debug this further, to help find where the reference
> is leaked (or not released)?

I am now quite sure that the problem I was seeing was caused by an skb leak in the mt76 driver
(for which Felix just found a fix). With that fix applied, I no longer see the nf_conntrack
rmmod hangs. I will keep testing in case I am just getting (un)lucky.

I do plan to keep my hack/patch in my kernel though. I would rather it continue and leak some more
memory than busy-hang forever when skb leaks are hit.

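
For readers skimming the thread, the give-up logic in the debug patch quoted above can be
sketched in userspace roughly as follows. This is only an illustration and not kernel code:
drain_with_limit(), cleanup_pass_fn, draining_pass() and stuck_pass() are hypothetical names,
the callback stands in for one nf_ct_iterate_cleanup() pass plus the count check, and the
msleep(500) backoff that the patch applies after 50000 passes is left out so the sketch runs
instantly.

```c
#include <stdbool.h>
#include <stdio.h>

/* Same give-up threshold as the debug patch (loops > 50020). */
#define GIVE_UP_AFTER 50020UL

/*
 * Hypothetical stand-in for one cleanup pass: it returns how many
 * objects are still referenced afterwards.
 */
typedef int (*cleanup_pass_fn)(void *ctx);

/*
 * Retry the cleanup pass until nothing is left, or give up once the
 * number of busy passes exceeds GIVE_UP_AFTER. Returns true if the
 * count drained to zero, false if we stopped waiting. *busy_loops
 * receives the number of passes that still saw references.
 */
bool drain_with_limit(cleanup_pass_fn pass, void *ctx, unsigned long *busy_loops)
{
	unsigned long loops = 0;

	for (;;) {
		int remaining = pass(ctx);

		if (remaining == 0) {
			*busy_loops = loops;
			return true;
		}

		loops++;
		if (loops > GIVE_UP_AFTER) {
			/* Wedged: report it and carry on instead of spinning forever. */
			fprintf(stderr,
				"cleanup cannot make progress: %d object(s) still referenced after %lu passes\n",
				remaining, loops);
			*busy_loops = loops;
			return false;
		}
	}
}

/* Example pass that releases one object per call. */
int draining_pass(void *ctx)
{
	int *count = ctx;

	if (*count > 0)
		(*count)--;
	return *count;
}

/* Example pass that models a leaked reference: the count never drops. */
int stuck_pass(void *ctx)
{
	return *(int *)ctx;
}
```

Calling drain_with_limit(stuck_pass, ...) with a nonzero count models the leaked-skb case: the
loop gives up and returns false rather than hanging, at the cost of leaving the leaked objects
behind.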
Thanks,
Ben