netdev - Re: INFO: task hung in ip6gre_exit_batch

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <04b6ee08-7919-bf2d-5b77-bd346a0bff48@virtuozzo.com>
Date:   Thu, 7 Jun 2018 21:54:55 +0300
From:   Kirill Tkhai <ktkhai@...tuozzo.com>
To:     Dmitry Vyukov <dvyukov@...gle.com>
Cc:     syzbot <syzbot+bf78a74f82c1cf19069e@...kaller.appspotmail.com>,
        Christian Brauner <christian.brauner@...ntu.com>,
        David Miller <davem@...emloft.net>,
        David Ahern <dsahern@...il.com>,
        Florian Westphal <fw@...len.de>, Jiri Benc <jbenc@...hat.com>,
        LKML <linux-kernel@...r.kernel.org>,
        Xin Long <lucien.xin@...il.com>,
        mschiffer@...verse-factory.net, netdev <netdev@...r.kernel.org>,
        syzkaller-bugs <syzkaller-bugs@...glegroups.com>,
        Vladislav Yasevich <vyasevich@...il.com>
Subject: Re: INFO: task hung in ip6gre_exit_batch_net

On 07.06.2018 21:23, Dmitry Vyukov wrote:
> On Tue, Jun 5, 2018 at 3:55 PM, Kirill Tkhai <ktkhai@...tuozzo.com> wrote:
>> On 05.06.2018 12:36, Dmitry Vyukov wrote:
>>> On Tue, Jun 5, 2018 at 11:03 AM, Kirill Tkhai <ktkhai@...tuozzo.com> wrote:
>>>> Hi, Dmirty!
>>>>
>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>> <syzbot+bf78a74f82c1cf19069e@...kaller.appspotmail.com> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following crash on:
>>>>>>
>>>>>> HEAD commit:    bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>> git tree:       upstream
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>
>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>
>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@...kaller.appspotmail.com
>>>>>
>>>>> Another hang on rtnl lock:
>>>>>
>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>
>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>
>>> netdev_wait_allrefs does not hold rtnl lock during waiting, so it must
>>> be something different.
>>>
>>>
>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>
>>>> To exclude the situation, when a task exists with rtnl_mutex held:
>>>>
>>>> would the pr_warn() from print_held_locks_bug() be included in the console output
>>>> if they appear?
>>>
>>> Yes, everything containing "WARNING:" is detected as bug.
>>
>> OK, then dead task not releasing the lock is excluded.
>>
>> One more assumption: someone corrupted memory around rtnl_mutex and it looks like locked.
>> (I track lockdep "(rtnl_mutex){+.+.}" prints in initial message as "nobody owns rtnl_mutex").
>> There may help a crash dump of the VM.
> 
> I can't find any legend for these +'s and .'s, but {+.+.} is present
> in large amounts in just any task hung report for different mutexes,
> so I would not expect that it means corruption.
> 
> Are dozens of known corruptions that syzkaller can trigger. But
> usually they are reliably caught by KASAN. If any of them would lead
> to silent memory corruption, we would got dozens of assorted crashes
> throughout the kernel. We've seen that at some points, but not
> recently. So I would assume that memory is not corrupted in all these
> cases:
> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545

This BUG clarifies the {+.+.}:

4 locks held by kworker/0:145/381:
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
 #0:  ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
 #1:  ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
 #2:  (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
 #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
 #3:  (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596

There we have rtnl_mutex locked and the {..} is like above. It's definitely locked
since there is one more lock after it.

This BUG happen because of there are many rtnl_mutex waiters while owner
is synchronizing RCU. Rather clear for me in comparison to the topic's hung.
 
> I wonder if it can be just that slow, but not actually hanged... net
> namespace destruction is super slow, so perhaps under heavy load it
> all stalls for minutes...

Thanks,
Kirill