lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <87vakgtnl2.fsf@vitty.brq.redhat.com>
Date:   Mon, 18 Sep 2017 10:31:21 +0200
From:   Vitaly Kuznetsov <vkuznets@...hat.com>
To:     Stephen Hemminger <stephen@...workplumber.org>
Cc:     KY Srinivasan <kys@...rosoft.com>, leann.ogasawara@...onical.com,
        Stephen Hemminger <sthemmin@...rosoft.com>,
        "apw\@canonical.com" <apw@...onical.com>,
        "olaf\@aepfle.de" <olaf@...fle.de>,
        "marcelo.cerri\@canonical.com" <marcelo.cerri@...onical.com>,
        "gregkh\@linuxfoundation.org" <gregkh@...uxfoundation.org>,
        Haiyang Zhang <haiyangz@...rosoft.com>,
        "linux-kernel\@vger.kernel.org" <linux-kernel@...r.kernel.org>,
        "jasowang\@redhat.com" <jasowang@...hat.com>,
        "devel\@linuxdriverproject.org" <devel@...uxdriverproject.org>
Subject: Re: [PATCH 1/1] Drivers: hv: vmbus: Fix rescind handling issues

Stephen Hemminger <stephen@...workplumber.org> writes:

> On Sep 15, 2017 11:01 AM, "KY Srinivasan" <kys@...rosoft.com> wrote:
>
>  > Vitaly Kuznetsov <vkuznets@...hat.com> writes:
>  >
>  > >
>  > > I'm seeing the same issue, reverting the offending
>  > >
>  > > commit 6f3d791f300618caf82a2be0c27456edd76d5164
>  > > Author: K. Y. Srinivasan <kys@...rosoft.com>
>  > > Date: Fri Aug 11 10:03:59 2017 -0700
>  > >
>  > > Drivers: hv: vmbus: Fix rescind handling issues
>  > >
>  > > which made it upstream helps. Did you guys do some investigation here?
>  > > In case not I can take a look next week.
>  >
>  > Sorry, I have to resend - K.Y.'s kys@...hange.microsoft.com doesn't
>  > accept mail from me :-(
>
>  I think this turned out to be a bug in netvsc - Stephen can elaborate. I think the fix
>  has been submitted upstream as well.
>
>  K. Y
>  >
>  > --
>  > Vitaly
>
> It turned out that subchannel call back is run from primary channel call back, and was deadlocking waiting for vmbus open.
>
> Original code had broken wait for sub channels.
> The first for that induced this bug.
>
> The resolution is in latest network driver was to bring up sub channels from work queue

This is a bit confusing, in case you mean

commit 8195b1396ec86dddbba443c74b2188b423556c74
Author: Stephen Hemminger <stephen@...workplumber.org>
Date:   Wed Sep 6 13:53:05 2017 -0700

    hv_netvsc: fix deadlock on hotplug

is supposed to fix the issue. I'm testing the latest net-next which has
it:

$ git log --oneline drivers/net/hyperv/
5023a6db7319 netvsc: increase default receive buffer size
8f2bb1de7334 hv_netvsc: avoid unnecessary wakeups on subchannel creation
8195b1396ec8 hv_netvsc: fix deadlock on hotplug
db3cd7af9d0f hv_netvsc: Fix the channel limit in netvsc_set_rxfh()
06be580ac7b6 hv_netvsc: Simplify the limit check in netvsc_set_channels()
...

and when I do

# ip link set dev eth0 mtu 1400           
# ip link set dev eth0 mtu 1500           
(actually, you can do mtu change just once the deadlock has already
happened, doing it twice just reveals the issue faster - it will hang
permanently trying to get rtnl lock which is already taken)

The log is:

[  248.800089] INFO: task kworker/u480:0:5 blocked for more than 120 seconds.
[  248.804065]       Not tainted 4.14.0-rc1 #63
[  248.806486] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.810879] kworker/u480:0  D    0     5      2 0x80000000
[  248.814225] Workqueue: netns cleanup_net
[  248.816745] Call Trace:
[  248.818286]  __schedule+0x1d5/0x790
[  248.820398]  ? sched_clock+0x9/0x10
[  248.822297]  ? select_idle_sibling+0x24/0x420
[  248.824883]  schedule+0x31/0x80
[  248.826835]  schedule_preempt_disabled+0x9/0x10
[  248.829444]  __mutex_lock.isra.1+0x1a3/0x4f0
[  248.831973]  ? sched_clock_cpu+0x5d/0xb0
[  248.834300]  __mutex_lock_slowpath+0xe/0x10
[  248.836869]  ? __mutex_lock_slowpath+0xe/0x10
[  248.839488]  mutex_lock+0x2a/0x30
[  248.841378]  rtnl_lock+0x10/0x20
[  248.843431]  cleanup_net+0x94/0x2e0
[  248.845479]  process_one_work+0x193/0x390
[  248.847838]  worker_thread+0x48/0x3c0
[  248.850121]  kthread+0x120/0x140
[  248.851927]  ? process_one_work+0x390/0x390
[  248.854425]  ? kthread_create_on_node+0x60/0x60
[  248.856884]  ? kthread_create_on_node+0x60/0x60
[  248.859585]  ret_from_fork+0x25/0x30
[  248.861581] INFO: task kworker/9:2:374 blocked for more than 120 seconds.
[  248.865286]       Not tainted 4.14.0-rc1 #63
[  248.867655] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.872135] kworker/9:2     D    0   374      2 0x80000000
[  248.875175] Workqueue: events rndis_set_subchannel [hv_netvsc]
[  248.878473] Call Trace:
[  248.879825]  __schedule+0x1d5/0x790
[  248.882122]  schedule+0x31/0x80
[  248.884160]  rndis_set_subchannel+0x186/0x1e0 [hv_netvsc]
[  248.887741]  ? finish_wait+0x80/0x80
[  248.889960]  process_one_work+0x193/0x390
[  248.892349]  worker_thread+0x48/0x3c0
[  248.894484]  kthread+0x120/0x140
[  248.896364]  ? process_one_work+0x390/0x390
[  248.898756]  ? kthread_create_on_node+0x60/0x60
[  248.901229]  ret_from_fork+0x25/0x30
[  248.903180] INFO: task hypervkvpd:656 blocked for more than 120 seconds.
[  248.907120]       Not tainted 4.14.0-rc1 #63
[  248.909966] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  248.914387] hypervkvpd      D    0   656      1 0x00000000
[  248.917618] Call Trace:
[  248.919119]  __schedule+0x1d5/0x790
[  248.921343]  schedule+0x31/0x80
[  248.924454]  schedule_preempt_disabled+0x9/0x10
[  248.927792]  __mutex_lock.isra.1+0x1a3/0x4f0
[  248.930951]  __mutex_lock_slowpath+0xe/0x10
[  248.934078]  ? __mutex_lock_slowpath+0xe/0x10
[  248.937466]  mutex_lock+0x2a/0x30
[  248.940228]  __netlink_dump_start+0x44/0x190
[  248.943286]  rtnetlink_rcv_msg+0x1a2/0x280
[  248.946237]  ? vsnprintf+0xea/0x4d0
[  248.948888]  ? rtnl_getlink+0x1c0/0x1c0
[  248.951713]  ? rtnl_getlink+0x1c0/0x1c0
[  248.954644]  ? rtnl_calcit.isra.26+0x110/0x110
[  248.958010]  netlink_rcv_skb+0x89/0x130
[  248.960945]  rtnetlink_rcv+0x10/0x20
[  248.963566]  netlink_unicast+0x186/0x220
[  248.966442]  netlink_sendmsg+0x2a8/0x3a0
[  248.969369]  sock_sendmsg+0x33/0x40
[  248.971955]  SYSC_sendto+0x13f/0x180
[  248.974552]  ? SYSC_getsockname+0xc7/0xe0
[  248.977393]  ? fd_install+0x20/0x30
[  248.980044]  ? sock_map_fd+0x3f/0x60
[  248.982995]  SyS_sendto+0x9/0x10
[  248.986106]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[  248.989785] RIP: 0033:0x7f544421f0d3
[  248.992576] RSP: 002b:00007ffcb8a81dc8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  248.997458] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f544421f0d3
[  249.002140] RDX: 0000000000000014 RSI: 00007ffcb8a82e20 RDI: 000000000000000a
[  249.006863] RBP: 00007ffcb8a84e50 R08: 00007ffcb8a82e00 R09: 000000000000000c
[  249.011590] R10: 0000000000000000 R11: 0000000000000246 R12: 00005598e3c4d963
[  249.016558] R13: 00005598e3bb1110 R14: 00005598e3c4d8c0 R15: 0000000000000011
[  249.021481] INFO: task ip:1004 blocked for more than 120 seconds.
[  249.025703]       Not tainted 4.14.0-rc1 #63
[  249.028713] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  249.039417] ip              D    0  1004    938 0x00000000
[  249.043418] Call Trace:
[  249.045633]  __schedule+0x1d5/0x790
[  249.048335]  schedule+0x31/0x80
[  249.050795]  schedule_preempt_disabled+0x9/0x10
[  249.054109]  __mutex_lock.isra.1+0x1a3/0x4f0
[  249.057251]  ? security_capable+0x43/0x60
[  249.060188]  __mutex_lock_slowpath+0xe/0x10
[  249.063299]  ? __netlink_ns_capable+0x36/0x40
[  249.066416]  ? __mutex_lock_slowpath+0xe/0x10
[  249.069520]  mutex_lock+0x2a/0x30
[  249.072032]  rtnetlink_rcv_msg+0x1cb/0x280
[  249.075189]  ? rtnl_calcit.isra.26+0x110/0x110
[  249.079047]  netlink_rcv_skb+0x89/0x130
[  249.082209]  rtnetlink_rcv+0x10/0x20
[  249.085369]  netlink_unicast+0x186/0x220
[  249.088767]  netlink_sendmsg+0x2a8/0x3a0
[  249.091777]  sock_sendmsg+0x33/0x40
[  249.094353]  SYSC_sendto+0x13f/0x180
[  249.096938]  ? handle_mm_fault+0x65/0xf0
[  249.099824]  ? __do_page_fault+0x250/0x4a0
[  249.102817]  SyS_sendto+0x9/0x10
[  249.105240]  entry_SYSCALL_64_fastpath+0x1a/0xa5
[  249.108489] RIP: 0033:0x7f972cc28f7d
[  249.111176] RSP: 002b:00007ffed31097e8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[  249.115958] RAX: ffffffffffffffda RBX: 000000000066fc90 RCX: 00007f972cc28f7d
[  249.120548] RDX: 0000000000000020 RSI: 00007ffed3109800 RDI: 0000000000000003
[  249.125439] RBP: 0000000000000005 R08: 0000000000000000 R09: 0000000000000000
[  249.130397] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
[  249.135432] R13: 00007ffed3109ed0 R14: 00007ffed3109ed8 R15: 00007ffed310ae1f

Reverting 6f3d791f300618caf82a2be0c27456edd76d5164 still helps. Or do
you mean some other fix which is not yet in net-next?

-- 
  Vitaly

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ