netdev - vhost_net: VM looses network when using vhost over time

Message-ID: <872691802.5840849.1505918694826.JavaMail.zimbra@spreadshirt.net>
Date:   Wed, 20 Sep 2017 14:44:54 +0000 (UTC)
From:   Bernd Naumann <bena@...eadshirt.net>
To:     qemu-discuss@...gnu.org
Cc:     Linux Kernel Network Developers <netdev@...r.kernel.org>
Subject: vhost_net: VM looses network when using vhost over time

Hi @all,

We have encountered a bug that occurs fairly regularly, but we do not know how to trigger it reliably or how to debug it in the first place.


# Background

In our setup we have a Ganeti cluster (KVM) with currently ~60 nodes running ~500 VMs. We use tap interfaces on L2 bridges, L3 routed tap interfaces, and tap interfaces on a bridge with a VTEP attached to it. (For the VXLAN setup we have a home-grown daemon that maintains the FDB.)


# The issue

On some VMs we lose network connectivity under unknown circumstances.
"Losing" means that the VM is not reachable and therefore cannot reach any other host in the network.

However, with `tcpdump` on the host (physical NIC + bridge) we can see the traffic going in, but with `tcpdump` in the VM we only see ARP coming in and nothing going out. Manual ARP intervention, such as setting the ARP entry by hand or running `ip link set $DEV arp off; ip link set $DEV arp on`, does not help at all, or only for a moment. The only ways we found to "fix" it are rebooting the VM or running `modprobe -r virtio_net; modprobe virtio_net`, but this does not seem to be a good workaround either, and the issue can recur shortly afterwards. It is also difficult to determine when the issue kicks in. Counting 'FAILED' neighbors is an indicator, but nothing to rely on.

The frequency of the issue ranges from once in a few days, to multiple times per day, or even a few minutes after boot. We see the most impact on VMs with higher network traffic, like our gateway VMs (multiple NICs in different networks, IPsec, iptables, ...) and HAProxy VMs (similar to our gateways), but also (with reduced frequency) on /normal/ application VMs.

From what we have found so far, it looks similar to:
* https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/997978 -- Bug #997978 “KVM images lose connectivity with bridged network” : Bugs : qemu-kvm package : Ubuntu
* https://bugs.centos.org/view.php?id=5526 -- 0005526: KVM Guest with virtio network loses network connectivity - CentOS Bug Tracker

Via `rtmon` we can observe that it starts with some "FAILED" neighbor entries and that their number increases over time. Since this is only a consequence of ARP replies not being sent to the requester, or of ARP requests going unanswered (because the packet never leaves the VM), the increasing count of 'FAILED' neighbors is /normal/. BUT: this can start on any interface (bridged tap interface for WAN, bridged tap in VXLAN, routed tap); it is not tied to the "kind" of interface.
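
To make the 'FAILED' neighbor count easier to track per interface, something like the following could be run periodically inside a VM. This is only a minimal sketch: the function name `count_failed_neighbors` is made up for illustration, and it assumes the usual `ip neigh show` output layout, where the interface name follows the `dev` keyword and the neighbor state is the last field.

```shell
#!/bin/sh
# Hypothetical helper: count FAILED neighbor entries per interface.
# Reads `ip neigh show` output on stdin and prints "<interface> <count>" lines.
# Assumes the state is the last field and the interface follows "dev".
count_failed_neighbors() {
    awk '$NF == "FAILED" {
             for (i = 1; i < NF; i++)
                 if ($i == "dev") n[$(i + 1)]++
         }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage on a live system:
#   ip neigh show | count_failed_neighbors
```

As noted above, a rising count is only a symptom, so this is useful for spotting when the problem starts, not for locating its cause.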


# General overview of the setup

* Ganeti cluster with ~60 nodes
* each node has 2 x 50G (mlnx5 dual-port) connected to 2 x MLNX SN2700 switches
* each node runs `bird` with OSPF and ECMP (and OSPF with ECMP on SN2700 too)
* each VM has one or more vNICs in a bridged or routed network
* networks: bridged tap in WAN; bridged tap with attached VTEP; routed tap
* host OS: Ubuntu 16.04.3 with Ubuntu Kernel 4.12.13; first tested with qemu-kvm 1:2.5+dfsg-5ubuntu10.15, and later upgraded to qemu-kvm 2.10~rc3+dfsg-0ubuntu1, same issue; guest OS: Ubuntu 14.04, Ubuntu 16.04, and Ubuntu 16.04 with the latest kernel from the Ubuntu mainline kernel PPA


# So far we can "verify" it is 'vhost'

Without "vhost=on" for the kvm process we cannot observe this issue. With "vhost=on", an affected VM can be "fixed" by `rmmod virtio_net` followed by `insmod virtio_net`, but a reboot seems to provide a "fix" for a "longer" period. (But as you may know, virtio without vhost does not have the performance we expect.)
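
To double-check on a host which kvm processes actually run with vhost, one option is to look for the vhost worker kernel threads, which are named `vhost-<pid of the owning process>`. The sketch below is an illustration only: the function name `vhost_owner_pids` is made up, and it assumes `ps` is called with `pid` and `comm` columns as shown in the usage comment.

```shell
#!/bin/sh
# Hypothetical helper: list PIDs of processes that own vhost worker threads.
# Reads `ps -e -o pid=,comm=` output on stdin and prints each owner PID once.
# Assumes the worker threads show up with the command name "vhost-<owner pid>".
vhost_owner_pids() {
    awk '$2 ~ /^vhost-[0-9]+$/ { sub(/^vhost-/, "", $2); print $2 }' | sort -n -u
}

# Usage on a host:
#   ps -e -o pid=,comm= | vhost_owner_pids
```

A kvm process whose PID does not appear in the output is presumably running without vhost workers, which makes it easier to compare affected and unaffected VMs.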


So we have some questions:

* How can we debug the main issue to provide a meaningful bug report? Are there kernel debug flags to enable, and where should we attach gdb? Sadly we are no kernel hackers :/, but we can compile our own kernel and qemu-kvm to test release candidates and/or apply patches.
* Has anyone seen this too? Can anyone provide a better workaround, a patch, or anything else?
* Where to file/reopen this issue? qemu, netdev?
* Is qemu-kvm even the right place to look for answers?

We are happy to provide more information or collect debug information if someone wants to investigate.

Thanks for your time!
Best,
Bernd Naumann

Spreadshirt 
Bernd Naumann 
Systems Engineer, Networking & Operations 
bernd.naumann@...eadshirt.net 

http://www.spreadshirt.com 

sprd.net AG 
Gießerstraße 27 
D-04229 Leipzig 

Fon: +49 341 594 00 - 5900 
Fax: +49 341 594 00 - 5149 

Vorstand / executive board: Philip Rooke (CEO/Vorsitzender) · Tobias Schaugg 
Aufsichtsratsvorsitzender / chairman of the supervisory board: Lukasz Gadowski 
Handelsregister / trade register: Amtsgericht Leipzig, HRB 22478 
Umsatzsteuer-IdentNummer / VAT-ID: DE 8138 7149 4