Date:   Wed, 13 Sep 2017 12:59:02 -0400
From:   Matthew Rosato <mjrosato@...ux.vnet.ibm.com>
To:     Jason Wang <jasowang@...hat.com>, netdev@...r.kernel.org
Cc:     davem@...emloft.net, mst@...hat.com
Subject: Re: Regression in throughput between kvm guests over virtual bridge

On 09/13/2017 04:13 AM, Jason Wang wrote:
> 
> 
> On 09/13/2017 09:16, Jason Wang wrote:
>>
>>
>> On 09/13/2017 01:56, Matthew Rosato wrote:
>>> We are seeing a regression for a subset of workloads across KVM guests
>>> over a virtual bridge between host kernel 4.12 and 4.13. Bisecting
>>> points to c67df11f "vhost_net: try batch dequing from skb array"
>>>
>>> In the regressed environment, we are running 4 kvm guests, 2 running as
>>> uperf servers and 2 running as uperf clients, all on a single host.
>>> They are connected via a virtual bridge.  The uperf client profile looks
>>> like:
>>>
>>> <?xml version="1.0"?>
>>> <profile name="TCP_STREAM">
>>>    <group nprocs="1">
>>>      <transaction iterations="1">
>>>        <flowop type="connect" options="remotehost=192.168.122.103
>>> protocol=tcp"/>
>>>      </transaction>
>>>      <transaction duration="300">
>>>        <flowop type="write" options="count=16 size=30000"/>
>>>      </transaction>
>>>      <transaction iterations="1">
>>>        <flowop type="disconnect"/>
>>>      </transaction>
>>>    </group>
>>> </profile>
>>>
>>> So, 1 tcp streaming instance per client.  When upgrading the host kernel
>>> from 4.12->4.13, we see about a 30% drop in throughput for this
>>> scenario.  After the bisect, I further verified that reverting c67df11f
>>> on 4.13 "fixes" the throughput for this scenario.
>>>
>>> On the other hand, if we increase the load by upping the number of
>>> streaming instances to 50 (nprocs="50") or even 10, we see instead a
>>> ~10% increase in throughput when upgrading host from 4.12->4.13.
>>>
>>> So it may be that the issue is specific to "light load" scenarios.  I
>>> would expect some overhead from the batching, but 30% seems
>>> significant...  Any thoughts on what might be happening here?
>>>
>>
>> Hi, thanks for bisecting. Will try to see if I can reproduce.
>> Various factors could have an impact on stream performance. If possible,
>> could you collect the #pkts and average packet size during the test?
>> And if your guest version is above 4.12, could you please retry with
>> napi_tx=true?

Original runs were done with guest kernel 4.4 (from Ubuntu 16.04.3 -
4.4.0-93-generic, specifically).  Here's the throughput reported by uperf,
plus the #pkts and average packet size (from tcpstat), for one of the
uperf clients:

host 4.12 / guest 4.4:
throughput: 29.98Gb/s
#pkts=33465571 avg packet size=33755.70

host 4.13 / guest 4.4:
throughput: 20.36Gb/s
#pkts=21233399 avg packet size=36130.69

I ran the test again using net-next.git as the guest kernel, with and
without napi_tx=true.  napi_tx did not seem to have any significant
impact on throughput.  However, moving the guest kernel from 4.4 to
net-next improved things.  I can still see a regression between host
4.12 and 4.13, but it's more on the order of 10-15% -- another sample:

host 4.12 / guest net-next (without napi_tx):
throughput: 28.88Gb/s
#pkts=31743116 avg packet size=33779.78

host 4.13 / guest net-next (without napi_tx):
throughput: 24.34Gb/s
#pkts=25532724 avg packet size=35963.20
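
In case it helps with reproducing, here is a rough sketch of one way to
drive the same workload and collect the same stats.  The interface name,
profile filename, and the exact tcpstat/napi_tx invocations below are
examples, not the precise commands from these runs:

  # on each uperf server guest
  uperf -s &

  # on each uperf client guest, driving the TCP_STREAM profile above
  # (saved as, e.g., tcp_stream.xml)
  uperf -m tcp_stream.xml

  # packet count (n=) and average packet size (avg=) on the client's
  # bridge interface, sampled over the 300s write transaction
  tcpstat -i eth0 300

  # to retest with tx napi on a >= 4.12 guest, set the virtio_net
  # module parameter, e.g. on the guest kernel command line:
  #   virtio_net.napi_tx=1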

>>
>> Thanks
> 
> Unfortunately, I could not reproduce it locally. I'm using net-next.git
> as the guest. I can get ~42Gb/s on an Intel(R) Xeon(R) CPU E5-2650 0 @
> 2.00GHz both before and after the commit. I use 1 vcpu and 1 queue, and
> pin the vcpu and vhost threads onto separate host cpus manually (in the
> same NUMA node).

The environment is quite a bit different -- I'm running in an LPAR on a
z13 (s390x).  We've seen the issue in various configurations; the
smallest thus far was a host partition w/ 40GB of memory and 20 CPUs
defined (the numbers above were gathered w/ this configuration).  Each
guest has 4GB and 4 vcpus.  No pinning / affinity configured.

> 
> Can you hit this regression consistently, and what's your qemu command line

Yes, the regression seems consistent.  I can try tweaking some of the
host and guest definitions to see if it makes a difference.
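
If pinning turns out to matter, here's a rough sketch of what could be
tried to mirror your setup (the guest name comes from the libvirt domain
shown below; the CPU numbers and vhost-thread lookup are just examples):

  # pin each guest vcpu to a dedicated host cpu
  virsh vcpupin mjrs34g1 0 2
  virsh vcpupin mjrs34g1 1 3

  # vhost worker threads are named vhost-<qemu pid>; find and pin them
  ps -eo pid,comm | grep vhost-
  taskset -pc 4 <vhost thread pid>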

The guests are instantiated via libvirt -- here's one of the resulting
qemu command lines:

/usr/bin/qemu-system-s390x -name guest=mjrs34g1,debug-threads=on -S
-object
secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-mjrs34g1/master-key.aes
-machine s390-ccw-virtio-2.10,accel=kvm,usb=off,dump-guest-core=off -m
4096 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid
44710587-e783-4bd8-8590-55ff421431b1 -display none -no-user-config
-nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-1-mjrs34g1/monitor.sock,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on -drive
file=/dev/disk/by-id/scsi-3600507630bffc0380000000000001803,format=raw,if=none,id=drive-virtio-disk0
-device
virtio-blk-ccw,scsi=off,devno=fe.0.0000,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=27 -device
virtio-net-ccw,netdev=hostnet0,id=net0,mac=02:de:26:53:14:01,devno=fe.0.0001
-netdev tap,fd=28,id=hostnet1,vhost=on,vhostfd=29 -device
virtio-net-ccw,netdev=hostnet1,id=net1,mac=02:54:00:89:d4:01,devno=fe.0.00a1
-chardev pty,id=charconsole0 -device
sclpconsole,chardev=charconsole0,id=console0 -device
virtio-balloon-ccw,id=balloon0,devno=fe.0.0002 -msg timestamp=on

In the above, net0 is used for a macvtap connection (not used in the
experiment, just for a reliable ssh connection - can remove if needed).
net1 is the bridge connection used for the uperf tests.
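
(If it's useful for mapping the two netdevs back to the host side, the
libvirt view is clearer than the raw tap fds; guest name as above:)

  # lists each guest interface with its type, host source device and MAC
  virsh domiflist mjrs34g1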


> and #cpus on host? Is zerocopy enabled?

Host info provided above.

cat /sys/module/vhost_net/parameters/experimental_zcopytx
1
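
If it would help to rule zerocopy out, it can be disabled by reloading
vhost_net with the parameter off (the guests need to be shut down first,
since the module will otherwise be in use):

  modprobe -r vhost_net
  modprobe vhost_net experimental_zcopytx=0
  cat /sys/module/vhost_net/parameters/experimental_zcopytx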


