Message-Id: <edb28fe5-cedb-8e63-88b2-122d3dfe3014@linux.vnet.ibm.com>
Date: Mon, 27 Nov 2017 21:44:07 -0500
From: Matthew Rosato <mjrosato@...ux.vnet.ibm.com>
To: Jason Wang <jasowang@...hat.com>, Wei Xu <wexu@...hat.com>
Cc: mst@...hat.com, netdev@...r.kernel.org, davem@...emloft.net
Subject: Re: Regression in throughput between kvm guests over virtual bridge
On 11/27/2017 08:36 PM, Jason Wang wrote:
>
>
> On 2017-11-28 00:21, Wei Xu wrote:
>> On Mon, Nov 20, 2017 at 02:25:17PM -0500, Matthew Rosato wrote:
>>> On 11/14/2017 03:11 PM, Matthew Rosato wrote:
>>>> On 11/12/2017 01:34 PM, Wei Xu wrote:
>>>>> On Sat, Nov 11, 2017 at 03:59:54PM -0500, Matthew Rosato wrote:
>>>>>>>> This case should be quite similar to pktgen: if you got an
>>>>>>>> improvement with pktgen, it is usually the same for UDP. Could you
>>>>>>>> please try disabling tso, gso, gro, and ufo on all host tap devices
>>>>>>>> and guest virtio-net devices? Currently the most significant tests
>>>>>>>> would be like this, AFAICT:
>>>>>>>>
>>>>>>>> Host->VM     4.12    4.13
>>>>>>>> TCP:
>>>>>>>> UDP:
>>>>>>>> pktgen:
>>> So, I automated these scenarios for extended overnight runs and started
>>> experiencing OOM conditions overnight on a 40G system. I did a bisect
>>> and it also points to c67df11f. I can see a leak in at least all of the
>>> Host->VM testcases (TCP, UDP, pktgen), but the pktgen scenario shows the
>>> fastest leak.
>>>
>>> I enabled slub_debug on base 4.13 and ran my pktgen scenario in short
>>> intervals until a large percentage of host memory was consumed. The
>>> numbers below were taken after the last pktgen run completed. The summary
>>> is that a very large number of active skbuff_head_cache entries can be
>>> seen. The sums of the alloc and free calls match up, but the number of
>>> active skbuff_head_cache entries keeps growing each time the workload is
>>> run and never goes back down between runs.
>>>
>>> free -h:
>>>       total  used  free  shared  buff/cache  available
>>> Mem:    39G   31G  6.6G    472K        1.4G       6.8G
>>>
>>> slabtop:
>>>    OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
>>> 1001952 1000610  99%     0.75K  23856        42     763392K  skbuff_head_cache
>>>  126192  126153  99%     0.36K   2868        44      45888K  ksm_rmap_item
>>>  100485  100435  99%     0.41K   1305        77      41760K  kernfs_node_cache
>>>   63294   39598  62%     0.48K    959        66      30688K  dentry
>>>   31968   31719  99%     0.88K    888        36      28416K  inode_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>>     259 __alloc_skb+0x68/0x188 age=1/135076/135741 pid=0-11776 cpus=0,2,4,18
>>> 1000351 __build_skb+0x42/0xb0 age=8114/63172/117830 pid=0-11863 cpus=0,10
>>>
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>>   13492 <not-available> age=4295073614 pid=0 cpus=0
>>>  978298 tun_do_read.part.10+0x18c/0x6a0 age=8532/63624/110571 pid=11733 cpus=1-19
>>>       6 skb_free_datagram+0x32/0x78 age=11648/73253/110173 pid=11325 cpus=4,8,10,12,14
>>>       3 __dev_kfree_skb_any+0x5e/0x70 age=108957/115043/118269 pid=0-11605 cpus=5,7,12
>>>       1 netlink_broadcast_filtered+0x172/0x470 age=136165 pid=1 cpus=4
>>>       2 netlink_dump+0x268/0x2a8 age=73236/86857/100479 pid=11325 cpus=4,12
>>>       1 netlink_unicast+0x1ae/0x220 age=12991 pid=9922 cpus=12
>>>       1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11776 cpus=6
>>>       3 unix_stream_read_generic+0x810/0x908 age=15443/50904/118273 pid=9915-11581 cpus=8,16,18
>>>       2 tap_do_read+0x16a/0x488 [tap] age=42338/74246/106155 pid=11605-11699 cpus=2,9
>>>       1 macvlan_process_broadcast+0x17e/0x1e0 [macvlan] age=18835 pid=331 cpus=11
>>>    8800 pktgen_thread_worker+0x80a/0x16d8 [pktgen] age=8545/62184/110571 pid=11863 cpus=0
>>>
>>>
>>> By comparison, when running 4.13 with c67df11f reverted, here's the same
>>> output after the exact same test:
>>>
>>> free -h:
>>>       total  used  free  shared  buff/cache  available
>>> Mem:    39G  783M   37G    472K        637M        37G
>>>
>>> slabtop:
>>> OBJS  ACTIVE  USE  OBJ SIZE  SLABS  OBJ/SLAB  CACHE SIZE  NAME
>>>  714     256  35%     0.75K     17        42        544K  skbuff_head_cache
>>>
>>> /sys/kernel/slab/skbuff_head_cache/alloc_calls:
>>> 257 __alloc_skb+0x68/0x188 age=0/65252/65507 pid=1-11768 cpus=10,15
>>> /sys/kernel/slab/skbuff_head_cache/free_calls:
>>> 255 <not-available> age=4295003081 pid=0 cpus=0
>>> 1 netlink_broadcast_filtered+0x2e8/0x4e0 age=65601 pid=1 cpus=15
>>> 1 tcp_recvmsg+0x2e2/0xa60 age=0 pid=11768 cpus=16
>>>
>> Thanks a lot for the test, and sorry for the late update. I was working
>> on the code path and didn't find anything helpful to you until today.
>>
>> I did some tests, and initially it looked like the bottleneck was on the
>> guest kernel stack (napi) side. After tracking the traffic footprints, it
>> appears that the loss happens when the vring is full and cannot be drained
>> by the guest. That then triggers an SKB drop in the vhost driver because
>> there is no headcount to fill it with. This can be avoided by deferring
>> consumption of the SKB until a sufficient headcount has been obtained,
>> as in the patch below.
>>
>> Could you please try it? It is based on 4.13, and I also applied Jason's
>> 'conditionally enable tx polling' patch:
>> https://lkml.org/lkml/2016/6/1/39
>
> This patch has already been merged.
>
>>
>> I only tested the single-instance case from Host -> VM with uperf and
>> iperf3. I like iperf3 a bit more since it reports the retransmits and
>> cwnd by itself during testing. :)
>>
>> To maximize performance in the single-instance case, two vcpus are needed:
>> one handles the kernel napi and the other serves the socket syscalls
>> (mostly reads) from uperf/iperf userspace. So I gave the guest two vcpus
>> and pinned the iperf/uperf slave to the one not used by kernel napi. You
>> may need to check which one to pin to by looking at the CPU utilization
>> in a quick trial run before starting the long-duration test.
>>
>> There is a slight performance improvement for TCP with the patch
>> (host/guest offload off) on x86. 4.12 still comes out ahead in roughly
>> 20-30% of runs, but the cwnd and retransmit statistics are almost the
>> same now; before the patch, 'retrans' was about 10x higher and cwnd was
>> about 6x smaller than on 4.12.
>>
>> Here is one typical sample of my tests.
>>               4.12        4.13
>> offload on:   36.8Gbits   37.4Gbits
>> offload off:  7.68Gbits   7.84Gbits
>>
>> I also borrowed an s390x machine with 6 cpus and 4G of memory from the
>> System z team. It seems 4.12 is still a bit faster than 4.13 there. Could
>> you please check whether this lines up with your test bed?
>>               4.12        4.13
>> offload on:   37.3Gbits   38.3Gbits
>> offload off:  6.26Gbits   6.06Gbits
>>
>> For pktgen, I got a 10% improvement (xdp1 drop on guest), which is a bit
>> faster than Jason's earlier number.
>> 4.12        4.13
>> 3.33 Mpps   3.70 Mpps
>>
>> Thanks again for all the tests you have done.
>>
>> Wei
>>
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -776,8 +776,6 @@ static void handle_rx(struct vhost_net *net)
>>  		/* On error, stop handling until the next kick. */
>>  		if (unlikely(headcount < 0))
>>  			goto out;
>> -		if (nvq->rx_array)
>> -			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>>  		/* On overrun, truncate and discard */
>>  		if (unlikely(headcount > UIO_MAXIOV)) {
>
> I think you need to do msg.msg_control = vhost_net_buf_consume() here too.
>
>>  			iov_iter_init(&msg.msg_iter, READ, vq->iov, 1, 1);
>> @@ -798,6 +796,10 @@ static void handle_rx(struct vhost_net *net)
>>  			 * they refilled. */
>>  			goto out;
>>  		}
>> +
>> +		if (nvq->rx_array)
>> +			msg.msg_control = vhost_net_buf_consume(&nvq->rxq);
>> +
>>  		/* We don't need to be notified again. */
>>  		iov_iter_init(&msg.msg_iter, READ, vq->iov, in, vhost_len);
>>  		fixup = msg.msg_iter;
>>
>>
>
> Good catch, this fixes the memory leak too.
>
> I suggest posting a formal patch for -net as soon as possible, since it
> is a valid fix even if it does not help performance.
>> Thanks
>
+1 to posting this patch formally. I also verified that it resolves the
memory leak I was experiencing.
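
For anyone following along, the ordering problem the patch addresses can be modelled in a few lines of user-space C. This is only a toy sketch under stated assumptions, not the vhost code: struct pkt, struct rx_queue, and every function name below are hypothetical stand-ins, and "headcount" is reduced to 0 (guest has no buffers) or 1. It shows why consuming an entry before checking the headcount loses the entry whenever the check fails, and why deferring the consume avoids that.

```c
#include <string.h>

#define QLEN 64

/* Toy stand-in for an skb; NOT the kernel structure. */
struct pkt {
    int id;
    int freed;   /* set once the packet has been delivered and released */
};

/* Toy stand-in for the batched rx queue (hypothetical layout). */
struct rx_queue {
    struct pkt pkts[QLEN];
    int head, tail;
};

typedef void (*rx_fn)(struct rx_queue *, int);

static void rxq_produce(struct rx_queue *q, int id)
{
    q->pkts[q->tail].id = id;
    q->pkts[q->tail].freed = 0;
    q->tail++;
}

/* Pops the next packet; from here on the caller owns it. */
static struct pkt *rxq_consume(struct rx_queue *q)
{
    return q->head < q->tail ? &q->pkts[q->head++] : NULL;
}

static void deliver(struct pkt *p)
{
    p->freed = 1;   /* handing it to the guest ends its life */
}

/* Early consume: take the packet, then find out there is no headcount. */
static void rx_once_early(struct rx_queue *q, int headcount)
{
    struct pkt *p = rxq_consume(q);

    if (!headcount)
        return;     /* p is neither delivered nor put back: lost */
    if (p)
        deliver(p);
}

/* Deferred consume: only take the packet once a headcount is available. */
static void rx_once_deferred(struct rx_queue *q, int headcount)
{
    struct pkt *p;

    if (!headcount)
        return;     /* nothing consumed, so nothing can be lost */
    p = rxq_consume(q);
    if (p)
        deliver(p);
}

/*
 * Queue n packets, then make 2*n receive attempts in which the guest has
 * no buffers on every other attempt. Returns how many packets were
 * consumed but never freed.
 */
static int simulate(rx_fn rx_once, int n)
{
    struct rx_queue q;
    int i, leaked = 0;

    memset(&q, 0, sizeof(q));
    if (n > QLEN)
        n = QLEN;
    for (i = 0; i < n; i++)
        rxq_produce(&q, i);
    for (i = 0; i < 2 * n; i++)
        rx_once(&q, i % 2);   /* headcount alternates 0, 1, 0, 1, ... */
    for (i = 0; i < q.head; i++)
        if (!q.pkts[i].freed)
            leaked++;
    return leaked;
}
```

Calling simulate(rx_once_early, 10) and simulate(rx_once_deferred, 10) from a small main() contrasts the two orderings: in this model the early variant strands every packet consumed on a "no buffers" attempt, while the deferred variant strands none.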

In terms of performance numbers, here are some quick numbers from the
original environment where the regression was noted (4GB, 4-vcpu guests,
no CPU binding, TCP VM<->VM):

4.12:               34.71Gb/s
4.13:               18.80Gb/s
4.13+ (with patch): 38.26Gb/s

I'll keep running numbers, but that looks very promising.
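
For reference, the relative change against the 4.12 baseline can be worked out from the three figures above. A minimal sketch in C follows; the names struct result, pct_change, and fill_deltas are hypothetical, and the only inputs are the numbers quoted in this message.

```c
#include <stddef.h>

/* One throughput measurement, in Gb/s, as quoted above. */
struct result {
    const char *kernel;
    double gbps;
    double delta_pct;   /* filled in relative to the baseline entry */
};

/* Percent change of value relative to baseline. */
static double pct_change(double baseline, double value)
{
    return (value - baseline) / baseline * 100.0;
}

/* Treats r[0] as the baseline and fills delta_pct for every entry. */
static void fill_deltas(struct result *r, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        r[i].delta_pct = pct_change(r[0].gbps, r[i].gbps);
}

/* The three TCP VM<->VM figures from this message. */
static struct result results[] = {
    { "4.12",  34.71, 0.0 },
    { "4.13",  18.80, 0.0 },
    { "4.13+", 38.26, 0.0 },
};
```

After fill_deltas(results, 3), the 4.13 entry comes out at roughly a 46% drop from 4.12 and the 4.13+ entry at roughly a 10% gain, which is a single quick sample rather than a settled result.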