netdev - Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread patch

Date:	Thu, 05 Apr 2012 08:22:07 -0700
From:	Shirley Ma <mashirle@...ibm.com>
To:	"Michael S. Tsirkin" <mst@...hat.com>
Cc:	Jason Wang <jasowang@...hat.com>, netdev@...r.kernel.org,
	kvm@...r.kernel.org, tahm@...ux.vnet.ibm.com, vivek@...ibm.com
Subject: Re: [RFC PATCH 0/1] NUMA aware scheduling per cpu vhost thread
 patch

On Thu, 2012-04-05 at 15:28 +0300, Michael S. Tsirkin wrote:
> On Tue, Mar 27, 2012 at 10:43:03AM -0700, Shirley Ma wrote:
> > On Tue, 2012-03-27 at 18:09 +0800, Jason Wang wrote:
> > > Hi:
> > > 
> > > Thanks for the work and it looks very reasonable, some questions
> > > below.
> 
> Yes I am happy to see the per-cpu work resurrected.
> Some comments below.
Glad to see you have time to review this.

> > > On 03/23/2012 07:48 AM, Shirley Ma wrote:
> > > > Sorry for being late to submit this patch. I have spent lots of time
> > > > trying to find the best approach. This effort is still going on...
> > > >
> > > > This patch is built against net-next tree.
> > > >
> > > > This is an experimental RFC patch. The purpose of this patch is to
> > > > address KVM networking scalability and NUMA scheduling issue.
> > > 
> > > Also need to test on a non-NUMA machine. I see that you just choose the
> > > cpu that initiates the work on a non-NUMA machine, which seems
> > > suboptimal.
> > 
> > Good suggestions. I don't have any non-numa systems. But KK ran some
> > tests on a non-numa system. He could see around 20% performance gain for
> > single VMs local host to guest. I hope we can run a full test on a
> > non-numa system.
> > 
> > On a non-numa system, the same per-cpu vhost thread will always be picked
> > consistently for a particular vq since all cores are on the same cpu
> > socket. So there will be two per-cpu vhost threads handling TX/RX
> > simultaneously.
> > 
> > > > The existing implementation of vhost creates one vhost thread per
> > > > device (virtio_net). RX and TX work for a VM's device is handled by
> > > > the same vhost thread.
> > > >
> > > > One limitation of this implementation is that as the number of VMs
> > > > or virtio-net interfaces increases, more vhost threads are created,
> > > > which consumes more kernel resources and induces more thread context
> > > > switches/scheduling overhead. We noticed that the KVM network
> > > > performance doesn't scale with increasing number of VMs.
> > > >
> > > > The other limitation is that with a single vhost thread processing
> > > > both RX and TX, the work will be blocked. So we create this per cpu
> > > > vhost thread implementation. The number of vhost cpu threads is
> > > > limited to the number of cpus on the host.
> > > >
> > > > To address these limitations, we are proposing a per-cpu vhost thread
> > > > model where the number of vhost threads is limited and equal to the
> > > > number of online cpus on the host.
> > > 
> > > The number of vhost threads needs more consideration. Consider a
> > > 1024-core host with a card that has 16 tx/rx queues: do we really
> > > need 1024 vhost threads?
> > 
> > In this case, we could add a module parameter to limit the number of
> > cores/sockets to be used.
> 
> Hmm. And then which cores would we run on?
> Also, is the parameter different between guests?
> Another idea is to scale the # of threads on demand.

If we can pass the number of guests/vcpus to vhost, we can scale the
number of vhost threads accordingly. Is there any API to get this info?
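
To make the on-demand idea concrete, here is a rough Python sketch (not
the patch code). It assumes vhost somehow learns the total number of
vcpus, and the policy of one thread per vcpu, capped by the online cpus
and by an optional module-parameter style limit, is purely hypothetical.
The names vhost_thread_count, total_vcpus and max_threads are made up for
illustration.

```python
def vhost_thread_count(online_cpus, total_vcpus, max_threads=None):
    """Hypothetical policy: one vhost thread per vcpu, capped by the
    number of online cpus and by an optional administrator limit."""
    if online_cpus < 1:
        raise ValueError("need at least one online cpu")
    # Never go below one thread, never above the number of online cpus.
    count = min(online_cpus, max(1, total_vcpus))
    if max_threads is not None:
        # Assumed module-parameter style cap (not an existing vhost knob).
        count = min(count, max(1, max_threads))
    return count


# A 1024-core host whose guests total 16 vcpus would get 16 threads.
print(vhost_thread_count(1024, 16))
# An 8-cpu host with 48 vcpus is capped at 8 threads.
print(vhost_thread_count(8, 48))
```

With a cap like this, a large host would not need one thread per core
unless the guests actually demand it.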


> Sharing the same thread between guests is also an
> interesting approach, if we did this then per-cpu
> won't be so expensive but making this work well
> with cgroups would be a challenge.

Yes, I am now comparing a vhost thread pool shared among guests with the
per-cpu vhost approach.

It's a challenge to work with cgroups anyway.

> 
> > > >
> > > > Based on our testing experience, the vcpus can be scheduled across cpu
> > > > sockets even when the number of vcpus is smaller than the number of
> > > > cores per cpu socket and there is no other activity besides the KVM
> > > > networking workload. We found that if the vhost thread is scheduled on
> > > > the same socket where the work is received, the performance will be
> > > > better.
> > > >
> > > > So in this per cpu vhost thread implementation, a vhost thread is
> > > > selected dynamically based on where the TX/RX work is initiated. A
> > > > vhost thread on the same cpu socket is selected, but not on the same
> > > > cpu as the vcpu/interrupt thread that initiated the TX/RX work.
> > > >
> > > > When we tested this RFC patch, the other interesting thing we found is
> > > > that the performance results also seem related to NIC flow steering.
> > > > We are spending time on evaluating different NICs' flow director
> > > > implementations now. We will enhance this patch based on our findings
> > > > later.
> > > >
> > > > We have tried different scheduling: per-device based, per vq based and
> > > > per work type (tx_kick, rx_kick, tx_net, rx_net) based vhost
> > > > scheduling. We found that so far the per vq based scheduling is good
> > > > enough for now.
> > > 
> > > Could you please explain more about those scheduling strategies? Does
> > > per-device based mean letting a dedicated vhost thread handle all work
> > > from that vhost device? As you mentioned, maybe an improvement of the
> > > scheduling would be to take flow steering info (queue mapping, rxhash
> > > etc.) of skb in host into account.
> > 
> > Yes, per-device scheduling means one per-cpu vhost thread handles all
> > work from one particular vhost-device.
> > 
> > Yes, we think scheduling to take flow steering info would help
> > performance. I am studying this now.
> 
> Did anything interesting turn up?

Not yet, still investigating.

> 
> > > >
> > > > We also tried different algorithms to select which per-cpu vhost
> > > > thread will run on a specific cpu socket: avg_load balance, and
> > > > randomly...
> > > 
> > > It may be worth accounting for out-of-order packets during the test,
> > > since for a single stream a different cpu/vhost/physical queue may be
> > > chosen to do the packet transmission/reception?
> > 
> > Good point. I haven't gone through all data yet. netstat output might
> > tell us something.
> > 
> > We used Intel 10G NIC to run all tests. For a single stream test, the
> > Intel NIC steers the receiving irq to the same irq/queue on which TX
> > packets have been sent. So when we mask vcpus from the same VM on one
> > socket, we shouldn't hit the packet out-of-order case. We might hit
> > packet out of order when vcpus run across sockets.
> > 
> > > >
> > > > From our test results, we found that the scalability has been
> > > > significantly improved. And this patch is also helpful for small
> > > > packets performance.
> > > >
> > > > However, we are seeing some regressions in a local guest to guest
> > > > scenario on a 8 cpu NUMA system.
> > > > In one case, 24 VMs 256 bytes tcp_stream test shows it has improved
> > > > from 810Mb/s to 9.1Gb/s. :)
> > > > (We created two local VMs, and each VM has 2 vcpus. W/o this patch,
> > > > the number of threads is 4 vcpus + 2 vhosts = 6, with this patch it is
> > > > 4 vcpus + 8 vhosts = 12. It causes more context switches. When I
> > > > change the scheduling to use 2-4 vhost threads, the regressions are
> > > > gone. I am continuing to investigate how to make small number of VMs,
> > > > local guest to guest performance better. Once I find the clue, I will
> > > > share here.)
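
The thread counts quoted above can be checked with a small Python
sketch. It is only a worked example: the helper name
runnable_thread_count is made up, and it assumes one vhost thread per VM
without the patch and one per online cpu (8 here) with it.

```python
def runnable_thread_count(num_vms, vcpus_per_vm, vhost_threads):
    """Total threads competing for the host cpus: all vcpu threads
    plus however many vhost threads exist."""
    return num_vms * vcpus_per_vm + vhost_threads


# Without the patch: 2 VMs x 2 vcpus + 2 per-device vhost threads.
without_patch = runnable_thread_count(2, 2, 2)
# With the patch on the 8 cpu host: 2 VMs x 2 vcpus + 8 per-cpu vhost threads.
with_patch = runnable_thread_count(2, 2, 8)
print(without_patch, with_patch)
```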
> 
> So, that's one obvious reason. But there could be other explanations:
> 1. You explicitly mask out the same CPU. But if the socket
>    is very small (it's likely each socket is 2 CPUs or even 1 here),
>    this might limit the scheduler drastically.
Only if we limit the guest vcpus to the same socket. By default, the host
schedules vcpus across sockets.

> 2. If guest ends up running on the same socket, you cause
>    more IPIs which cause exits for the other guest.
I tried different approaches to schedule the vhost thread: 1. check
loadavg on a particular cpu; 2. randomly pick a cpu. The performance
didn't differ much with a small number of VMs.

In Tom's 1-24 VMs scalability test, it had impressive results compared to
the existing approach as the number of VMs increased. So it might not be
a big issue.
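
For readers following along, the selection rule (same socket as the
initiating cpu, but not that cpu itself) and the two policies above can be
sketched in Python as follows. This is an illustration under stated
assumptions, not the patch code: pick_vhost_cpu, the cpu_to_socket map and
the load dictionary are hypothetical, and falling back to the initiating
cpu when the socket has no other cpu is my own assumption.

```python
import random


def pick_vhost_cpu(initiating_cpu, cpu_to_socket, load=None, rng=None):
    """Pick a cpu for the vhost thread on the same socket as the cpu that
    initiated the TX/RX work, excluding the initiating cpu itself."""
    socket = cpu_to_socket[initiating_cpu]
    candidates = sorted(
        cpu for cpu, s in cpu_to_socket.items()
        if s == socket and cpu != initiating_cpu
    )
    if not candidates:
        # Assumed fallback for single-cpu sockets: stay on the same cpu.
        return initiating_cpu
    if load is not None:
        # Policy 1: least loaded candidate (lowest cpu id wins ties).
        return min(candidates, key=lambda cpu: load.get(cpu, 0.0))
    # Policy 2: random candidate on the socket.
    chooser = rng if rng is not None else random
    return chooser.choice(candidates)


# Two sockets with 4 cores each, as in the test host described in this thread.
topology = {cpu: cpu // 4 for cpu in range(8)}
print(pick_vhost_cpu(1, topology, load={0: 0.9, 2: 0.1, 3: 0.5}))
print(pick_vhost_cpu(6, topology, rng=random.Random(0)))
```

On a very small socket the candidate list shrinks quickly, which is the
concern raised in point 1 above.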

> > > >
> > > > The cpu hotplug support isn't in place yet. I will post it later.
> 
> Not yet done, right?

Done now, under testing.

> > > Another question is why not just use a workqueue? It has full support
> > > for cpu hotplug and allows more policies.
> > 
> > Yes, it's good to use workqueue. I just did everything on top of the
> > current implementation so it's easy to compare/analyze the performance
> > data.
> > 
> > I remember the vhost implementation changed from workqueue to thread
> > for some reason. I couldn't recall the reason.
> 
> At the time the implementation didn't perform well with per-cpu
> threads. We wanted a single thread so switched to use just that.
> 
> > > >
> > > > Since we have per cpu vhost thread, each vhost thread will handle
> > > > multiple vqs, so we will be able to reduce/remove vq notification when
> > > > the work is heavily loaded in future.
> > > 
> > > Does this issue still exist if event index is used? If vhost does not
> > > publish new used index, guest would not kick again.
> > 
> > Since the vhost model has been changed to handle multiple VMs' vqs work,
> > then it's not necessary to enable these VMs' vqs notification (published
> > new used index) where these vqs' future work will be processed on the
> > same per-cpu vhost thread, as long as the per-cpu vhost thread is still
> > running.
> > 
> > > >
> > > > Here are my test results for remote host to guest tests: tcp_rrs,
> > > > udp_rrs and tcp_stream. The guest has 2 vcpus; the host has two cpu
> > > > sockets, each with 4 cores.
> > > >
> > > > TCP_STREAM            256     512     1K      2K      4K      8K      16K
> > > > --------------------------------------------------------------------------
> > > > Original H->Guest     2501    4238    4744    5256    7203    6975    5799
> > > > Patch    H->Guest     1676    2290    3149    8026    8439    8283    8216
> > > >
> > > > Original Guest->Host  744     1773    5675    1397    8207    7296    8117
> > > > Patch    Guest->Host  1041    1386    5407    7057    8298    8127    8241
> > > 
> > > Looks like there's some noise in the result, the throughput of
> > > "original guest -> Host 2K" looks too low. And something strange is
> > > that I see regressions of packet transmission of guest when testing
> > > this patch (Guest to Local Host TCP_STREAM in a NUMA machine).
> > 
> > Yes, since I didn't mask the vcpus on the same socket, it might come
> > from packets out of order. I will rerun the test with masking vcpus on
> > the same socket to see any difference.
> 
> Did anything interesting turn up?

I haven't had time to focus on the single stream results yet.

> 
> > You can reference Tom's results. His test is more formal than mine.
> > 
> > > >
> > > > 60 instances TCP_RRs: Patch 150K trans/s vs. 91K trans/sec
> > > > 65%  improved with taskset vcpus on the same socket
> > > > 60 instances UDP_RRs: Patch 172K trans/s vs. 103K trans/s
> > > > 67%  improved with taskset vcpus on the same socket
> > > >
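
The improvement percentages quoted above follow directly from the
transaction rates. A minimal Python check, where the helper name
improvement_percent is made up for illustration:

```python
def improvement_percent(baseline, patched):
    """Relative gain of the patched result over the baseline, in percent."""
    return (patched - baseline) / baseline * 100.0


# 60 instances TCP_RR: 91K -> 150K trans/s
tcp_rr_gain = improvement_percent(91, 150)
# 60 instances UDP_RR: 103K -> 172K trans/s
udp_rr_gain = improvement_percent(103, 172)
print(round(tcp_rr_gain), round(udp_rr_gain))
```

Rounded to whole percent, these come out to the 65% and 67% reported.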
> > > > Tom has run 1VM to 24 VMs test for different work. He will post it
> > > > here soon.
> > > >
> > > > If the host scheduler ensures that the VM's vcpus are not scheduled to
> > > > another socket (i.e. cpu mask the vcpus on same socket) then the
> > > > performance will be better.
> > > >
> > > > Signed-off-by: Shirley Ma<xma@...ibm.com>
> > > > Signed-off-by: Krishna Kumar<krkumar2@...ibm.com>
> > > > Tested-by: Tom Lendacky<toml@...ibm.com>
> > > > ---
> > > >
> > > >   drivers/vhost/net.c                  |   26 ++-
> > > >   drivers/vhost/vhost.c                |  289 +++++++++++++++++++++++----------
> > > >   drivers/vhost/vhost.h                |   16 ++-
> > > >   3 files changed, 232 insertions(+), 103 deletions(-)
> > > >
> > > > Thanks
> > > > Shirley
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > > > the body of a message to majordomo@...r.kernel.org
> > > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> Also a question: how does this interact with zero copy tx? 

Yes, I tested this with zero copy tx. The vhost thread which handles tx
work has been significantly reduced.

Thanks
Shirley
