Message-ID: <20101005182323.GA25852@redhat.com>
Date: Tue, 5 Oct 2010 20:23:23 +0200
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Krishna Kumar2 <krkumar2@...ibm.com>
Cc: anthony@...emonkey.ws, arnd@...db.de, avi@...hat.com,
davem@...emloft.net, kvm@...r.kernel.org, netdev@...r.kernel.org,
rusty@...tcorp.com.au
Subject: Re: [v2 RFC PATCH 0/4] Implement multiqueue virtio-net
On Tue, Oct 05, 2010 at 04:10:00PM +0530, Krishna Kumar2 wrote:
> "Michael S. Tsirkin" <mst@...hat.com> wrote on 09/19/2010 06:14:43 PM:
>
> > Could you document how exactly you measure multistream bandwidth:
> > netperf flags, etc.?
>
> All results were without any netperf flags or system tuning:
> for i in $list
> do
> netperf -c -C -l 60 -H 192.168.122.1 > /tmp/netperf.$$.$i &
> done
> wait
> Another script processes the result files. It also displays the
> start time/end time of each iteration to make sure skew due to
> parallel netperfs is minimal.
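> Roughly, the summing part of the processing looks like this (a simplified
> sketch - the result-file prefix and the throughput field position are
> illustrative; the real script also prints per-instance start/end times):
> total=0
> for f in /tmp/netperf.12345.*
> do
> bw=`tail -1 $f | awk '{print $5}'`    # throughput column of TCP_STREAM output
> total=`echo "$total + $bw" | bc`
> done
> echo "aggregate BW: $total"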
>
> I changed the vhost functionality once more to try to get the
> best model; the new model is:
> 1. #numtxqs=1 -> #vhosts=1; this single thread handles both RX and TX.
> 2. #numtxqs>1 -> vhost[0] handles RX and vhost[1-MAX] handle
>    TX[0-n], where MAX is 4. Beyond numtxqs=4, the remaining TX
>    queues are handled by the vhost threads in round-robin fashion.
>
> Results from here on are with these changes, and the only "tuning" is
> to set each vhost's affinity to CPUs[0-3] ("taskset -p f <vhost-pids>").
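> In other words, roughly the following (a sketch; this assumes the vhost
> workers show up as "vhost-<qemu-pid>" kernel threads, and the qemu pid
> below is illustrative):
> qemu_pid=12345
> for p in `pgrep "vhost-$qemu_pid"`
> do
> taskset -p f $p    # mask 0xf = CPUs 0-3
> done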
>
> > Any idea where this comes from?
> > Do you see more TX interrupts? RX interrupts? Exits?
> > Do interrupts bounce more between guest CPUs?
> > 4. Identify reasons for single netperf BW regression.
>
> After testing various combinations of #txqs, #vhosts, #netperf
> sessions, I think the drop for 1 stream is due to TX and RX for
> a flow being processed on different cpus.
Right. Can we fix it?
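BTW, a quick way to see whether the interrupts bounce between guest CPUs is
to sample the per-CPU counts in the guest (assuming the virtio vectors are
labelled "virtio*" there), e.g.:
grep virtio /proc/interrupts; sleep 10; grep virtio /proc/interrupts
and kvm_stat on the host gives the exit counts.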
> I did two more tests:
> 1. Pin the vhosts to the same CPU:
>    - BW drop is much lower for the 1-stream case (-5 to -8% range)
>    - But performance is not as high for more sessions.
> 2. Changed vhost to be single-threaded:
>    - No degradation for 1 session, and improvement for up to
>      8, sometimes 16, streams (5-12%).
>    - BW degrades after that, all the way up to 128 netperf sessions.
>    - But overall CPU utilization improves.
> Summary of the entire run (for 1-128 sessions):
> txq=4: BW: (-2.3) CPU: (-16.5) RCPU: (-5.3)
> txq=16: BW: (-1.9) CPU: (-24.9) RCPU: (-9.6)
>
> I don't see any of the reasons mentioned above. However, for higher
> numbers of netperf sessions, I see a big increase in retransmissions:
Hmm, ok, and do you see any errors?
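Something like the below on both sides would show whether these are
device-level drops or purely TCP-level losses (interface name is
illustrative, and not every driver implements ethtool -S):
netstat -s | grep -i -e retrans -e loss
ip -s link show eth0
ethtool -S eth0 | grep -i -e err -e drop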
> _______________________________________
> #netperf ORG NEW
> BW (#retr) BW (#retr)
> _______________________________________
> 1 70244 (0) 64102 (0)
> 4 21421 (0) 36570 (416)
> 8 21746 (0) 38604 (148)
> 16 21783 (0) 40632 (464)
> 32 22677 (0) 37163 (1053)
> 64 23648 (4) 36449 (2197)
> 128 23251 (2) 31676 (3185)
> _______________________________________
>
> The single-netperf case didn't have any retransmissions, so that is not
> the cause of the drop. I also tested ixgbe (MQ):
> ___________________________________________________________
> #netperf ixgbe ixgbe (pin intrs to cpu#0 on
> both server/client)
> BW (#retr) BW (#retr)
> ___________________________________________________________
> 1 3567 (117) 6000 (251)
> 2 4406 (477) 6298 (725)
> 4 6119 (1085) 7208 (3387)
> 8 6595 (4276) 7381 (15296)
> 16 6651 (11651) 6856 (30394)
Interesting.
You are saying we get much more retransmissions with physical nic as
well?
> ___________________________________________________________
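Just so we are comparing the same thing: I assume "pin intrs to cpu#0"
means writing the CPU mask to smp_affinity, along the lines of (matching
IRQs by interface name, illustrative only):
for irq in `grep eth0 /proc/interrupts | cut -d: -f1`
do
echo 1 > /proc/irq/$irq/smp_affinity    # mask 0x1 = cpu#0
done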
>
> > 5. Test perf in more scenarios:
> > small packets
>
> 512-byte packets - BW drops for up to 8 (sometimes 16) netperf sessions,
> but improves as the number of sessions increases:
> _______________________________________________________________________________
> # BW1 BW2 (%) CPU1 CPU2 (%) RCPU1 RCPU2 (%)
> _______________________________________________________________________________
> 1 4043 3800 (-6.0) 50 50 (0) 86 98 (13.9)
> 2 8358 7485 (-10.4) 153 178 (16.3) 230 264 (14.7)
> 4 20664 13567 (-34.3) 448 490 (9.3) 530 624 (17.7)
> 8 25198 17590 (-30.1) 967 1021 (5.5) 1085 1257 (15.8)
> 16 23791 24057 (1.1) 1904 2220 (16.5) 2156 2578 (19.5)
> 24 23055 26378 (14.4) 2807 3378 (20.3) 3225 3901 (20.9)
> 32 22873 27116 (18.5) 3748 4525 (20.7) 4307 5239 (21.6)
> 40 22876 29106 (27.2) 4705 5717 (21.5) 5388 6591 (22.3)
> 48 23099 31352 (35.7) 5642 6986 (23.8) 6475 8085 (24.8)
> 64 22645 30563 (34.9) 7527 9027 (19.9) 8619 10656 (23.6)
> 80 22497 31922 (41.8) 9375 11390 (21.4) 10736 13485 (25.6)
> 96 22509 32718 (45.3) 11271 13710 (21.6) 12927 16269 (25.8)
> 128 22255 32397 (45.5) 15036 18093 (20.3) 17144 21608 (26.0)
> _______________________________________________________________________________
> SUM: BW: (16.7) CPU: (20.6) RCPU: (24.3)
> _______________________________________________________________________________
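(I assume the 512 byte case is plain TCP_STREAM with the send size reduced
on the test-specific side, i.e. something like
netperf -c -C -l 60 -H 192.168.122.1 -- -m 512
rather than a smaller MTU?)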
>
> > host -> guest
> _______________________________________________________________________________
> # BW1 BW2 (%) CPU1 CPU2 (%) RCPU1 RCPU2 (%)
> _______________________________________________________________________________
> *1 70706 90398 (27.8) 300 327 (9.0) 140 175 (25.0)
> 2 20951 21937 (4.7) 188 196 (4.2) 93 103 (10.7)
> 4 19952 25281 (26.7) 397 496 (24.9) 210 304 (44.7)
> 8 18559 24992 (34.6) 802 1010 (25.9) 439 659 (50.1)
> 16 18882 25608 (35.6) 1642 2082 (26.7) 953 1454 (52.5)
> 24 19012 26955 (41.7) 2465 3153 (27.9) 1452 2254 (55.2)
> 32 19846 26894 (35.5) 3278 4238 (29.2) 1914 3081 (60.9)
> 40 19704 27034 (37.2) 4104 5303 (29.2) 2409 3866 (60.4)
> 48 19721 26832 (36.0) 4924 6418 (30.3) 2898 4701 (62.2)
> 64 19650 26849 (36.6) 6595 8611 (30.5) 3975 6433 (61.8)
> 80 19432 26823 (38.0) 8244 10817 (31.2) 4985 8165 (63.7)
> 96 20347 27886 (37.0) 9913 13017 (31.3) 5982 9860 (64.8)
> 128 19108 27715 (45.0) 13254 17546 (32.3) 8153 13589 (66.6)
> _______________________________________________________________________________
> SUM: BW: (32.4) CPU: (30.4) RCPU: (62.6)
> _______________________________________________________________________________
> *: Sum over 7 iterations, remaining test cases are sum over 2 iterations
>
> > guest <-> external
>
> I haven't done this yet since I don't have a setup. I guess it
> would be limited by wire speed and the gains may not be there. I
> will try this later when I get the setup.
OK, but at least we need to check that it does not hurt things.
> > in last case:
> > find some other way to measure host CPU utilization,
> > try multiqueue and single queue devices
> > 6. Use above to figure out what is a sane default for numtxqs
>
> A. Summary for default I/O (16K):
> #txqs=2 (#vhost=3): BW: (37.6) CPU: (69.2) RCPU: (40.8)
> #txqs=4 (#vhost=5): BW: (36.9) CPU: (60.9) RCPU: (25.2)
> #txqs=8 (#vhost=5): BW: (41.8) CPU: (50.0) RCPU: (15.2)
> #txqs=16 (#vhost=5): BW: (40.4) CPU: (49.9) RCPU: (10.0)
>
> B. Summary for 512 byte I/O:
> #txqs=2 (#vhost=3): BW: (31.6) CPU: (35.7) RCPU: (28.6)
> #txqs=4 (#vhost=5): BW: (5.7) CPU: (27.2) RCPU: (22.7)
> #txqs=8 (#vhost=5): BW: (-0.6) CPU: (25.1) RCPU: (22.5)
> #txqs=16 (#vhost=5): BW: (-6.6) CPU: (24.7) RCPU: (21.7)
>
> Summary:
>
> 1. The average BW increase for regular I/O is best for #txqs=16, with the
>    least CPU utilization increase.
> 2. The average BW for 512-byte I/O is best for the lower #txqs=2. For higher
>    #txqs, BW increases only beyond a certain number of netperf sessions - in
>    my testing that threshold was 32 sessions.
> 3. Multiple txqs for the guest by itself doesn't seem to have any issues.
>    The guest CPU% increase is slightly higher than the BW improvement. I
>    think this is true for all mq drivers, since more paths run in parallel
>    up to the device instead of sleeping and letting one thread send all
>    packets via qdisc_restart.
> 4. Having a high number of txqs gives better gains and reduces CPU
>    utilization on both the guest and the host.
> 5. MQ is intended for server loads. MQ should probably not be explicitly
> specified for client systems.
> 6. No regression with numtxqs=1 (or if mq option is not used) in any
> testing scenario.
Of course txq=1 can be considered a kind of fix, but if we know the
issue is TX/RX flows getting bounced between CPUs, can we fix this?
Workload-specific optimizations can only get us this far.
>
> I will send the v3 patch within a day after some more testing.
>
> Thanks,
>
> - KK
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html