Message-ID: <1351599420.23105.14.camel@oc6622382223.ibm.com>
Date: Tue, 30 Oct 2012 07:17:00 -0500
From: Andrew Theurer <habanero@...ux.vnet.ibm.com>
To: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@...radead.org>,
"H. Peter Anvin" <hpa@...or.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Ingo Molnar <mingo@...hat.com>, Avi Kivity <avi@...hat.com>,
Rik van Riel <riel@...hat.com>,
Srikar <srikar@...ux.vnet.ibm.com>,
"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
Chegu Vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
Gleb Natapov <gleb@...hat.com>,
Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
> In some special scenarios, such as #vcpu <= #pcpu, the PLE handler can
> be very costly, because there is no need to iterate over the vcpus,
> and each unsuccessful yield_to() only burns CPU.
>
> Similarly, when we have a large number of small guests, it is
> possible that a spinning vcpu fails to yield_to() any vcpu of the
> same VM and goes back to spinning. This is also ineffective when we
> are over-committed. Instead, we do a yield() so that other VMs get a
> chance to run.
>
> This patch series tries to optimize the above scenarios.
>
> The first patch optimizes yield_to() by bailing out when there is no
> need to continue (i.e., when there is only one task on both the
> source and the target rq).
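>
> For reference, the bail-out amounts to roughly the following inside
> yield_to() (a minimal sketch of the idea; the label name is
> illustrative, the exact change is in patch 1):
>
> 	/* kernel/sched/core.c, in yield_to(), with both rq locks held */
> 	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> 		/* nothing else to run on either cpu; yielding cannot help */
> 		yielded = -ESRCH;
> 		goto out_unlock;
> 	}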
>
> The second patch uses that result in the PLE handler.
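>
> Roughly, the candidate loop in kvm_vcpu_on_spin() can then stop early
> (a sketch; the retry counter 'try' is illustrative of what the patch
> does, not the exact diff):
>
> 	yielded = kvm_vcpu_yield_to(vcpu);
> 	if (yielded > 0)
> 		break;		/* successful directed yield */
> 	if (yielded < 0 && !--try)
> 		break;		/* repeated -ESRCH: likely undercommitted */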
>
> The third patch uses overall system load knowledge to decide whether
> to continue in the yield_to handler, and also whether to yield() in
> overcommit scenarios. To be precise,
> * loadavg is converted to a scale of 2048 per CPU
> * a load value of less than 1024 is considered undercommit, and we
> return from the PLE handler in those cases
> * a load value of greater than 3584 (1.75 * 2048) is considered
> overcommit, and we yield to other VMs in such cases.
>
> (let threshold = 2048)
> Rationale for using threshold/2 as the undercommit limit:
> Requiring a load below (0.5 * threshold) avoids (a concern raised by
> Rik) scenarios where we still have a preempted lock-holder vcpu
> waiting to be scheduled. (That scenario arises when the rq length is
> > 1 even though we are undercommitted.)
>
> Rationale for using (1.75 * threshold) for the overcommit case:
> This is a heuristic: at that load we should probably see rq length > 1,
> with a vcpu of a different VM waiting to be scheduled.
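>
> Put together, the decision in the PLE handler would look roughly like
> this (a sketch of patch 3's idea; the macro names are made up here,
> and avenrun[] is the kernel's fixed-point load average, already
> scaled by FIXED_1 = 2048):
>
> 	#define COMMIT_SCALE	2048			/* load 1.00 per cpu */
> 	#define UNDERCOMMIT	(COMMIT_SCALE / 2)	/* 0.50 -> 1024 */
> 	#define OVERCOMMIT	(COMMIT_SCALE * 7 / 4)	/* 1.75 -> 3584 */
>
> 	unsigned long load = avenrun[0] / num_online_cpus();
>
> 	if (load < UNDERCOMMIT)
> 		return;		/* undercommit: spinning is cheaper */
> 	if (load > OVERCOMMIT)
> 		yield();	/* overcommit: let other VMs run */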
>
> Related future work (independent of this series):
>
> - Dynamically changing PLE window depending on system load.
>
> Results on a 3.7.0-rc1 kernel show around a 146% improvement for
> ebizzy 1x on a 32-core PLE machine with a 32-vcpu guest.
> I believe we should see very good improvements for overcommit
> (especially > 2) on large machines with small guests. (I could not
> test this, as I do not have access to a bigger machine.)
>
> base = 3.7.0-rc1
> machine: 32 core mx3850 x5 PLE mc
>
> --+-----------+-----------+-----------+------------+-----------+
>             ebizzy (rec/sec, higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>          base       stdev     patched        stdev    %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x  2543.3750     20.2903   6279.3750      82.5226   146.89143
> 2x  2410.8750     96.4327   2450.7500     207.8136     1.65396
> 3x  2184.9167    205.5226   2178.3333      97.2034    -0.30131
> --+-----------+-----------+-----------+------------+-----------+
>
> --+-----------+-----------+-----------+------------+-----------+
>          dbench (throughput in MB/sec, higher is better)
> --+-----------+-----------+-----------+------------+-----------+
>          base       stdev     patched        stdev    %improve
> --+-----------+-----------+-----------+------------+-----------+
> 1x  5545.4330    596.4344   7042.8510    1012.0924    27.00272
> 2x  1993.0970     43.6548   1990.6200      75.7837    -0.12428
> 3x  1295.3867     22.3997   1315.5208      36.0075     1.55429
> --+-----------+-----------+-----------+------------+-----------+
Could you include a PLE-off result for the 1x over-commit case, so we
know what the best possible result is?

Looks like skipping the yield_to() when the rq length is 1 helps, but
I'd like to know if the performance is the same as PLE off for 1x. I
am concerned that the vcpu-to-task lookup is still expensive.
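(What I mean by the lookup cost: even when yield_to() now bails out
early, each PLE exit still does, per candidate vcpu, something like the
following sketch of kvm_vcpu_yield_to(), and none of it is free:)

	/* virt/kvm/kvm_main.c: per-candidate work, sketched */
	struct pid *pid;
	struct task_struct *task = NULL;
	int ret = 0;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);	/* takes a task ref */
	rcu_read_unlock();
	if (task) {
		ret = yield_to(task, 1);		/* takes both rq locks */
		put_task_struct(task);
	}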
Based on Peter's comments, I would say the 3rd patch and the 2x and 3x
results are not conclusive at this time.
I think we should also discuss what we think a good target is. We
should know what our high-water mark is, and IMO, if we cannot get
close to it, then I do not feel we are heading down the right path. For
example, if dbench aggregate throughput for 1x with PLE off is 10000
MB/sec, then the best possible 2x or 3x result should be only a little
lower than that, due to context switching of the vcpus and sharing of
caches. This should be quite evident with the current PLE handler and
smaller VMs (10 vcpus or fewer).
>
> Changes since V1:
> - Discard the idea of exporting nr_running and optimize in the core scheduler (Peter)
> - Use yield() instead of schedule() in overcommit scenarios (Rik)
> - Use loadavg knowledge to detect undercommit/overcommit
>
> Peter Zijlstra (1):
> Bail out of yield_to when source and target runqueue has one task
>
> Raghavendra K T (2):
> Handle yield_to failure return for potential undercommit case
> Check system load and handle different commit cases accordingly
>
> Please let me know your comments and suggestions.
>
> Link for V1:
> https://lkml.org/lkml/2012/9/21/168
>
> kernel/sched/core.c | 25 +++++++++++++++++++------
> virt/kvm/kvm_main.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++----------
> 2 files changed, 65 insertions(+), 16 deletions(-)
-Andrew Theurer