Message-ID: <5090C6F2.5030103@linux.vnet.ibm.com>
Date: Wed, 31 Oct 2012 12:06:34 +0530
From: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To: habanero@...ux.vnet.ibm.com, Avi Kivity <avi@...hat.com>
CC: Peter Zijlstra <peterz@...radead.org>,
"H. Peter Anvin" <hpa@...or.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
Srikar <srikar@...ux.vnet.ibm.com>,
"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
Chegu Vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
Gleb Natapov <gleb@...hat.com>,
Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
On 10/30/2012 05:47 PM, Andrew Theurer wrote:
> On Mon, 2012-10-29 at 19:36 +0530, Raghavendra K T wrote:
>> In some special scenarios, such as #vcpu <= #pcpu, the PLE handler may
>> prove very costly, because there is no need to iterate over vcpus
>> and do unsuccessful yield_to() calls, burning CPU.
>>
>> Similarly, when we have a large number of small guests, it is
>> possible that a spinning vcpu fails to yield_to() any vcpu of the same
>> VM and goes back to spinning. This is also not effective when we are
>> over-committed. Instead, we do a yield() so that we give other VMs a
>> chance to run.
>>
>> This patch series tries to optimize the above scenarios.
>>
>> The first patch optimizes yield_to() by bailing out when there is
>> no need to continue the yield_to (i.e., when there is only one task
>> on both the source and the target rq).
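(For reference, a minimal sketch of the kind of early bail-out described
above; this is not the actual diff. It assumes the check lives where
struct rq is visible, i.e. kernel/sched/core.c, and the helper name and
return convention are made up for illustration.)

/*
 * Hedged sketch, not the actual patch: a directed yield cannot make
 * useful progress when the yielder and the target are the only
 * runnable tasks on their respective runqueues.
 */
static bool yield_to_can_help(struct rq *src_rq, struct rq *dst_rq)
{
        /* One task on each rq: nothing to gain from yielding. */
        if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
                return false;

        return true;
}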
>>
>> The second patch uses that in the PLE handler.
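(Again only a hedged sketch of how the PLE handler could use that
result; it is meant to sit inside kvm_vcpu_on_spin() in
virt/kvm/kvm_main.c, and treating a negative return from the yield as
"stop scanning" is an assumed convention, not the actual diff.)

/* Hedged sketch: stop the vcpu scan once a directed yield cannot help. */
kvm_for_each_vcpu(i, vcpu, kvm) {
        if (vcpu == me)
                continue;

        yielded = kvm_vcpu_yield_to(vcpu);
        if (yielded > 0)
                break;  /* successfully yielded to a runnable vcpu */
        if (yielded < 0)
                break;  /* assumed: only one task on source/target rq, give up */
}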
>>
>> The third patch uses overall system load knowledge to decide whether
>> to continue in the yield_to handler, and also to yield in overcommit
>> cases. To be precise,
>> * loadavg is converted to a per-CPU value on a scale of 2048
>> * a load value of less than 1024 is considered undercommit, and we
>>   return from the PLE handler in those cases
>> * a load value of greater than 3586 (~1.75 * 2048) is considered
>>   overcommit, and we yield to other VMs in such cases.
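(A minimal sketch of how such a load check might look, purely for
illustration: get_avenrun(), FIXED_1 (2048), num_online_cpus() and
yield() are existing kernel interfaces, but the helper name, the
constants as written and calling yield() from this helper are
assumptions, not the actual patch.)

#include <linux/sched.h>        /* get_avenrun(), yield() */
#include <linux/cpumask.h>      /* num_online_cpus() */

#define PLE_UNDERCOMMIT_LOAD    1024    /* 0.5 * 2048 */
#define PLE_OVERCOMMIT_LOAD     3586    /* ~1.75 * 2048 */

/* Decide what the PLE handler should do based on per-CPU load. */
static bool ple_do_directed_yield(void)
{
        unsigned long load[3];

        get_avenrun(load, 0, 0);        /* 1/5/15-min averages, FIXED_1 == 2048 scale */
        load[0] /= num_online_cpus();   /* per-CPU load on the 2048 scale */

        if (load[0] < PLE_UNDERCOMMIT_LOAD)
                return false;           /* undercommit: return from the PLE handler */

        if (load[0] > PLE_OVERCOMMIT_LOAD) {
                yield();                /* overcommit: give other VMs a chance to run */
                return false;
        }

        return true;                    /* otherwise do the usual directed-yield scan */
}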
>>
>> (let threshold = 2048)
>> Rationale for using threshold/2 as the undercommit limit:
>> Requiring a load below (0.5 * threshold) avoids (a concern raised by
>> Rik) scenarios where we still have a preempted lock-holder vcpu waiting
>> to be scheduled. (That scenario arises when the rq length is > 1 even
>> though we are undercommitted.)
>>
>> Rationale for using (1.75 * threshold) for the overcommit scenario:
>> This is a heuristic; at that load we should probably see rq length > 1,
>> with a vcpu of a different VM waiting to be scheduled.
>>
>> Related future work (independent of this series):
>>
>> - Dynamically changing the PLE window depending on system load.
>>
>> Results on the 3.7.0-rc1 kernel show around a 146% improvement for
>> ebizzy 1x on a 32-core PLE machine with a 32-vcpu guest.
>> I believe we should see very good improvements for overcommit
>> (especially > 2) on large machines with small-vcpu guests. (I could not
>> test this as I do not have access to a bigger machine.)
>>
>> base = 3.7.0-rc1
>> machine: 32 core mx3850 x5 PLE mc
>>
>> --+-----------+-----------+-----------+------------+-----------+
>> ebizzy (rec/sec, higher is better)
>> --+-----------+-----------+-----------+------------+-----------+
>> base stdev patched stdev %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x 2543.3750 20.2903 6279.3750 82.5226 146.89143
>> 2x 2410.8750 96.4327 2450.7500 207.8136 1.65396
>> 3x 2184.9167 205.5226 2178.3333 97.2034 -0.30131
>> --+-----------+-----------+-----------+------------+-----------+
>>
>> --+-----------+-----------+-----------+------------+-----------+
>> dbench (throughput in MB/sec, higher is better)
>> --+-----------+-----------+-----------+------------+-----------+
>> base stdev patched stdev %improve
>> --+-----------+-----------+-----------+------------+-----------+
>> 1x 5545.4330 596.4344 7042.8510 1012.0924 27.00272
>> 2x 1993.0970 43.6548 1990.6200 75.7837 -0.12428
>> 3x 1295.3867 22.3997 1315.5208 36.0075 1.55429
>> --+-----------+-----------+-----------+------------+-----------+
>
> Could you include a PLE-off result for 1x over-commit, so we know what
> the best possible result is?
Yes. Base with PLE off:
ebizzy_1x 7651.3000 rec/sec
ebizzy_2x 51.5000 rec/sec
For ebizzy we are closer.
dbench_1x 12631.4210 MB/sec
dbench_2x 45.0842 MB/sec
(Strangely, the dbench 1x result is sometimes not consistent despite 10
runs of 3 min each with a 30 sec warmup on a 3G tmpfs, but it surely
shows the trend.)
>
> Looks like skipping the yield_to() for rq = 1 helps, but I'd like to
> know if the performance is the same as PLE off for 1x. I am concerned
> the vcpu to task lookup is still expensive.
>
Yes. I still see that.
> Based on Peter's comments I would say the 3rd patch and the 2x,3x
> results are not conclusive at this time.
Avi, IMO patches 1 and 2 seem good to go. Please let me know.
>
> I think we should also discuss what we think a good target is. We
> should know what our high-water mark is, and IMO, if we cannot get
> close, then I do not feel we are heading down the right path. For
> example, if dbench aggregate throughput for 1x with PLE off is 10000
> MB/sec, then the best possible 2x,3x result should be a little lower
> than that due to task switching of the vcpus and sharing of caches. This
> should be quite evident with current PLE handler and smaller VMs (like
> 10 vcpus or less).
Very much agree here. If we look at the 2x/3x results (all/any of them),
the aggregate is not near 1x. Maybe even 70% is a good target.