linux-kernel - Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

Message-ID: <50643745.6010202@linux.vnet.ibm.com>
Date:	Thu, 27 Sep 2012 16:53:49 +0530
From:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To:	Avi Kivity <avi@...hat.com>, Peter Zijlstra <peterz@...radead.org>
CC:	"H. Peter Anvin" <hpa@...or.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
	Srikar <srikar@...ux.vnet.ibm.com>,
	"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
	KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
	chegu vinod <chegu_vinod@...com>,
	"Andrew M. Theurer" <habanero@...ux.vnet.ibm.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
	Gleb Natapov <gleb@...hat.com>,
	Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios
 in PLE handler

On 09/27/2012 02:06 PM, Avi Kivity wrote:
> On 09/25/2012 03:40 PM, Raghavendra K T wrote:
>> On 09/24/2012 07:46 PM, Raghavendra K T wrote:
>>> On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
>>>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
>>>>> However Rik had a genuine concern in the cases where runqueue is not
>>>>> equally distributed and lockholder might actually be on a different run
>>>>> queue but not running.
>>>>
>>>> Load should eventually get distributed equally -- that's what the
>>>> load-balancer is for -- so this is a temporary situation.
>>>>
>>>> We already try and favour the non running vcpu in this case, that's what
>>>> yield_to_task_fair() is about. If its still not eligible to run, tough
>>>> luck.
>>>
>>> Yes, I agree.
>>>
>>>>
>>>>> Do you think instead of using rq->nr_running, we could get a global
>>>>> sense of load using avenrun (something like avenrun/num_onlinecpus)
>>>>
>>>> To what purpose? Also, global stuff is expensive, so you should try and
>>>> stay away from it as hard as you possibly can.
>>>
>>> Yes, that concern only had made me to fall back to rq->nr_running.
>>>
>>> Will come back with the result soon.
>>
>> Here are the results with the patches:
>>
>> Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with
>> 1x and 2x overcommits
>>
>> Base = 3.6.0-rc5 + ple handler optimization patches
>> A = Base + checking rq->nr_running in vcpu_on_spin() patch
>> B = Base + checking rq->nr_running in sched/core
>> C = Base - PLE
>>
>> ---+-----------+-----------+-----------+-----------+
>>     |    Ebizzy result (rec/sec higher is better)   |
>> ---+-----------+-----------+-----------+-----------+
>>     |    Base   |     A     |      B    |     C     |
>> ---+-----------+-----------+-----------+-----------+
>> 1x | 2374.1250 | 7273.7500 | 5690.8750 |  7364.3750|
>> 2x | 2536.2500 | 2458.5000 | 2426.3750 |    48.5000|
>> ---+-----------+-----------+-----------+-----------+
>>
>>     % improvements w.r.t BASE
>> ---+------------+------------+------------+
>>     |      A     |    B       |     C      |
>> ---+------------+------------+------------+
>> 1x | 206.37603  |  139.70410 |  210.19323 |
>> 2x | -3.06555   |  -4.33218  |  -98.08773 |
>> ---+------------+------------+------------+
>>
>> With approach A we get almost all of the benefit of the PLE-disabled
>> case. With patch B the gain drops a bit, because we still iterate over
>> the vcpus until we decide to do a directed yield.
>
> This gives us a good case for tracking preemption on a per-vm basis.  As
> long as we aren't preempted, we can keep the PLE window high, and also
> return immediately from the handler without looking for candidates.
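
To check that I am reading the suggestion above correctly, here is a rough
user-space sketch of per-VM preemption tracking. It is not KVM code: the
struct, the function names and the two window values are all made up for
illustration. The idea it shows is only this: count how many vcpus of the VM
are currently scheduled out, keep the PLE window high while that count is
zero, and let the PLE handler return at once in that case.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative values only, not KVM defaults or tunables. */
#define PLE_WINDOW_LOW   4096u
#define PLE_WINDOW_HIGH  16384u

/* Hypothetical per-VM state; nothing here is an existing kernel struct. */
struct vm_sketch {
	int preempted_vcpus;      /* vcpus of this VM currently scheduled out */
	unsigned int ple_window;  /* window the VM would run with */
};

/* Would be called from a sched-out (preempt notifier style) hook. */
static void vcpu_sched_out(struct vm_sketch *vm)
{
	vm->preempted_vcpus++;
	vm->ple_window = PLE_WINDOW_LOW;   /* someone is preempted: exit early */
}

/* Would be called when a preempted vcpu gets back on a cpu. */
static void vcpu_sched_in(struct vm_sketch *vm)
{
	assert(vm->preempted_vcpus > 0);
	if (--vm->preempted_vcpus == 0)
		vm->ple_window = PLE_WINDOW_HIGH;  /* nobody preempted: spin longer */
}

/* PLE handler: only look for a yield candidate if some vcpu is preempted. */
static bool ple_handler_should_scan(const struct vm_sketch *vm)
{
	return vm->preempted_vcpus > 0;
}
```

With this, a VM whose vcpus are all running never pays for the candidate
search, which is the undercommit case in the numbers above.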

1) So do you think the deferred-preemption patch (which Vatsa mentioned
a long time back) is also worth trying, so that we reduce the chance of
lock-holder preemption (LHP)?

IIRC, with deferred preemption:
we will have a hook in the spinlock/unlock path to measure the depth of
locks held, and that depth is shared with the host scheduler (maybe via
MSRs now). The host scheduler then 'prefers' not to preempt a
lock-holding vcpu (or rather gives it, say, one extra chance).
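
A minimal sketch of how I remember the deferred-preemption idea, written as
plain user-space C. Everything here is an assumption for illustration: the
shared structure, its field names and the "one extra chance" policy are not
taken from any posted patch, and the real sharing mechanism (MSR or
otherwise) is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical area shared between a guest vcpu and the host scheduler. */
struct vcpu_shared_sketch {
	int lock_depth;      /* spinlocks currently held by this vcpu */
	bool deferred_once;  /* host has already granted the extra chance */
};

/* Guest side: hooks in the spinlock lock/unlock paths. */
static void guest_spin_lock_hook(struct vcpu_shared_sketch *s)
{
	s->lock_depth++;
}

static void guest_spin_unlock_hook(struct vcpu_shared_sketch *s)
{
	assert(s->lock_depth > 0);
	s->lock_depth--;
}

/*
 * Host side: called when the scheduler would like to preempt the vcpu.
 * Returns true if the vcpu should be preempted now, false if it gets
 * one more chance because it is holding a lock.
 */
static bool host_should_preempt(struct vcpu_shared_sketch *s)
{
	if (s->lock_depth > 0 && !s->deferred_once) {
		s->deferred_once = true;   /* grant exactly one deferral */
		return false;
	}
	s->deferred_once = false;          /* reset for the next time slice */
	return true;
}
```

The single deferral keeps a guest from holding the cpu indefinitely just by
staying inside a lock.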

2) Looking at the result (comparing A & C), I do feel we have
significant overhead in iterating over vcpus (even when compared to the
vmexit itself), so would we still need the undercommit fix suggested by
PeterZ (which improves things by about 140%)?
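
For reference, the undercommit check I have in mind looks roughly like the
sketch below. It is a stand-alone model, not the patch itself: the structs
and the function are hypothetical, and only the test on nr_running mirrors
what was discussed. If the spinning vcpu is alone on its runqueue, we assume
undercommit and skip the walk over the other vcpus.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a runqueue and a vcpu. */
struct rq_sketch {
	unsigned int nr_running;   /* runnable tasks on this runqueue */
};

struct vcpu_sketch {
	struct rq_sketch *rq;      /* runqueue the vcpu thread sits on */
	int preempted;             /* nonzero if the vcpu is scheduled out */
};

/*
 * Returns the index of a vcpu to yield to, or -1 if there is none.
 * *scanned reports how many other vcpus were examined, to make the
 * cost of the iteration visible.
 */
static int on_spin_pick_target(const struct vcpu_sketch *vcpus, size_t n,
			       size_t me, size_t *scanned)
{
	assert(me < n);
	*scanned = 0;

	/* Alone on our runqueue: likely undercommit, do not iterate. */
	if (vcpus[me].rq->nr_running == 1)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (i == me)
			continue;
		(*scanned)++;
		if (vcpus[i].preempted)
			return (int)i;   /* candidate for a directed yield */
	}
	return -1;
}
```

In the undercommit case the function returns before touching any other
vcpu, which is where I would expect the saving to come from.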

Looking back at the threads and discussions so far, here is my attempt
at a summary. I feel these are at least a few of the potential
candidates to go in:

1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
2) Dynamically changing PLE window (Avi/Andrew/Chegu)
3) preempt_notify handler to identify preempted VCPUs (Avi)
4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
6) PV spinlocks
7) Jiannan's proposed improvements
8) Defer preemption patches

Did we miss anything (or add anything extra)?

So here are my action items:
- I plan to repost this series with what PeterZ and Rik suggested, along
with performance analysis.
- I'll go back and explore (3) and (6).

Please let me know.





