Message-ID: <50810FB0.9000507@linux.vnet.ibm.com>
Date: Fri, 19 Oct 2012 14:00:40 +0530
From: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To: habanero@...ux.vnet.ibm.com
CC: Avi Kivity <avi@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Rik van Riel <riel@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Srikar <srikar@...ux.vnet.ibm.com>,
"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
chegu vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
Gleb Natapov <gleb@...hat.com>,
Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE
handler
On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
>> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
>>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
>>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
>>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
>>>>>> * Avi Kivity <avi@...hat.com> [2012-10-04 17:00:28]:
>>>>>>
>>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
>>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
>>>>>>>>>
>> [...]
>>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
>>>>> has just terrible scalability to begin with. I do not think we should
>>>>> try to optimize such a bad workload.
>>>>>
>>>>
>>>> I think my way of running dbench has some flaw, so I went to ebizzy.
>>>> Could you let me know how you generally run dbench?
>>>
>>> I mount a tmpfs and then specify that mount for dbench to run on. This
>>> eliminates all IO. I use a 300 second run time and number of threads is
>>> equal to number of vcpus. All of the VMs of course need to have a
>>> synchronized start.
>>>
>>> I would also make sure you are using a recent kernel for dbench, where
>>> the dcache scalability is much improved. Without any lock-holder
>>> preemption, the time in spin_lock should be very low:
>>>
>>>
>>>> 21.54% 78016 dbench [kernel.kallsyms] [k] copy_user_generic_unrolled
>>>> 3.51% 12723 dbench libc-2.12.so [.] __strchr_sse42
>>>> 2.81% 10176 dbench dbench [.] child_run
>>>> 2.54% 9203 dbench [kernel.kallsyms] [k] _raw_spin_lock
>>>> 2.33% 8423 dbench dbench [.] next_token
>>>> 2.02% 7335 dbench [kernel.kallsyms] [k] __d_lookup_rcu
>>>> 1.89% 6850 dbench libc-2.12.so [.] __strstr_sse42
>>>> 1.53% 5537 dbench libc-2.12.so [.] __memset_sse2
>>>> 1.47% 5337 dbench [kernel.kallsyms] [k] link_path_walk
>>>> 1.40% 5084 dbench [kernel.kallsyms] [k] kmem_cache_alloc
>>>> 1.38% 5009 dbench libc-2.12.so [.] memmove
>>>> 1.24% 4496 dbench libc-2.12.so [.] vfprintf
>>>> 1.15% 4169 dbench [kernel.kallsyms] [k] __audit_syscall_exit
>>>
>>
>> Hi Andrew,
>> I ran the dbench test on tmpfs. I do not see any improvement in
>> dbench with the 16k ple window.
>>
>> So it seems that, apart from ebizzy, no workload benefited from it, and I
>> agree that it may not be good to optimize for ebizzy.
>> I shall drop the change to a 16k default window and continue with the
>> original patch series. I need to experiment with the latest kernel.
>
> Thanks for running this again. I do believe there are some workloads
> that, when run at 1x overcommit, would benefit from a larger ple_window
> [with the current ple handling code], but I also do not want to
> potentially degrade >1x with a larger window. I do, however, think there
> may be another option. I have not fully worked this out, but I think I
> am on to something.
>
> I decided to revert to just a yield() instead of a yield_to(). My
> motivation was that yield_to() [for large VMs] is like a dog chasing its
> tail, round and round we go.... Just yield(), in particular a yield()
> which results in yielding to something -other- than the current VM's
> vcpus, helps synchronize the execution of sibling vcpus by deferring
> them until the lock holder vcpu is running again. The more we can do to
> get all vcpus running at the same time, the less we have to deal with
> the preemption problem. The other benefit is that yield() is far, far
> lower overhead than yield_to().
>
> This does assume that vcpus from the same VM do not share runqueues.
> Yielding to a sibling vcpu with yield() is not productive for larger VMs
> in the same way that yield_to() is not. My recent results include
> restricting vcpu placement so that sibling vcpus do not get to run on
> the same runqueue. I do believe we could implement an initial placement
> and load balance policy to strive for this restriction (making it purely
> optional, but I bet it could also help user apps which use spin locks).
>
> For 1x VMs which still vm_exit due to PLE, I believe we could probably
> just leave the ple_window alone, as long as we mostly use yield()
> instead of yield_to(). The problem with the unneeded exits in this case
> has been the overhead in routines leading up to yield_to() and the
> yield_to() itself. If we use yield() most of the time, this overhead
> will go away.
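
(Just to check my understanding of the fallback: on its own it is roughly
the below - a stripped-down sketch only, not your actual change, with all
of the current candidate scanning removed.)

/* sketch only: PLE exit path falling back to a plain yield() */
static void ple_handler_sketch(struct kvm_vcpu *me)
{
	/*
	 * Instead of scanning the VM for a yield_to() target (and paying
	 * for double_rq_lock etc. inside yield_to()), just give up the cpu.
	 * With sibling vcpus kept on separate runqueues this tends to run
	 * some other VM's task, deferring the spinning vcpu until its
	 * lock holder has had a chance to run again.
	 */
	yield();
}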
>
> Here is a comparison of yield_to() and yield():
>
> dbench with 20-way VMs, 8 of them on 80-way host:
>
> no PLE 426 +/- 11.03%
> no PLE w/ gangsched 32001 +/- .37%
> PLE with yield() 29207 +/- .28%
> PLE with yield_to() 8175 +/- 1.37%
>
> Yield() is far and away better than yield_to() here and almost reaches
> the gang sched result. Here is a link to the perf sched map bitmap:
>
> https://docs.google.com/open?id=0B6tfUNlZ-14weXBfVnFFZGw1akU
>
> The thrashing is way down and sibling vcpus tend to run together,
> approximating the behavior of the gang scheduling without needing to
> actually implement gang scheduling.
>
> I did test a smaller VM:
>
> dbench with 10-way VMs, 16 of them on 80-way host:
>
> no PLE 6248 +/- 7.69%
> no PLE w/ gangsched 28379 +/- .07%
> PLE with yield() 29196 +/- 1.62%
> PLE with yield_to() 32217 +/- 1.76%
Hi Andrew, the results are encouraging.
>
> There is some degradation with yield() compared to yield_to() here, but
> it is not nearly as large as the uplift we see on the larger VMs.
> Regardless, I have an idea to fix that: instead of using yield() all the
> time, we could use yield_to(), but limit the rate per vcpu to something
> like one per jiffy. All other exits use yield(). That rate of yield_to()
> should be more than enough for the smaller VMs, and the result should
> hopefully be just the same as with the current code. I have not coded
> this up yet, but it's my next step.
I personally feel that rate-limiting yield_to() may be a good idea.
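Something along the lines of the below, perhaps - just a sketch to make
the idea concrete; last_boost would be a hypothetical new per-vcpu field,
and where exactly the check sits in the handler is still open:

/* sketch: allow at most one directed yield per vcpu per jiffy          */
/* (last_boost is a hypothetical new unsigned long in struct kvm_vcpu)  */
static bool may_yield_to(struct kvm_vcpu *vcpu)
{
	if (time_before(jiffies, vcpu->last_boost + 1))
		return false;		/* already boosted in this jiffy */
	vcpu->last_boost = jiffies;
	return true;
}

	/* in the ple handler: directed yield if allowed, plain yield() otherwise */
	if (may_yield_to(me))
		/* ... existing yield_to() candidate scan ... */;
	else
		yield();

That keeps the common exit path cheap while still allowing roughly one
boost per vcpu per jiffy, which as you say should be plenty for the
smaller VMs.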
>
> I am also hopeful that rate-limiting yield_to() will make the 1x issue
> just go away as well (even with a 4096 ple_window). The vast majority of
> exits will result in yield(), which should be harmless.
>
> Keep in mind this did require ensuring that sibling vcpus do not share
> host runqueues - I do think that is possible given some optional
> scheduler tweaks.
I think the placement requirement is a concern. Having the rate limit
alone may suffice; perhaps tuning it to also take the
overcommitted/non-overcommitted scenario into account would be better.
Okay, below is the V2 implementation I am experimenting with:

1) Check the source -and- target runq to decide on exiting the ple
   handler early.
2)
vcpu_on_spin()
{
    .....
    if yield_to() within the same VM did not succeed and we are overcommitted
        yield()
}
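
In slightly more concrete (but still placeholder) form it would look
something like the below - runq_has_waiters(), try_yield_to() and
we_are_overcommitted() are stand-ins, not existing helpers, and I have
split (1) into a source check and a per-target check, which may not be
exactly what the final patch does:

/* sketch of the V2 idea above; helper names are placeholders */
static void vcpu_on_spin_sketch(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	bool yielded = false;
	int i;

	/* (1) source runq has nothing else to run: likely undercommitted,
	 *     so exit the handler without yielding at all                 */
	if (!runq_has_waiters(me))
		return;

	kvm_for_each_vcpu(i, vcpu, me->kvm) {
		/* (1) likewise skip targets whose runq has nothing waiting */
		if (!runq_has_waiters(vcpu))
			continue;
		if (try_yield_to(vcpu)) {   /* directed yield inside the VM */
			yielded = true;
			break;
		}
	}

	/* (2) no sibling vcpu could be boosted and we are overcommitted:
	 *     fall back to an undirected yield()                          */
	if (!yielded && we_are_overcommitted())
		yield();
}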
I think combining your ideas with (2) complicates the scenario a bit;
anyway, let me see how my experiment goes. I will also check how yield()
performs without any pinning.