linux-kernel - Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler


Date:	Wed, 10 Oct 2012 23:13:18 +0530
From:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To:	habanero@...ux.vnet.ibm.com
CC:	Avi Kivity <avi@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Rik van Riel <riel@...hat.com>,
	"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Srikar <srikar@...ux.vnet.ibm.com>,
	"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
	KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
	chegu vinod <chegu_vinod@...com>,
	LKML <linux-kernel@...r.kernel.org>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
	Gleb Natapov <gleb@...hat.com>,
	Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE
 handler

On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> I ran 'perf sched map' on the dbench workload for medium and large VMs,
> and I thought I would share some of the results.  I think it helps to
> visualize what's going on regarding the yielding.
>
> These files are png bitmaps, generated from processing output from 'perf
> sched map' (and perf data generated from 'perf sched record').  The Y
> axis is the host cpus, each row being 10 pixels high.  For these tests,
> there are 80 host cpus, so the total height is 800 pixels.  The X axis
> is time (in microseconds), with each pixel representing 1 microsecond.
> Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
> obviously, and zooming in/out while viewing is recommended.
>
> Each row (each host cpu) is assigned a color based on what thread is
> running.  vCPUs of the same VM are assigned a common color (like red,
> blue, magenta, etc), and each vCPU has a unique brightness for that
> color.  There are a maximum of 12 assignable colors, so any VMs beyond
> the 12th revert to a vCPU color of gray.  I would use more colors, but it
> becomes harder to distinguish one color from another.  The white color
> represents missing data from perf, and black color represents any thread
> which is not a vCPU.
>
> For the following tests, VMs were pinned to host NUMA nodes and to
> specific cpus to help with consistency and operate within the
> constraints of the last test (gang scheduler).
>
> Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
> described above, only 12 of the VMs have a color; the rest are gray).
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This is a very nice way to visualize what is happening. The beginning of
the graph looks a little messy, but later it becomes clear.
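
The layout described in the quoted text can be sketched roughly as follows. This is only one reading of the description, not the script that produced the bitmaps: the function names, the palette contents and the brightness formula are assumptions made for illustration. Only the 10-pixel rows, the 80 host cpus, 1 microsecond per pixel, the 30,000 microsecond window and the 12-color limit come from the mail.

```python
# Geometry taken from the description in the mail.
ROW_HEIGHT = 10       # pixels per host cpu row
HOST_CPUS = 80
US_PER_PIXEL = 1      # one pixel per microsecond
WINDOW_US = 30_000    # each bitmap plots 30,000 microseconds
MAX_COLORS = 12       # maximum number of assignable VM colors

WHITE = (255, 255, 255)   # missing data from perf
BLACK = (0, 0, 0)         # any thread that is not a vCPU
GRAY = (128, 128, 128)    # vCPUs of VMs beyond the 12th

# Hypothetical palette: the mail only names red, blue and magenta.
PALETTE = [
    (255, 0, 0), (0, 0, 255), (255, 0, 255), (0, 255, 0),
    (0, 255, 255), (255, 255, 0), (255, 128, 0), (128, 0, 255),
    (128, 255, 0), (255, 0, 128), (0, 128, 255), (0, 255, 128),
]


def bitmap_size():
    """Return (width, height) of one bitmap in pixels."""
    return (WINDOW_US // US_PER_PIXEL, HOST_CPUS * ROW_HEIGHT)


def pixel_rect(cpu, start_us, end_us):
    """Return (x0, y0, x1, y1) for a thread running on `cpu` during
    [start_us, end_us); x1 and y1 are exclusive."""
    y0 = cpu * ROW_HEIGHT
    return (start_us // US_PER_PIXEL, y0,
            end_us // US_PER_PIXEL, y0 + ROW_HEIGHT)


def thread_color(vm_index, vcpu_index, vcpus_per_vm):
    """Common color per VM, unique brightness per vCPU.

    vm_index is None for threads that are not vCPUs.
    The brightness ramp (102..255) is an assumption, not from the mail.
    """
    if vm_index is None:
        return BLACK
    if vm_index >= MAX_COLORS:
        return GRAY
    level = 102 + (153 * vcpu_index) // max(vcpus_per_vm - 1, 1)
    return tuple(c * level // 255 for c in PALETTE[vm_index])
```

Integer arithmetic is used for the brightness so that each vCPU of a 10-way or 20-way VM maps to a distinct shade without floating-point rounding surprises.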

>
> If you zoom out and look at the whole bitmap, you may notice the 4ms
> intervals of the scheduler.  They are pretty well aligned across all
> cpus.  Normally, for cpu bound workloads, we would expect to see each
> thread run for 4 ms, then something else get to run, and so on.
> That is mostly true in this test.  We have 2x over-commit and we
> generally see the switching of threads at 4ms.  One thing to note is
> that not all vCPU threads for the same VM run at exactly the same time,
> and that is expected and the whole reason for lock-holder preemption.
> Now, if you zoom in on the bitmap, you should notice within the 4ms
> intervals there is some task switching going on.  This is most likely
> because of the yield_to initiated by the PLE handler.  In this case
> there is not that much yielding to do.   It's quite clean, and the
> performance is quite good.
>
> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> CPU over-commit is still 2x.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

I think this link still points to the 10-way x 16 VM case (it is the same
URL as above). Could you paste the link for the 20-way case again?
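
For reference, the two PLE configurations in the quoted text put the same number of vCPUs on the 80 host cpus, so the difference between them is the VM width rather than the over-commit ratio. A small arithmetic check, where the helper name is hypothetical and the figures are the ones quoted above:

```python
def overcommit(num_vms, vcpus_per_vm, host_cpus=80):
    """CPU over-commit ratio: total vCPUs divided by host cpus."""
    return num_vms * vcpus_per_vm / host_cpus


ten_way = overcommit(16, 10)     # 16 VMs, 10 vCPUs each
twenty_way = overcommit(8, 20)   # 8 VMs, 20 vCPUs each
print(ten_way, twenty_way)
```

Both cases come to 160 vCPUs on 80 host cpus, which matches the 2x over-commit stated for each test.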

>
> This one looks quite different.  In short, it's a mess.  The switching
> between tasks can be lower than 10 microseconds.  It basically never
> recovers.  There is constant yielding all the time.
>
> Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> scheduling patches.  While I am not recommending gang scheduling, I
> think it's a good data point.  The performance is 3.88x the PLE result.
>
> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
>
> Note that the task switching intervals of 4ms are quite obvious again,
> and this time all vCPUs from the same VM run at the same time.  It
> represents the best possible outcome.
>
>
> Anyway, I thought the bitmaps might help better visualize what's going
> on.
>
> -Andrew
>
>
>
>
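
The contrast described in the quoted text (switches at roughly 4 ms in the clean cases, versus under 10 microseconds in the 20-way PLE case) could be quantified for one host cpu by measuring how long each thread stays on it. A minimal sketch, assuming one bitmap row is available as a list with one thread id per microsecond; that input format and the function name are hypothetical and are not something 'perf sched map' produces directly.

```python
from itertools import groupby


def run_lengths_us(row):
    """Return the length, in microseconds, of each consecutive run of the
    same thread id in `row` (one entry per microsecond on one host cpu)."""
    return [sum(1 for _ in group) for _, group in groupby(row)]


# Two threads alternating on 4 ms boundaries, as in the clean cases.
clean = ["vm0-vcpu3"] * 4000 + ["vm1-vcpu3"] * 4000
print(run_lengths_us(clean))
```

A row dominated by runs of a few microseconds would correspond to the constant yielding described for the 8 x 20-way case.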
