Message-ID: <5076FE52.6010501@linux.vnet.ibm.com>
Date: Thu, 11 Oct 2012 22:43:54 +0530
From: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To: habanero@...ux.vnet.ibm.com
CC: Avi Kivity <avi@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Rik van Riel <riel@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, Ingo Molnar <mingo@...hat.com>,
Marcelo Tosatti <mtosatti@...hat.com>,
Srikar <srikar@...ux.vnet.ibm.com>,
"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
chegu vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
Gleb Natapov <gleb@...hat.com>,
Andrew Jones <drjones@...hat.com>
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE
handler
On 10/11/2012 12:57 AM, Andrew Theurer wrote:
> On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
>> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
>>> I ran 'perf sched map' on the dbench workload for medium and large VMs,
>>> and I thought I would share some of the results. I think it helps to
>>> visualize what's going on regarding the yielding.
>>>
>>> These files are png bitmaps, generated by processing the output of
>>> 'perf sched map' (on perf data recorded with 'perf sched record').
>>> The Y axis is the host cpus, each row being 10 pixels high. For these
>>> tests, there are 80 host cpus, so the total height is 800 pixels. The
>>> X axis is time, with each pixel representing 1 microsecond. Each
>>> bitmap plots 30,000 microseconds. The bitmaps are obviously quite
>>> wide, so zooming in/out while viewing is recommended.
>>>
>>> Each row (each host cpu) is colored based on which thread is running
>>> at that moment. vCPUs of the same VM are assigned a common color
>>> (like red, blue, magenta, etc.), and each vCPU has a unique brightness
>>> of that color. There is a maximum of 12 assignable colors, so the
>>> vCPUs of any VMs beyond the first 12 revert to gray. I would use more
>>> colors, but it becomes harder to distinguish one color from another.
>>> White represents missing data from perf, and black represents any
>>> thread which is not a vCPU.
>>>
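For anyone who wants to reproduce this kind of map: the color rule
above is easy to restate in code. Below is a rough C reconstruction,
not the actual plotting script -- the palette values, the brightness
range and the pixel_colour() helper are my own guesses. Pixels with no
sample would simply stay white (the background), per the "missing data"
rule.

/* One hue per VM, a unique brightness per vCPU, gray past 12 VMs,
 * black for non-vCPU threads.  White (missing data) is just the
 * untouched background, so it never reaches this function. */
#include <stdio.h>

struct rgb { unsigned char r, g, b; };

static const struct rgb base[12] = {            /* 12 assignable hues */
	{255,0,0},   {0,0,255},   {255,0,255}, {0,255,0},
	{0,255,255}, {255,255,0}, {255,128,0}, {128,0,255},
	{0,128,255}, {255,0,128}, {128,255,0}, {0,255,128},
};

static struct rgb pixel_colour(int vm, int vcpu, int nr_vcpus)
{
	if (vm < 0)
		return (struct rgb){0, 0, 0};   /* not a vCPU: black */
	if (vm >= 12) {                         /* past the palette: gray */
		unsigned char g = 64 + vcpu * 191 / nr_vcpus;
		return (struct rgb){g, g, g};
	}
	/* scale the VM's hue by a per-vCPU brightness in ~[40%, 100%] */
	int div = nr_vcpus > 1 ? nr_vcpus - 1 : 1;
	int bright = 40 + vcpu * 60 / div;
	struct rgb c = base[vm];
	c.r = c.r * bright / 100;
	c.g = c.g * bright / 100;
	c.b = c.b * bright / 100;
	return c;
}

int main(void)
{
	struct rgb c = pixel_colour(3, 7, 10);  /* VM 3, vCPU 7 of 10 */
	printf("#%02x%02x%02x\n", c.r, c.g, c.b);
	return 0;
}

Each (host cpu, microsecond) pixel is then painted with the color of
whatever thread perf saw there.
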
>>> For the following tests, VMs were pinned to host NUMA nodes and to
>>> specific cpus, both for consistency and to operate within the
>>> constraints of the last test (gang scheduler).
>>>
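Pinning like this can be done at several layers (libvirt, taskset,
cgroups); whichever tool is used, per vCPU thread it boils down to a
sched_setaffinity() call. A minimal illustration -- not the harness
actually used for these tests:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(3, &set);       /* bind to host cpu 3, for example */

	/* pid 0 = the calling thread, e.g. a qemu vCPU thread */
	if (sched_setaffinity(0, sizeof(set), &set))
		perror("sched_setaffinity");
	else
		printf("pinned to cpu 3\n");
	return 0;
}
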
>>> Here is a good example of PLE. These are 10-way VMs, 16 of them (as
>>> described above, only 12 of the VMs have a color; the rest are gray).
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> This is a very nice way to visualize what is happening. The beginning
>> of the graph looks a little messy, but later it is clear.
>>
>>>
>>> If you zoom out and look at the whole bitmap, you may notice the 4ms
>>> intervals of the scheduler. They are pretty well aligned across all
>>> cpus. Normally, for cpu-bound workloads, we would expect to see each
>>> thread run for 4 ms, then something else get to run, and so on. That
>>> is mostly true in this test. We have 2x over-commit, and we generally
>>> see threads switching at 4 ms. One thing to note is that not all vCPU
>>> threads of the same VM run at exactly the same time; that is expected,
>>> and it is the whole reason lock-holder preemption occurs.
>>> Now, if you zoom in on the bitmap, you should notice that within the
>>> 4 ms intervals there is some task switching going on. This is most
>>> likely because of the yield_to initiated by the PLE handler. In this
>>> case there is not that much yielding to do; it's quite clean, and the
>>> performance is quite good.
>>>
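For anyone following the thread without the source at hand, the
decision the PLE handler makes on each such exit is roughly the
following. This is a simplified userspace model of the candidate scan
done by kvm_vcpu_on_spin()/yield_to in the kernel -- the eligibility
heuristics (and the undercommit check this RFC adds) are left out:

/* Toy model of the PLE-triggered directed yield. */
#include <stdio.h>
#include <stdbool.h>

#define NR_VCPUS 10

struct vcpu {
	int id;
	bool running;           /* currently on a host cpu */
};

struct vm {
	struct vcpu vcpus[NR_VCPUS];
	int last_boosted;       /* rotate the start point for fairness */
};

/* vcpu 'me' took a PLE exit: find a preempted sibling (possibly the
 * lock holder) and donate the remaining timeslice to it. */
static int ple_pick_yield_target(struct vm *vm, int me)
{
	for (int i = 1; i <= NR_VCPUS; i++) {
		int cand = (vm->last_boosted + i) % NR_VCPUS;

		if (cand == me)
			continue;
		if (vm->vcpus[cand].running)
			continue;   /* already running, nothing to boost */

		vm->last_boosted = cand;
		return cand;        /* real code would yield_to() its task */
	}
	return -1;                  /* nobody to boost */
}

int main(void)
{
	struct vm vm = { .last_boosted = 0 };

	for (int i = 0; i < NR_VCPUS; i++)
		vm.vcpus[i] = (struct vcpu){ .id = i, .running = i < 5 };

	/* vcpu 2 spins in the guest and PLE fires: */
	printf("yield_to vcpu %d\n", ple_pick_yield_target(&vm, 2));
	return 0;
}

Each successful pick becomes the next scan's starting point, which is
what spreads the boosts around.
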
>>> Below is an example of PLE, but this time with 20-way VMs, 8 of them.
>>> CPU over-commit is still 2x.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>>
>> I think this link is still the 10x16 one. Could you paste it again?
>
> Oops
> https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ
>
>>
>>>
>>> This one looks quite different. In short, it's a mess. The interval
>>> between task switches can be lower than 10 microseconds. It basically
>>> never recovers; there is constant yielding all the time.
>>>
>>> Below is again the 8 x 20-way VM case, but this time I tried out
>>> Nikunj's gang scheduling patches. While I am not recommending gang
>>> scheduling, I think it's a good data point. The performance is 3.88x
>>> the PLE result.
>>>
>>> https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
Yes.. we see a lot of yields.