Message-ID: <1341870457.2909.27.camel@oc2024037011.ibm.com>
Date: Mon, 09 Jul 2012 16:47:37 -0500
From: Andrew Theurer <habanero@...ux.vnet.ibm.com>
To: Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@...or.com>,
Thomas Gleixner <tglx@...utronix.de>,
Marcelo Tosatti <mtosatti@...hat.com>,
Ingo Molnar <mingo@...hat.com>, Avi Kivity <avi@...hat.com>,
Rik van Riel <riel@...hat.com>,
S390 <linux-s390@...r.kernel.org>,
Carsten Otte <cotte@...ibm.com>,
Christian Borntraeger <borntraeger@...ibm.com>,
KVM <kvm@...r.kernel.org>, chegu vinod <chegu_vinod@...com>,
LKML <linux-kernel@...r.kernel.org>, X86 <x86@...nel.org>,
Gleb Natapov <gleb@...hat.com>, linux390@...ibm.com,
Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
Joerg Roedel <joerg.roedel@....com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> Currently the Pause Loop Exit (PLE) handler does a directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.
Hi, Raghu.
> The problem is that, for large-vcpu guests, we have a higher probability of
> yielding to a bad vcpu. We are not able to prevent a directed yield to the
> same vcpu that did a PL exit recently and perhaps spins again, wasting CPU.
>
> Fix that by keeping track of who has done a PL exit. The algorithm in this
> series gives a chance to a VCPU which has:
>
> (a) Not done a PLE exit at all (probably it is a preempted lock-holder)
>
> (b) Been skipped in the last iteration because it did a PL exit, and has
> probably become eligible now (next eligible lock holder)
>
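To make (a)/(b) concrete, here is a minimal sketch of the eligibility test
as I read the description; the struct and field names are made up for
illustration, and this is not the actual patch:

  #include <stdbool.h>   /* userspace stand-in; kernel code gets bool
                          * from <linux/types.h> */

  /* Hypothetical per-vcpu state, maintained by the PLE handler. */
  struct ple_state {
          bool did_pl_exit;   /* set on every pause-loop exit */
          bool skipped;       /* set when passed over by directed yield */
  };

  static bool eligible_for_directed_yield(struct ple_state *p)
  {
          /* (a) never PL-exited: likely a preempted lock holder */
          if (!p->did_pl_exit)
                  return true;
          /* (b) PL-exited earlier but was skipped last round, so it may
           * be the next eligible lock holder by now */
          if (p->skipped) {
                  p->skipped = false;
                  p->did_pl_exit = false;
                  return true;
          }
          /* otherwise pass it over this round and remember that */
          p->skipped = true;
          return false;
  }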
> Future enhancements:
> (1) Currently we have a boolean to decide on the eligibility of a vcpu. It
> would be nice to get feedback on bigger guests (>32 vcpus) on whether we
> can do better with an integer counter (with counter = say f(log n)); see
> the sketch after this list.
>
> (2) We have not considered system load during the iteration over vcpus. With
> that information we can limit the scan and also decide whether schedule()
> is better. [ I am able to use the number of kicked vcpus to decide on this,
> but maybe there are better ideas, like using information from the global
> loadavg. ]
>
> (3) We can exploit this further with the PV patches, since they also know
> about the next eligible lock-holder.
>
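On (1), a hedged sketch of what the integer-counter variant might look like
(again with made-up names): a vcpu that keeps PL-exiting has to be passed
over some threshold number of times, say f(log n), before it becomes
eligible again.

  /* Hypothetical counter variant of the eligibility test above. */
  struct ple_counter_state {
          bool     did_pl_exit;
          unsigned skip_count;    /* times passed over since last PL exit */
  };

  static bool eligible_by_counter(struct ple_counter_state *p,
                                  unsigned threshold)  /* ~ f(log n) */
  {
          if (!p->did_pl_exit)
                  return true;
          if (++p->skip_count >= threshold) {
                  p->skip_count = 0;
                  p->did_pl_exit = false;
                  return true;
          }
          return false;
  }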
> Summary: There is a huge improvement for the moderate / no-overcommit
> scenario for a kvm-based guest on a PLE machine (which is difficult ;) ).
>
> Result:
> Base : kernel 3.5.0-rc5 with Rik's PLE handler fix
>
> Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM,
> 32 core machine
Is this with HT enabled, therefore 64 CPU threads?
> Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
> with test kernels
>
> Guest: fedora 16 with 32 vcpus, 8GB memory.
Can you briefly explain the 1x and 2x configs? This of course is highly
dependent on whether or not HT is enabled...
FWIW, I started testing what I would call "0.5x", where I have one 40
vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
enabled, no extra load on the system). For ebizzy, the results are
quite erratic from run to run, so I am inclined to discard it as a
workload, but maybe I should try "1x" and "2x" cpu over-commit as well.
From initial observations, at least for the ebizzy workload, the
percentage of exits that result in a yield_to() is very low, around 1%,
before these patches. So, I am concerned that at least for this test,
reducing that number even more has diminishing returns. I am however
still concerned about the scalability problem with yield_to(), which
shows up like this for me (perf):
> 63.56% 282095 qemu-kvm [kernel.kallsyms] [k] _raw_spin_lock
> 5.42% 24420 qemu-kvm [kvm] [k] kvm_vcpu_yield_to
> 5.33% 26481 qemu-kvm [kernel.kallsyms] [k] get_pid_task
> 4.35% 20049 qemu-kvm [kernel.kallsyms] [k] yield_to
> 2.74% 15652 qemu-kvm [kvm] [k] kvm_apic_present
> 1.70% 8657 qemu-kvm [kvm] [k] kvm_vcpu_on_spin
> 1.45% 7889 qemu-kvm [kvm] [k] vcpu_enter_guest
For the cpu threads in the host that are actually active (in this case
1/2 of them), ~50% of their time is in the kernel and ~43% in the guest.
This is for a no-IO workload, so it's just incredible to see so much CPU
wasted. I feel the two important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits in the first place
(hopefully by just tuning ple_window for the latter).
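To show where the runqueue lock contention in that profile comes from, here
is the rough shape of yield_to() as I remember it from kernel/sched/core.c
around v3.5. This is a simplified paraphrase with error paths elided, not a
verbatim copy, and it assumes kernel context (it is not
standalone-compilable):

  /* Simplified paraphrase of yield_to(): both runqueue locks are taken
   * for every directed yield, which is what shows up as _raw_spin_lock
   * in the profile above. */
  bool yield_to_sketch(struct task_struct *p, bool preempt)
  {
          struct task_struct *curr = current;
          struct rq *rq = this_rq(), *p_rq = task_rq(p);
          bool yielded = false;
          unsigned long flags;

          local_irq_save(flags);
          double_rq_lock(rq, p_rq);       /* the contended part */
          if (curr->sched_class == p->sched_class &&
              curr->sched_class->yield_to_task)
                  yielded = curr->sched_class->yield_to_task(rq, p, preempt);
          double_rq_unlock(rq, p_rq);
          local_irq_restore(flags);

          if (yielded)
                  schedule();
          return yielded;
  }

With 32+ vcpus all pause-looping and calling yield_to() concurrently, those
double_rq_lock() acquisitions collide, which is consistent with the 63%
_raw_spin_lock entry at the top of the profile.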
Honestly, I am not confident that addressing this problem will improve the
ebizzy score. That workload is so erratic for me that I do not trust
the results at all. I have, however, seen consistent improvements from
disabling PLE for an http guest workload and a very high IOPS guest
workload, both with much time spent in the host in the double runqueue lock
for yield_to(), so that's why I still gravitate toward that issue.
-Andrew Theurer