Message-ID: <504A37B0.7020605@linux.vnet.ibm.com>
Date:	Fri, 07 Sep 2012 23:36:40 +0530
From:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
To:	habanero@...ux.vnet.ibm.com
CC:	Avi Kivity <avi@...hat.com>, Marcelo Tosatti <mtosatti@...hat.com>,
	Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
	Srikar <srikar@...ux.vnet.ibm.com>, KVM <kvm@...r.kernel.org>,
	chegu vinod <chegu_vinod@...com>,
	LKML <linux-kernel@...r.kernel.org>, X86 <x86@...nel.org>,
	Gleb Natapov <gleb@...hat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
	Peter Zijlstra <peterz@...radead.org>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> I have noticed recently that PLE/yield_to() is still not that scalable
> for really large guests, sometimes even with no CPU over-commit.  I have
> a small change that makes a very big difference.
>
> First, let me explain what I saw:
>
> Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> thread Westmere-EX system:  645 seconds!
>
> Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
> runqueue lock for yield_to()
>
> So, I added some schedstats to yield_to(): one to count when we fail
> this check in yield_to()
>
>      if (task_running(p_rq, p) || p->state)
>
> and one when we pass all the conditions and get to actually yield:
>
>       yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>
>
> And during boot up of this guest, I saw:
>
>
> failed yield_to() because task is running: 8368810426
> successful yield_to(): 13077658
>                        0.156022% of yield_to calls
>                        1 out of 640 yield_to calls
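(Side note, in case anyone wants to reproduce the measurement: the
instrumentation described above would look something like the sketch
below.  The two counter names are made up, they are not in mainline;
they would sit in struct rq next to the existing yld_count, under
CONFIG_SCHEDSTATS, and dumping them needs a matching tweak to
show_schedstat().)

	/* hypothetical fields added to struct rq, CONFIG_SCHEDSTATS only */
	unsigned int yld_to_fail_running;	/* target running, or not runnable */
	unsigned int yld_to_success;		/* yield_to_task() actually yielded */

	/* in yield_to(), around the existing checks */
	if (task_running(p_rq, p) || p->state) {
		schedstat_inc(rq, yld_to_fail_running);
		goto out;
	}
	...
	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
	if (yielded)
		schedstat_inc(rq, yld_to_success);
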
>
> Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
> each one trying to get two locks.  This is happening on all [but one]
> vcpus at around the same time.  Not going to work well.
>

True and interesting. I had once thought of reducing the overall O(n^2)
iterations to O(n log(n)) by reducing the number of candidates to search
on each exit to O(log(n)) instead of the current O(n). Maybe I have to
get back to my experiments.
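
Something like the below is what I had in mind (untested, purely to
illustrate; it reuses the existing kvm_vcpu_yield_to() path and caps
the candidate scan, everything else here is hypothetical):

	/*
	 * Untested sketch against virt/kvm/kvm_main.c: on each PLE exit,
	 * try at most ~log2(n) candidates after last_boosted_vcpu instead
	 * of scanning all n vcpus, so the total work across n spinning
	 * vcpus drops from O(n^2) toward O(n log(n)).
	 */
	void kvm_vcpu_on_spin(struct kvm_vcpu *me)
	{
		struct kvm *kvm = me->kvm;
		struct kvm_vcpu *vcpu;
		int n = atomic_read(&kvm->online_vcpus);
		int start = kvm->last_boosted_vcpu;
		int budget = ilog2(n) + 1;	/* candidates we will try */
		int idx, i;

		for (i = 1; i < n && budget; i++) {
			idx = (start + i) % n;
			vcpu = kvm_get_vcpu(kvm, idx);
			if (!vcpu || vcpu == me)
				continue;
			budget--;
			if (kvm_vcpu_yield_to(vcpu)) {
				kvm->last_boosted_vcpu = idx;
				break;
			}
		}
	}

How to pick good O(log(n)) candidates, rather than just the next few
in round-robin order, is the part that needs the experiments.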

> So, since the check for a running task is nearly always true, I moved
> that -before- the double runqueue lock, so 99.84% of the attempts do not
> take the locks.  Now, I do not know if this [not getting the locks] is a
> problem.  However, I'd rather have a slightly inaccurate test for a
> running vcpu than burn 98% of the CPU in the host kernel.  With the change,
> the VM boot time went down to 100 seconds, an 85% reduction.
>
> I also wanted to check to see this did not affect truly over-committed
> situations, so I first started with smaller VMs at 2x cpu over-commit:
>
> 16 VMs, 8-way each, all running dbench (2x cpu over-commit)
>             throughput +/- stddev
>                 -----     -----
> ple off:        2281 +/- 7.32%  (really bad as expected)
> ple on:        19796 +/- 1.36%
> ple on: w/fix: 19796 +/- 1.37%  (no degrade at all)
>
> In this case the VMs are small enough that we do not loop through
> enough vcpus to trigger the problem.  Host CPU is very low (3-4% range)
> for both default ple and with yield_to() fix.
>
> So I went on to a bigger VM:
>
> 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
>             throughput +/- stddev
>                 -----     -----
> ple on:         2552 +/- .70%
> ple on: w/fix:  4621 +/- 2.12%  (81% improvement!)
>
> This is where we start seeing a major difference.  Without the fix, host
> cpu was around 70%, mostly in spin_lock.  That was reduced to 60% (and
> guest went from 30 to 40%).  I believe this is on the right track to
> reduce the spin lock contention, still get proper directed yield, and
> therefore improve the CPU available to the guest, and its performance.
>
> However, we still have lock contention, and I think we can reduce it
> even more.  We have eliminated some attempts at double runqueue lock
> acquire, because the check for whether the target vcpu is running now
> comes before the lock.  However, even if the target-to-yield-to vcpu [for
> the same guest upon which we PLE exited] is not running, the physical
> processor/runqueue that the target-to-yield-to vcpu is located on could be
> running a different VM's vcpu -and- going through a directed yield;
> therefore that runqueue lock may already be acquired.  We do not want to
> just spin and wait, we want to move to the next candidate vcpu.  We need
> a check to see if the smp processor/runqueue is already in a directed
> yield.  Or, perhaps we just check if that cpu is not in guest mode, and
> if so, we skip that yield attempt for that vcpu and move to the next
> candidate vcpu.  So, my question is:  given a runqueue, what's the best
> way to check if that corresponding phys cpu is not in guest mode?
>

We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin path.  Doesn't that suffice?
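
For reference, the relevant bit of kvm_vcpu_yield_to() in
virt/kvm/kvm_main.c looks roughly like this (paraphrased from memory,
so treat it as a sketch rather than the exact code):

	struct task_struct *task = NULL;
	struct pid *pid;
	bool yielded = false;

	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);
	rcu_read_unlock();
	if (!task)
		return false;
	if (task->flags & PF_VCPU) {
		/*
		 * The target vcpu thread is currently executing guest
		 * code, i.e. it is running and cannot be a useful yield
		 * target, so bail out before yield_to() ever touches
		 * the runqueue locks.
		 */
		put_task_struct(task);
		return false;
	}
	yielded = yield_to(task, 1);
	put_task_struct(task);
	return yielded;

So a vcpu whose task is in guest mode is skipped without taking any
runqueue lock at all.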

> Here's the changes so far (schedstat changes not included here):
>
> Signed-off-by: Andrew Theurer <habanero@...ux.vnet.ibm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..f8eff8c 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
>
>   again:
>   	p_rq = task_rq(p);
> +	if (task_running(p_rq, p) || p->state) {
> +		goto out_no_unlock;
> +	}
>   	double_rq_lock(rq, p_rq);
>   	while (task_rq(p) != p_rq) {
>   		double_rq_unlock(rq, p_rq);
> @@ -4856,8 +4859,6 @@ again:
>   	if (curr->sched_class != p->sched_class)
>   		goto out;
>
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
>
>   	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   	if (yielded) {
> @@ -4879,6 +4880,7 @@ again:
>
>   out:
>   	double_rq_unlock(rq, p_rq);
> +out_no_unlock:
>   	local_irq_restore(flags);
>
>   	if (yielded)
>

