Date:	Thu, 13 Sep 2012 16:30:58 -0500
From:	Andrew Theurer <habanero@...ux.vnet.ibm.com>
To:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Avi Kivity <avi@...hat.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
	KVM <kvm@...r.kernel.org>, chegu vinod <chegu_vinod@...com>,
	LKML <linux-kernel@...r.kernel.org>, X86 <x86@...nel.org>,
	Gleb Natapov <gleb@...hat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer <habanero@...ux.vnet.ibm.com> [2012-09-11 13:27:41]:
> 
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
> > > >>>> +{
> > > >>>> +     if (!curr->sched_class->yield_to_task)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     if (curr->sched_class != p->sched_class)
> > > >>>> +             return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since it's not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> +     if (task_running(p_rq, p) || p->state)
> > > >>>> +             return false;
> > > >>>> +
> > > >>>> +     return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>>        rq = this_rq();
> > > >>>>
> > > >>>>   again:
> > > >>>> +     /* optimistic test to avoid taking locks */
> > > >>>> +     if (!__yield_to_candidate(curr, p))
> > > >>>> +             goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >> 	/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> > > >> 	if (p_rq->cfs_rq->skip)
> > > >> 		goto out_irq;
> > > >>>
> > > >>>
> > > >>>>        p_rq = task_rq(p);
> > > >>>>        double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > > 
> > > Andrew, yes, that is what I had. I think there was a misunderstanding.
> > > My intention was: if a directed yield has already happened in a runqueue
> > > (say rqA), do not bother to do a directed yield to it. But unfortunately,
> > > as PeterZ pointed out, that would have resulted in setting the next buddy
> > > of a different runqueue than rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do. Can we
> > > use the next buddy itself? ... thinking..
> > 
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes.  Here are your changes and my changes
> > compared:
> > 
> >                            throughput in MB/sec
> > 
> > kvm_vcpu_on_spin changes:  4636 +/- 15.74%
> > yield_to changes:          4515 +/- 12.73%
> > 
> > I would be inclined to stick with your changes which are kept in kvm
> > code.  I did try both combined, and did not get good results:
> > 
> > both changes:              4074 +/- 19.12%
> > 
> > So, having both is probably not a good idea.  However, I feel like
> > there's more work to be done.  With no over-commit (10 VMs), total
> > throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
> > overhead, but a reduction to ~4500 is still terrible.  By contrast,
> > 8-way VMs with 2x over-commit have a total throughput roughly 10% less
> > than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
> > host).  We still have what appears to be scalability problems, but now
> > it's not so much in runqueue locks for yield_to(), but now
> > get_pid_task():
> >
> 
> Hi Andrew,
> IMHO, reducing the double runqueue lock overhead is a good idea,
> and maybe we will see the benefits when we increase the overcommit further.
> 
> The explanation for not seeing a good benefit on top of the PLE handler
> optimization patch is that we filter the yield_to candidates,
> which results in less contention for the double runqueue lock,
> and the extra code overhead during a genuine yield_to might have resulted
> in some degradation in the case you tested.
> 
> However, did you use cfs.next also? I hope it helps when we combine.
> 
> Here is the result that is showing positive benefit.
> I experimented on a 32 core (no HT) PLE machine with 32 vcpu guest(s).
>   
> +-----------+-----------+-----------+------------+-----------+
>         kernbench time in sec, lower is better
> +-----------+-----------+-----------+------------+-----------+
>          base     stddev     patched      stddev    %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x    44.3880     1.8699     40.8180      1.9173     8.04271
> 2x    96.7580     4.2787     93.4188      3.5150     3.45108
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> +-----------+-----------+-----------+------------+-----------+
>         ebizzy records/sec, higher is better
> +-----------+-----------+-----------+------------+-----------+
>          base     stddev     patched      stddev    %improve
> +-----------+-----------+-----------+------------+-----------+
> 1x  2374.1250    50.9718   3816.2500     54.0681    60.74343
> 2x  2536.2500    93.0403   2789.3750    204.7897     9.98029
> +-----------+-----------+-----------+------------+-----------+
> 
> 
> Below is the patch which combines PeterZ's suggestions on your
> original approach with the cfs.next check (already posted by Srikar in
> the other thread)

I did get a chance to run with the below patch and your changes in
kvm.git, but the results were not too different:

Dbench, 10 x 16-way VMs on 80-way host (throughput in MB/sec):

kvm_vcpu_on_spin changes:  4636 +/- 15.74%
yield_to changes:          4515 +/- 12.73%
both changes from above:   4074 +/- 19.12%
...plus cfs.next check:    4418 +/- 16.97%

Still hovering around 4500 MB/sec.
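
For a rough sense of scale, the sketch below expresses each result above
as a change relative to the kvm_vcpu_on_spin-only run.  The helper names
are made up for illustration; the inputs are just the mean throughput
figures from the table.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: change of `value` relative to `base`, in percent. */
double pct_change(double base, double value)
{
	assert(base != 0.0);
	return (value - base) / base * 100.0;
}

/* Print each Dbench mean relative to the kvm_vcpu_on_spin-only run. */
void print_dbench_deltas(void)
{
	const double base = 4636.0;	/* kvm_vcpu_on_spin changes, MB/sec */
	const struct {
		const char *name;
		double mbps;
	} runs[] = {
		{ "yield_to changes",        4515.0 },
		{ "both changes from above", 4074.0 },
		{ "...plus cfs.next check",  4418.0 },
	};
	size_t i;

	for (i = 0; i < sizeof(runs) / sizeof(runs[0]); i++)
		printf("%-26s %+7.2f%%\n", runs[i].name,
		       pct_change(base, runs[i].mbps));
}
```

Calling print_dbench_deltas() gives about -2.6%, -12.1% and -4.7%, each
smaller in magnitude than the +/- variation reported for these runs.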

The concern I have is that even though we have gone through changes to
help reduce the candidate vcpus we yield to, we still have a very poor
idea of which vcpu really needs to run.  The result is high cpu usage in
get_pid_task() and still some contention in the double runqueue lock.
To make this scalable, we either need to significantly reduce the
occurrence of lock-holder preemption, or do a much better job of
knowing which vcpu needs to run (and not unnecessarily yield to vcpus
which do not need to run).

On reducing the occurrence:  The worst case for lock-holder preemption
is having vcpus of the same VM on the same runqueue.  This guarantees the
situation of 1 vcpu running while another [of the same VM] is not.  To
prove the point, I ran the same test, but with vcpus restricted to a
range of host cpus, such that any single VM's vcpus can never be on the
same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
vcpu-1's are on host cpus 5-9, and so on.  Here is the result:

kvm_vcpu_on_spin, and all
yield_to changes, plus
restricted vcpu placement:  8823 +/- 3.20%   much, much better
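
The placement rule can be written down as a small calculation.  This is
only a sketch of the arithmetic, assuming 80 host cpus and 16 vcpus per
VM as in this test; the struct and function names are invented for
illustration, and how the affinity was actually applied is not shown.

```c
#include <assert.h>

#define HOST_CPUS	80	/* host cpu threads in this test */
#define VCPUS_PER_VM	16	/* 16-way guests */

struct cpu_range {
	int first;
	int last;
};

/* Hypothetical helper: host cpus that vcpu `idx` of every VM may run on. */
struct cpu_range vcpu_cpu_range(int idx)
{
	const int width = HOST_CPUS / VCPUS_PER_VM;	/* 5 cpus per vcpu index */
	struct cpu_range r;

	assert(idx >= 0 && idx < VCPUS_PER_VM);
	r.first = idx * width;
	r.last = r.first + width - 1;
	return r;
}

/* Two vcpus of one VM can share a runqueue only if their ranges overlap. */
int ranges_overlap(struct cpu_range a, struct cpu_range b)
{
	return a.first <= b.last && b.first <= a.last;
}
```

With these numbers vcpu-0 maps to cpus 0-4 and vcpu-1 to cpus 5-9, and
no two vcpu indexes get overlapping ranges, so two vcpus of one VM never
land on the same runqueue.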

On picking a better vcpu to yield to:  I really hesitate to rely on a
paravirt hint [telling us which vcpu is holding a lock], but I am not
sure how else to reduce the candidate vcpus to yield to.  I suspect we
are yielding to way more vcpus than are preempted lock-holders, and that
IMO is just work accomplishing nothing.  Trying to think of a way to
further reduce candidate vcpus....


-Andrew


> 
> ----8<----
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fbf1fd0..8551f57 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4820,6 +4820,24 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
> 
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p,
> +					 struct rq *p_rq)
> +{
> +	if (!curr->sched_class->yield_to_task)
> +		return false;
> +
> +	if (curr->sched_class != p->sched_class)
> +		return false;
> +
> +	if (task_running(p_rq, p) || p->state)
> +		return false;
> +
> +	return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4844,20 +4862,24 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> 
>  again:
>  	p_rq = task_rq(p);
> +
> +	/* optimistic test to avoid taking locks */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_irq;
> +
> +	/* if next buddy is set, assume yield is in progress */
> +	if (p_rq->cfs.next)
> +		goto out_irq;
> +
>  	double_rq_lock(rq, p_rq);
>  	while (task_rq(p) != p_rq) {
>  		double_rq_unlock(rq, p_rq);
>  		goto again;
>  	}
> 
> -	if (!curr->sched_class->yield_to_task)
> -		goto out;
> -
> -	if (curr->sched_class != p->sched_class)
> -		goto out;
> -
> -	if (task_running(p_rq, p) || p->state)
> -		goto out;
> +	/* validate state, holding p_rq ensures p's state cannot change */
> +	if (!__yield_to_candidate(curr, p, p_rq))
> +		goto out_unlock;
> 
>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>  	if (yielded) {
> @@ -4877,8 +4899,9 @@ again:
>  		rq->skip_clock_update = 0;
>  	}
> 
> -out:
> +out_unlock:
>  	double_rq_unlock(rq, p_rq);
> +out_irq:
>  	local_irq_restore(flags);
> 
>  	if (yielded)
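
For readers less familiar with the pattern in the hunk above (cheap
unlocked test first, then take the lock and validate again), here is a
user-space analogy.  It is not kernel code: struct target, its fields
and try_boost() are invented stand-ins, with a pthread mutex in place of
double_rq_lock().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct target {
	pthread_mutex_t lock;
	atomic_bool runnable;	/* stands in for the __yield_to_candidate() checks */
	atomic_bool boosted;	/* stands in for p_rq->cfs.next being set */
};

/* Returns true if the "yield" was performed. */
bool try_boost(struct target *t)
{
	bool done = false;

	assert(t != NULL);

	/* Optimistic, unlocked tests: possibly stale, used only to bail out early. */
	if (!atomic_load(&t->runnable))
		return false;
	if (atomic_load(&t->boosted))
		return false;

	pthread_mutex_lock(&t->lock);
	/* Validate again under the lock before acting, as the patch does. */
	if (atomic_load(&t->runnable)) {
		atomic_store(&t->boosted, true);
		done = true;
	}
	pthread_mutex_unlock(&t->lock);

	return done;
}
```

As in the patch, only the candidate check is repeated under the lock;
the "already boosted" test is a pure early-exit heuristic.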

