linux-kernel - Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Mon, 17 Sep 2012 22:03:18 -0500
From:	Andrew Theurer <habanero@...ux.vnet.ibm.com>
To:	Avi Kivity <avi@...hat.com>
Cc:	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
	KVM <kvm@...r.kernel.org>, chegu vinod <chegu_vinod@...com>,
	LKML <linux-kernel@...r.kernel.org>, X86 <x86@...nel.org>,
	Gleb Natapov <gleb@...hat.com>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>
Subject: Re: [RFC][PATCH] Improving directed yield scalability for PLE
 handler

On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to way more vcpus than are prempted lock-holders, and that
> > IMO is just work accomplishing nothing.  Trying to think of way to
> > further reduce candidate vcpus....
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.

OK, yes, if the other thread gets useful work done, then it is not
wasteful.  I was thinking of the worst case scenario, where any other
vcpu would likely spin as well, and the host side cpu-time for switching
vcpu threads was not all that productive.  Well, I suppose it does help
eliminate potential lock holding vcpus; it just seems to be not that
efficient or fast enough.

> The main problem with this theory is that the experiments don't seem to
> bear it out.

Granted, my test case is quite brutal.  It's nothing but over-committed
VMs which always have some spin lock activity.  However, we really
should try to fix the worst case scenario.

>   So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.

Well, we can certainly experiment and see what we get.

IMO, the key to getting this working really well on the large VMs is
finding the lock-holding cpu -quickly-.  What I think is happening is
that we go through a relatively long process to get to that one right
vcpu.  I guess I need to find a faster way to get there.

> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?

Not sure.

-Andrew






--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/