Message-ID: <20120926132013.GB7633@turtle.usersys.redhat.com>
Date:	Wed, 26 Sep 2012 15:20:14 +0200
From:	Andrew Jones <drjones@...hat.com>
To:	Avi Kivity <avi@...hat.com>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Raghavendra K T <raghavendra.kt@...ux.vnet.ibm.com>,
	"H. Peter Anvin" <hpa@...or.com>,
	Marcelo Tosatti <mtosatti@...hat.com>,
	Ingo Molnar <mingo@...hat.com>, Rik van Riel <riel@...hat.com>,
	Srikar <srikar@...ux.vnet.ibm.com>,
	"Nikunj A. Dadhania" <nikunj@...ux.vnet.ibm.com>,
	KVM <kvm@...r.kernel.org>, Jiannan Ouyang <ouyang@...pitt.edu>,
	chegu vinod <chegu_vinod@...com>,
	"Andrew M. Theurer" <habanero@...ux.vnet.ibm.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Srivatsa Vaddagiri <srivatsa.vaddagiri@...il.com>,
	Gleb Natapov <gleb@...hat.com>
Subject: Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios
 in PLE handler

On Mon, Sep 24, 2012 at 06:20:12PM +0200, Avi Kivity wrote:
> On 09/24/2012 06:03 PM, Peter Zijlstra wrote:
> > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote:
> >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote:
> >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:
> >> >> However, Rik had a genuine concern about cases where load is not
> >> >> equally distributed across runqueues and the lock holder might actually
> >> >> be on a different runqueue but not running.
> >> > 
> >> > Load should eventually get distributed equally -- that's what the
> >> > load-balancer is for -- so this is a temporary situation.
> >> 
> >> What's the expected latency?  This is the whole problem.  Eventually the
> >> scheduler would pick the lock holder as well; the problem is that it's
> >> in the millisecond scale while lock hold times are in the microsecond
> >> scale, leading to a 1000x slowdown.
> > 
> > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes
> > something like accurate or fast, never both.
> > 
> >> If we want to yield, we really want to boost someone.
> > 
> > Now if only you knew which someone ;-) This non-modified guest nonsense
> > is such a snake pit.. but you know how I feel about all that.
> 
> Actually, if in addition to boosting someone I also unboosted myself
> enough to be preempted, it wouldn't matter.  While boosting the
> lock holder is good, the main point is to stop spinning and do useful
> work instead.  We can detect spinners and avoid boosting them.
> 
> That's the motivation for the "donate vruntime" approach I wanted earlier.
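
(Aside: a rough sketch of what "boost someone who isn't spinning" could
look like in the PLE path, in the spirit of the existing
kvm_vcpu_on_spin()/kvm_vcpu_yield_to() directed yield. This is
illustration only, not the actual patch, and vcpu_is_spinning() is a
purely hypothetical per-vcpu hint:)

static void ple_boost_nonspinner(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me)
			continue;
		/* don't boost fellow spinners (hypothetical hint) */
		if (vcpu_is_spinning(vcpu))
			continue;
		/* directed yield to the likely lock holder */
		if (kvm_vcpu_yield_to(vcpu))
			break;
	}
}

The other half of the point above -- the spinner unboosting itself enough
to actually get preempted -- is what the "donate vruntime" idea is about.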

I'll probably get shot for the suggestion, but doesn't this problem merit
another scheduler class? We want FIFO order for a special class of tasks,
"spinners". Wouldn't a clean solution be to promote a task's scheduler
class to the spinner class when we take a PLE exit (or when it enters from
some special syscall for userspace spinlocks)? That class would be higher
priority than the fair class and would schedule in FIFO order, but it
would only run its tasks for short periods before switching. Also, after
each task has run, its scheduler class would get reset down to its
original class (fair). At least at first thought this looks cleaner to me
than the next and skip hinting, plus it helps guarantee that the lock
holder gets scheduled before the tasks waiting on that lock.
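
To make that a bit more concrete, here is a toy user-space model of the
promote/pick/demote flow (not kernel code -- every name below is made up,
though pick_next_task()/put_prev_task() deliberately echo the sched_class
hook names):

#include <stdio.h>

#define NTASKS 4

enum task_class { CLASS_FAIR, CLASS_SPINNER };

struct task {
	int id;
	enum task_class class;
	int vruntime;			/* stand-in for CFS vruntime */
};

static struct task tasks[NTASKS];
static int spinner_fifo[NTASKS];	/* ids of promoted tasks, FIFO order */
static int spinner_head, spinner_tail;

/* Promotion point: a vcpu PLE-exits (or a userspace-spinlock syscall). */
static void promote_to_spinner(struct task *t)
{
	if (t->class == CLASS_SPINNER)
		return;
	t->class = CLASS_SPINNER;
	spinner_fifo[spinner_tail++ % NTASKS] = t->id;
}

/* The spinner class beats fair; within it, strict FIFO order. */
static struct task *pick_next_task(void)
{
	struct task *best;
	int i;

	if (spinner_head != spinner_tail)
		return &tasks[spinner_fifo[spinner_head++ % NTASKS]];

	best = &tasks[0];		/* fair fallback: lowest vruntime */
	for (i = 1; i < NTASKS; i++)
		if (tasks[i].vruntime < best->vruntime)
			best = &tasks[i];
	return best;
}

/* After one short slice the task drops back to its original class. */
static void put_prev_task(struct task *t, int ran)
{
	t->vruntime += ran;
	t->class = CLASS_FAIR;
}

int main(void)
{
	struct task *t;
	int i;

	for (i = 0; i < NTASKS; i++)
		tasks[i] = (struct task){ .id = i, .vruntime = 10 * i };

	promote_to_spinner(&tasks[3]);	/* task 3 "PLE-exited" */

	t = pick_next_task();		/* picks 3 despite its high vruntime */
	printf("picked %d (spinner=%d)\n", t->id, t->class == CLASS_SPINNER);
	put_prev_task(t, 1);

	t = pick_next_task();		/* spinner queue empty: fair picks 0 */
	printf("picked %d (spinner=%d)\n", t->id, t->class == CLASS_SPINNER);
	return 0;
}

The unconditional demotion in put_prev_task() is what keeps a spinner from
monopolizing the cpu: it gets exactly one short slice at elevated priority
and then falls back to fair.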

Drew

> 
> > 
> >> > We already try and favour the non-running vcpu in this case, that's what
> >> > yield_to_task_fair() is about. If it's still not eligible to run, tough
> >> > luck.
> >> 
> >> Crazy idea: instead of yielding, just run that other vcpu in the thread
> >> that would otherwise spin.  I can see about a million objections to this
> >> already though.
> > 
> > Yah.. you want me to list a few? :-) It would require synchronization
> > with the other cpu to pull its task -- one really wants to avoid that cpu
> > also running it.
> 
> Yeah, it's quite a horrible idea.
> 
> > 
> > Do this at a high enough frequency and you're dead too.
> > 
> > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state
> > associated with a vcpu thread and use the preemption notifiers to sort
> > things against the scheduler or somesuch.
> 
> That's what I thought when I wrote this, but I can't: I might be
> preempted in random kvm code.  So my state includes the host stack and
> registers.  Maybe we can special-case when we interrupt guest mode.
> 
> > 
> >> >> Do you think, instead of using rq->nr_running, we could get a global
> >> >> sense of load using avenrun (something like avenrun/num_online_cpus())?
> >> > 
> >> > To what purpose? Also, global stuff is expensive, so you should try and
> >> > stay away from it as hard as you possibly can.
> >> 
> >> Spinning is also expensive.  How about we do the global stuff every N
> >> times, to amortize the cost (and reduce contention)?
> > 
> > Nah, spinning isn't expensive, it's a waste of time; similar end result
> > for someone who wants to do useful work, but not the same cause.
> > 
> > Pick N and I'll come up with a scenario for which it's wrong ;-)
> 
> Sure.  But if it's rare enough, then that's okay for us.
> 
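
(Re the "do the global stuff every N times" amortization above, a rough
sketch: get_avenrun(), num_online_cpus() and FSHIFT are real kernel
symbols, while the ple_exit_count field and the interval of 16 are made
up for illustration:)

#define PLE_LOAD_SAMPLE_INTERVAL	16	/* the "N" in question */

static bool host_looks_overcommitted(struct kvm_vcpu *vcpu)
{
	static unsigned long load1;	/* cached 1-min loadavg, FIXED_1 units */

	/* Amortize the global read: refresh only every N-th PLE exit. */
	if (++vcpu->ple_exit_count % PLE_LOAD_SAMPLE_INTERVAL == 0) {
		unsigned long loads[3];

		get_avenrun(loads, 0, 0);
		load1 = loads[0];
	}

	/* loadavg above the number of online cpus => likely overcommitted */
	return load1 > ((unsigned long)num_online_cpus() << FSHIFT);
}

As noted above, any fixed N will have a workload that defeats it; the bet
is that such cases are rare enough not to matter.
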
> > Anyway, it's an ugly problem and one I really want to contain inside the
> > insanity that created it (virt); let's not taint the rest of the kernel
> > more than we need to.
> 
> Agreed.  Though given that postgres and others use userspace spinlocks,
> maybe it's not just virt.
> 
> -- 
> error compiling committee.c: too many arguments to function
