Message-ID: <20100419140652.GA26853@Krystal>
Date: Mon, 19 Apr 2010 10:06:52 -0400
From: Mathieu Desnoyers <mathieu.desnoyers@...icios.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...e.hu>, Mike Galbraith <efault@....de>,
Andrew Morton <akpm@...ux-foundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Greg Kroah-Hartman <greg@...ah.com>,
Steven Rostedt <rostedt@...dmis.org>,
Jarkko Nikula <jhnikula@...il.com>,
Tony Lindgren <tony@...mide.com>, linux-kernel@...r.kernel.org
Subject: Re: [RFC patch] CFS fix place entity spread issue (v2)
* Peter Zijlstra (peterz@...radead.org) wrote:
> On Sun, 2010-04-18 at 09:13 -0400, Mathieu Desnoyers wrote:
>
> > Detailed explanation in my ELC2010 presentation at:
> >
> > http://www.efficios.com/elc2010
>
> So what happened to comprehensive changelogs?
>
> > I've torn the CFS scheduler apart in the past days to figure out what is causing
> > this weird behavior, and the culprit seems to be place_entity(). The problem
> > appears to be the cumulative effect of letting the min_vruntime go backward when
> > putting sleepers back on the runqueue.
>
> Right, that should not happen, ever, min_vruntime should be monotonic.
> Your explanation fails to mention how exactly this happens.
>
> > It lets the vruntime spread grow to
> > "entertaining" values (it is supposed to be in the 5ms range, not 18 minutes!).
> >
> > In the original code, a max between the sched entity vruntime and the calculated
> > vruntime was supposed to "ensure that the thread time never goes backward". But I
> > don't see why we even care about that.
>
> Fairness, a task's vruntime going backwards means it can obtain more
> cputime than a while(1) loop would get, not a good thing.
>
> > The key point is that the min_vruntime
> > of the runqueue should not go backward.
>
> That is certainly important too.
>
> > I propose to fix this by calculating the relative offset from
> > min_vruntime + sysctl_sched_latency rather than directly from min_vruntime. I
> > also ensure that the value never goes below min_vruntime.
>
> I'm not sure that's a good idea, that'll end up being a 'fairer'
> scheduler in the strict sense, but I suspect it will break the wakeup
> preemption model we have.
>
> > Under the Xorg workload, moving a few windows around and starting firefox while
> > executing the wakeup-latency.c program (program waking up every 10ms and
> > reporting wakeup latency), this patch brings the worst latency down from 60ms to
> > 12ms. Even doing a kernel compilation at the same time, the worst latency stays
> > around 20ms now.
>
> That seems nice enough. But really, this changelog doesn't explain the
> cause at all.
>
> /me goes try and figure out where this min_vruntime goes backwards.
>
> So update_min_vruntime() is the sole place where we change min_vruntime,
> it is called from update_curr() where we add runtime to the task we're
> currently servicing, and dequeue_entity() where we remove someone from
> the tree.
>
> These two sites are the only places where min_vruntime can change
> because only adding service to the leftmost task, or removal thereof can
> make the min_vruntime go forward.
>
> So update_min_vruntime():
>
> - takes the smaller vruntime between the current task and the leftmost
> tree entry (current is not in the tree, and we can pass beyond the
> leftmost waiting task due to limited preemption and minimum service
> windows).
>
> - set min_vruntime to the larger between the existing min_vruntime and
> the smallest vruntime.
>
> So colour me confused if I'm not seeing min_vruntime going backwards...
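
For reference, the update_min_vruntime() logic you describe boils down to
something like this standalone model (simplified from kernel/sched_fair.c:
the rbtree is reduced to a plain "leftmost" pointer and the wraparound-safe
comparisons are omitted):

#include <stdint.h>

struct sched_entity {
	uint64_t vruntime;
};

struct cfs_rq {
	uint64_t min_vruntime;
	struct sched_entity *curr;      /* currently running, not in tree */
	struct sched_entity *leftmost;  /* smallest vruntime in the tree */
};

static uint64_t max_vruntime(uint64_t a, uint64_t b) { return a > b ? a : b; }
static uint64_t min_vruntime(uint64_t a, uint64_t b) { return a < b ? a : b; }

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	uint64_t vruntime = cfs_rq->min_vruntime;

	if (cfs_rq->curr)
		vruntime = cfs_rq->curr->vruntime;

	if (cfs_rq->leftmost) {
		if (!cfs_rq->curr)
			vruntime = cfs_rq->leftmost->vruntime;
		else
			vruntime = min_vruntime(vruntime,
						cfs_rq->leftmost->vruntime);
	}

	/* monotonic: never let min_vruntime itself move backward */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
}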
The piece that upsets this picture is place_entity(). When placing a sleeper on
the runqueue, it starts from min_vruntime and decrements the entity's vruntime
by "thresh" (derived from the load calculation). So if the woken-up task does
not consume enough vruntime to pass the previous min_vruntime value, we
effectively decrement min_vruntime.
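
A standalone sketch of that placement path (reduced from place_entity():
initial placement and the halving of thresh under GENTLE_FAIR_SLEEPERS are
left out):

#include <stdint.h>

static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* Sleeper path of place_entity(), reduced to its core: the entity is
 * placed 'thresh' before min_vruntime as a wakeup credit. */
static uint64_t place_sleeper(uint64_t cfs_min_vruntime,
			      uint64_t se_vruntime, uint64_t thresh)
{
	uint64_t vruntime = cfs_min_vruntime - thresh;

	/* keeps the entity's own vruntime from jumping backward, but it
	 * does not keep the result above cfs_min_vruntime: a task that
	 * slept long enough lands at min_vruntime - thresh */
	return max_u64(se_vruntime, vruntime);
}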
The other thing I am currently looking into is that, even without considering
the question of going lower than min_vruntime, a woken-up sleeper gets an extra
share of vruntime (again because of place_entity). It might therefore always run
between min_vruntime - thresh and the previous min_vruntime, "stalling"
min_vruntime at a low value while other threads push the maximum vruntime
higher, which increases the spread. I am currently running tests that try to
cope with this by consuming a woken-up sleeper's vruntime more quickly for a
vruntime duration of "thresh".
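
A hypothetical sketch of that experiment (the sleeper_debt_end field and the
doubling factor are made up for illustration; this is not the actual patch I
am testing):

#include <stdint.h>

struct sched_entity_model {
	uint64_t vruntime;
	uint64_t sleeper_debt_end; /* hypothetical: placement vruntime + thresh */
};

/* Charge vruntime at double rate until the woken sleeper has advanced
 * past its placement bonus window, so it cannot linger below the
 * previous min_vruntime and stall it there. */
static void account_vruntime(struct sched_entity_model *se, uint64_t delta)
{
	if (se->vruntime < se->sleeper_debt_end)
		delta *= 2;
	se->vruntime += delta;
}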
I am also testing alternative ways to deal with the min_vruntime-going-backward
problem, such as keeping a "floor_vruntime" that ensures place_entity() can
never decrement min_vruntime below the floor value left by its previous
decrement.
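
A hypothetical sketch of the floor_vruntime idea (field and helper names made
up, untested):

#include <stdint.h>

struct cfs_rq_model {
	uint64_t min_vruntime;
	uint64_t floor_vruntime; /* lowest placement handed out so far */
};

/* Clamp the sleeper placement so it can never go below the floor left
 * by the previous decrement; repeated wakeups then cannot ratchet the
 * effective minimum further backward. */
static uint64_t place_sleeper_floored(struct cfs_rq_model *cfs_rq,
				      uint64_t thresh)
{
	uint64_t vruntime = cfs_rq->min_vruntime - thresh;

	if (vruntime < cfs_rq->floor_vruntime)
		vruntime = cfs_rq->floor_vruntime;

	cfs_rq->floor_vruntime = vruntime;
	return vruntime;
}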
Sorry for the missing changelog info, I was between planes this weekend.
Hopefully my sleep-deprived explanation makes sense. ;)
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com