linux-kernel - Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20140611082433.GH3213@twins.programming.kicks-ass.net>
Date:	Wed, 11 Jun 2014 10:24:33 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Michael wang <wangyun@...ux.vnet.ibm.com>
Cc:	Mike Galbraith <umgwanakikbuti@...il.com>,
	Rik van Riel <riel@...hat.com>,
	LKML <linux-kernel@...r.kernel.org>,
	Ingo Molnar <mingo@...nel.org>, Alex Shi <alex.shi@...aro.org>,
	Paul Turner <pjt@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Daniel Lezcano <daniel.lezcano@...aro.org>
Subject: Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
> 
> Thanks for the reply :)
> 
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine for sure pull tasks together for workload like dbench, what make
> >> it difference when put dbench into a group one level deeper is the
> >> load-balance, which happened less.
> > 
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
> 
> IMHO, when we put tasks one group deeper, in other word the totally
> weight of these tasks is 1024 (prev is 3072), the load become more
> balancing in root, which make bl-routine consider the system is
> balanced, which make we migrate less in lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy-system, all the two cases have
> the same workload running, the only difference is that we put the same
> workload (dbench + stress) one group deeper, it's like:
> 
> Good case:
> 		root
> 		l1-A	l1-B	l1-C
> 		dbench	stress	stress
> 
> 	results:
> 		dbench got around 300%
> 		each stress got around 450%
> 
> Bad case:
> 		root
> 		l1
> 		l2-A	l2-B	l2-C
> 		dbench	stress	stress
> 
> 	results:
> 		dbench got around 100% (throughout dropped too)
> 		each stress got around 550%
> 
> Although the l1-group gain the same resources (1200%), it doesn't assign
> to l2-ABC correctly like the root-group did.

But in this case select_idle_sibling() should function identially, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> > 
> >> However, in our cases the load balance could not help on that, since deeper
> >> the group is, less the load effect it means to root group.
> > 
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
> 
> Exactly, however, when group is deep, the chance of it to make root
> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> imbalance and gain help from the routine, please note that although
> dbench and stress are the only workload in system, there are still other
> tasks serve for the system need to be wakeup (some very actively since
> the dbench...), compared to them, deep group load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even tasks in deep group all gathered on one CPU, the load
> >> could still balanced from the view of root group, and the tasks lost the
> >> only chances (balance) to spread when they already on the same CPU...
> > 
> > Sure, but see above.
> 
> The lb-routine could not provide enough help for deep group, since the
> imbalance happened inside the group could not cause imbalance in root,
> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
> easily ignored, but inside the l2-group, the gathered case could already
> means imbalance like (1024 * 5) : 1024.

your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_siblings() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
> 
> We only choose the timing when no idle cpu located, and flips is
> somewhat high, also the group is deep.

-enotmakingsense

> In such cases, select_idle_siblings() doesn't works anyway, it return
> the target even it is very busy, we just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
> 
> That's true and that's why we currently still need to shut down the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assign the resource
> according to the share, it works well in l1-groups, so we expect it to
> work the same well in l2-groups...

Sure, but explain why it isn't? So far you're just saying words that
don't compute.

Content of type "application/pgp-signature" skipped