lists.openwall.net - Open Source and information security mailing list archives
Date: Tue, 25 Apr 2017 10:46:37 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Tejun Heo <tj@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
    linux-kernel <linux-kernel@...r.kernel.org>,
    Linus Torvalds <torvalds@...ux-foundation.org>,
    Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
    Chris Mason <clm@...com>, kernel-team@...com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

On 24 April 2017 at 22:14, Tejun Heo <tj@...nel.org> wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration.  This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single socket, 16 cores, 32 threads machine.
> While the machine is mostly idle, it isn't completely.  There's always
> some variable management load going on.  The command used is
>
>   schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads, each of which
> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
> representative result when running from the root cgroup.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 62
>           90.0000th: 74
>           95.0000th: 86
>           *99.0000th: 887
>           99.5000th: 3692
>           99.9000th: 10832
>           min=0, max=13374
>
> The following is inside a first level CPU cgroup with the maximum
> weight.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 31
>           75.0000th: 65
>           90.0000th: 71
>           95.0000th: 91
>           *99.0000th: 7288
>           99.5000th: 10352
>           99.9000th: 12496
>           min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group.
> A CPU which has one schbench running and
> another queued wouldn't report the correspondingly higher
> weighted_cpuload() and would get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
> the sum of the load_avg of all queued sched_entities.  Without cgroups
> or at the root cgroup, each task's load_avg contributes directly to
> the sum.  When a task wakes up or goes to sleep, the change is
> immediately reflected on runnable_load_avg, which in turn affects load
> balancing.
>
> However, when the CPU cgroup is in use, a nesting cfs_rq blocks this
> immediate reflection.  When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent.  It only affects the matching sched_entity's
> load_avg over time, which then gets propagated to the
> runnable_load_avg at that level.  This makes weighted_cpuload() often
> temporarily out of sync, leading to suboptimal behavior of
> load_balance() and an increase in scheduling latencies as shown above.
>
> This patch fixes the issue by updating propagate_entity_load_avg() to
> always propagate to the parent's runnable_load_avg.  Combined with the
> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
> of the scaled loads of all tasks queued below, removing the artifacts
> from nesting cfs_rqs.  The following is from inside three levels of
> nesting with the patch applied.
So you are changing the purpose of propagate_entity_load_avg(), which
aims to propagate load_avg/util_avg changes only when a task migrates,
and you also want to propagate the enqueue/dequeue into the parent
cfs_rq->runnable_load_avg.

>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 40
>           75.0000th: 71
>           90.0000th: 89
>           95.0000th: 108
>           *99.0000th: 679
>           99.5000th: 3500
>           99.9000th: 10960
>           min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Mike Galbraith <efault@....de>
> Cc: Paul Turner <pjt@...gle.com>
> Cc: Chris Mason <clm@...com>
> ---
>  kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>
>  /* Take into account change of load of a child task group */
>  static inline void
> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +		   bool propagate_avg)
>  {
>  	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>  	long load = 0, delta;
> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>  	se->avg.load_avg = load;
>  	se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>
> -	/* Update parent cfs_rq load */
> -	add_positive(&cfs_rq->avg.load_avg, delta);
> -	cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +	if (propagate_avg) {
> +		/* Update parent cfs_rq load */
> +		add_positive(&cfs_rq->avg.load_avg, delta);
> +		cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +	}
>
>  	/*
>  	 * If the sched_entity is already enqueued, we also have to update the
> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -	struct cfs_rq *cfs_rq;
> +	struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +	bool propagate_avg;
>
>  	if (entity_is_task(se))
>  		return 0;
>
> -	if (!test_and_clear_tg_cfs_propagate(se))
> -		return 0;
> -
> -	cfs_rq = cfs_rq_of(se);
> +	propagate_avg = test_and_clear_tg_cfs_propagate(se);
>
> -	set_tg_cfs_propagate(cfs_rq);
> +	/*
> +	 * We want to keep @cfs_rq->runnable_load_avg always in sync so
> +	 * that the load balancer can accurately determine the busiest CPU
> +	 * regardless of cfs_rq nesting.
> +	 */
> +	update_tg_cfs_load(cfs_rq, se, propagate_avg);
>
> -	update_tg_cfs_util(cfs_rq, se);
> -	update_tg_cfs_load(cfs_rq, se);
> +	if (propagate_avg) {
> +		set_tg_cfs_propagate(cfs_rq);
> +		update_tg_cfs_util(cfs_rq, se);
> +	}
>
> -	return 1;
> +	return propagate_avg;
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */