linux-kernel - Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently from load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Fri, 5 May 2017 14:18:54 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>, kernel-team@...com
Subject: Re: [PATCH 3/3] sched/fair: Propagate runnable_load_avg independently
 from load_avg

On 5 May 2017 at 12:42, Vincent Guittot <vincent.guittot@...aro.org> wrote:
> On 4 May 2017 at 22:30, Tejun Heo <tj@...nel.org> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration.  This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey.  There's always
>> some variable management load going on.  The command used is
>>
>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>> representative result when running from the root cgroup.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 26
>>          75.0000th: 62
>>          90.0000th: 74
>>          95.0000th: 86
>>          *99.0000th: 887
>>          99.5000th: 3692
>>          99.9000th: 10832
>>          min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 31
>>          75.0000th: 65
>>          90.0000th: 71
>>          95.0000th: 91
>>          *99.0000th: 7288
>>          99.5000th: 10352
>>          99.9000th: 12496
>>          min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency.  After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group.  A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all active sched_entities.  Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum.  When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nested cfs_rq blocks this
>> immediate propagation.  When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent.  It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level.  This makes the runnable_load_avg at the root queue
>> incorrectly include blocked load_avgs of tasks queued in nested
>> cfs_rqs causing the load balancer to make the wrong choices.
>>
>> This patch fixes the bug by propagating nested cfs_rq's
>> runnable_load_avg independently from load_avg.  Tasks still contribute
>> to its cfs_rq's runnable_load_avg the same way; however, a nested
>> cfs_rq directly propagates the scaled runnable_load_avg to the
>> matching group sched_entity's avg.runnable_load_avg and keeps the se's
>> and parent cfs_rq's runnable_load_avg in sync.
>>
>> This ensures that, for any given cfs_rq, its runnable_load_avg is the
>> sum of the scaled load_avgs of all and only active tasks queued on it
>> and its descendants.  This allows the load balancer to operate on the
>> same information whether there are nested cfs_rqs or not.
>>
>> With the patch applied, the p99 latency from inside a cgroup is
>> equivalent to the root cgroup case.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 40
>>          75.0000th: 71
>>          90.0000th: 89
>>          95.0000th: 108
>>          *99.0000th: 679
>>          99.5000th: 3500
>>          99.9000th: 10960
>>          min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <tj@...nel.org>
>> Cc: Vincent Guittot <vincent.guittot@...aro.org>
>> Cc: Ingo Molnar <mingo@...hat.com>
>> Cc: Peter Zijlstra <peterz@...radead.org>
>> Cc: Mike Galbraith <efault@....de>
>> Cc: Paul Turner <pjt@...gle.com>
>> Cc: Chris Mason <clm@...com>
>> ---
>>  kernel/sched/fair.c |   55 ++++++++++++++++++++++++++++++++++------------------
>>  1 file changed, 37 insertions(+), 18 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3098,6 +3098,30 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>         cfs_rq->avg.util_sum = cfs_rq->avg.util_avg * LOAD_AVG_MAX;
>>  }
>>
>> +static inline void
>> +update_tg_cfs_runnable(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +{
>> +       struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>> +       long load, delta;
>> +
>> +       load = scale_load_down(calc_cfs_shares(gcfs_rq, gcfs_rq->tg,
>> +                                              shares_runnable));
>> +       delta = load - se->avg.runnable_load_avg;
>> +
>> +       /* Nothing to update */
>> +       if (!delta)
>> +               return;
>> +
>> +       /* Set new sched_entity's load */
>> +       se->avg.runnable_load_avg = load;
>> +       se->avg.runnable_load_sum = se->avg.runnable_load_avg * LOAD_AVG_MAX;
>> +
>> +       /* Update parent cfs_rq load */
>> +       add_positive(&cfs_rq->avg.runnable_load_avg, delta);
>> +       cfs_rq->avg.runnable_load_sum =
>> +               cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
>> +}
>> +
>>  /* Take into account change of load of a child task group */
>>  static inline void
>>  update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> @@ -3120,17 +3144,6 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>         /* Update parent cfs_rq load */
>>         add_positive(&cfs_rq->avg.load_avg, delta);
>>         cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> -
>> -       /*
>> -        * If the sched_entity is already enqueued, we also have to update the
>> -        * runnable load avg.
>> -        */
>> -       if (se->on_rq) {
>> -               /* Update parent cfs_rq runnable_load_avg */
>> -               add_positive(&cfs_rq->avg.runnable_load_avg, delta);
>> -               cfs_rq->avg.runnable_load_sum =
>> -                       cfs_rq->avg.runnable_load_avg * LOAD_AVG_MAX;
>> -       }
>>  }
>>
>>  static inline void set_tg_cfs_propagate(struct cfs_rq *cfs_rq)
>> @@ -3152,16 +3165,16 @@ static inline int test_and_clear_tg_cfs_
>>  /* Update task and its cfs_rq load average */
>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>  {
>> -       struct cfs_rq *cfs_rq;
>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>
>>         if (entity_is_task(se))
>>                 return 0;
>>
>> +       update_tg_cfs_runnable(cfs_rq, se);
>> +
>>         if (!test_and_clear_tg_cfs_propagate(se))
>>                 return 0;
>>
>> -       cfs_rq = cfs_rq_of(se);
>> -
>>         set_tg_cfs_propagate(cfs_rq);
>>
>>         update_tg_cfs_util(cfs_rq, se);
>> @@ -3298,7 +3311,7 @@ static inline void update_load_avg(struc
>>         if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD)) {
>>                 __update_load_avg(now, cpu, &se->avg,
>>                           se->on_rq * scale_load_down(se->load.weight),
>> -                         cfs_rq->curr == se, false);
>> +                         cfs_rq->curr == se, !entity_is_task(se));
>>         }
>>
>>         decayed  = update_cfs_rq_load_avg(now, cfs_rq, true);
>> @@ -3354,8 +3367,10 @@ enqueue_entity_load_avg(struct cfs_rq *c
>>  {
>>         struct sched_avg *sa = &se->avg;
>>
>> -       cfs_rq->avg.runnable_load_avg += sa->load_avg;
>> -       cfs_rq->avg.runnable_load_sum += sa->load_sum;
>> +       if (entity_is_task(se)) {
>
> Why don't you add the runnable_load_avg of a group_entity that is enqueued ?

ok, i forgot that you propagate runnable_load_avg in entity now. But
this seems really weird and adds more  exceptions to the normal
behavior of load tracking

>
>> +               cfs_rq->avg.runnable_load_avg += sa->load_avg;
>> +               cfs_rq->avg.runnable_load_sum += sa->load_sum;
>> +       }
>>
>>         if (!sa->last_update_time) {
>>                 attach_entity_load_avg(cfs_rq, se);
>> @@ -3369,6 +3384,9 @@ dequeue_entity_load_avg(struct cfs_rq *c
>>  {
>>         struct sched_avg *sa = &se->avg;
>>
>> +       if (!entity_is_task(se))
>> +               return;
>
> Same as enqueue, you have to remove the runnable_load_avg of a
> group_entity that is dequeued
>
>> +
>>         cfs_rq->avg.runnable_load_avg =
>>                 max_t(long, cfs_rq->avg.runnable_load_avg - sa->load_avg, 0);
>>         cfs_rq->avg.runnable_load_sum =
>> @@ -3406,7 +3424,8 @@ void sync_entity_load_avg(struct sched_e
>>         u64 last_update_time;
>>
>>         last_update_time = cfs_rq_last_update_time(cfs_rq);
>> -       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, false);
>> +       __update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)),
>> +                         &se->avg, 0, 0, !entity_is_task(se));
>>  }
>>
>>  /*