Date:   Wed, 10 May 2017 08:50:14 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Tejun Heo <tj@...nel.org>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Linus Torvalds <torvalds@...ux-foundation.org>,
        Mike Galbraith <efault@....de>, Paul Turner <pjt@...gle.com>,
        Chris Mason <clm@...com>, kernel-team@...com
Subject: Re: [PATCH v2 for-4.12-fixes 2/2] sched/fair: Fix O(# total cgroups)
 in load balance path

Hi Tejun,

On 9 May 2017 at 18:18, Tejun Heo <tj@...nel.org> wrote:
> Currently, rq->leaf_cfs_rq_list is a traversal-ordered list of all
> live cfs_rqs that have ever been active on the CPU; unfortunately,
> this makes update_blocked_averages() O(# total cgroups), which isn't
> scalable at all.

Dietmar raised a similar optimization in the past. The only question
was: what is the impact of re-adding the cfs_rq to leaf_cfs_rq_list on
the wake-up path? Have you done any measurements?
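
For a rough feel for what is being measured here, below is a self-contained
userspace sketch of the structure under discussion. It is toy code, not the
kernel implementation; every name in it (toy_cfs_rq, toy_rq,
toy_update_blocked_averages, the prune flag) is made up for illustration. It
only shows why a walk over every cfs_rq that ever ran on a CPU scales with
the total number of cgroups, and how pruning fully decayed entries keeps
later walks proportional to the entries that still carry load:

/* Toy model, not kernel code: a per-CPU list of per-cgroup runqueues. */
#include <stdio.h>
#include <stdlib.h>

struct toy_cfs_rq {
	unsigned long	   load_sum;	/* stand-in for the decaying averages */
	struct toy_cfs_rq *next, *prev;
};

struct toy_rq {
	struct toy_cfs_rq head;		/* circular list head */
};

static void toy_list_add(struct toy_rq *rq, struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->next = rq->head.next;
	cfs_rq->prev = &rq->head;
	rq->head.next->prev = cfs_rq;
	rq->head.next = cfs_rq;
}

static void toy_list_del(struct toy_cfs_rq *cfs_rq)
{
	cfs_rq->prev->next = cfs_rq->next;
	cfs_rq->next->prev = cfs_rq->prev;
}

/* Roughly what update_blocked_averages() does: visit every list entry. */
static int toy_update_blocked_averages(struct toy_rq *rq, int prune)
{
	struct toy_cfs_rq *pos, *n;
	int visited = 0;

	for (pos = rq->head.next; pos != &rq->head; pos = n) {
		n = pos->next;			/* "safe" iteration, as in the patch */
		if (pos->load_sum)
			pos->load_sum--;	/* pretend the blocked load decays */
		visited++;
		if (prune && !pos->load_sum) {
			toy_list_del(pos);	/* drop fully decayed entries */
			free(pos);
		}
	}
	return visited;
}

int main(void)
{
	struct toy_rq rq = { .head = { .next = &rq.head, .prev = &rq.head } };

	/* 10000 cgroups ran on this CPU at some point; only one keeps load. */
	for (int i = 0; i < 10000; i++) {
		struct toy_cfs_rq *c = calloc(1, sizeof(*c));
		c->load_sum = i ? 1 : 1000;
		toy_list_add(&rq, c);
	}

	printf("pass 1, no pruning:  %d entries walked\n",
	       toy_update_blocked_averages(&rq, 0));
	printf("pass 2, pruning:     %d entries walked\n",
	       toy_update_blocked_averages(&rq, 1));
	printf("pass 3, after prune: %d entries walked\n",
	       toy_update_blocked_averages(&rq, 1));
	return 0;
}

The flip side, which the question above is about, is that a pruned cfs_rq has
to be put back on the list the next time something in it is enqueued.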

>
> This shows up as a small increase in CPU consumption and scheduling
> latency in the load balancing path on systems with the CPU controller
> enabled across most cgroups.  In an edge case where temporary cgroups
> were leaking, this caused the kernel to spend a good several tens of
> percent of its CPU cycles running update_blocked_averages(), each run
> taking multiple milliseconds.
>
> This patch fixes the issue by taking empty and fully decayed cfs_rqs
> off the rq->leaf_cfs_rq_list.
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Mike Galbraith <efault@....de>
> Cc: Paul Turner <pjt@...gle.com>
> Cc: Chris Mason <clm@...com>
> Cc: stable@...r.kernel.org
> ---
> Just refreshed on top of the first patch.
>
>  kernel/sched/fair.c |   19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -369,8 +369,9 @@ static inline void list_del_leaf_cfs_rq(
>  }
>
>  /* Iterate thr' all leaf cfs_rq's on a runqueue */
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> -       list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)                     \
> +       list_for_each_entry_safe(cfs_rq, pos, &rq->leaf_cfs_rq_list,    \
> +                                leaf_cfs_rq_list)
>
>  /* Do the two (enqueued) entities belong to the same group ? */
>  static inline struct cfs_rq *
> @@ -463,7 +464,7 @@ static inline void list_del_leaf_cfs_rq(
>  {
>  }
>
> -#define for_each_leaf_cfs_rq(rq, cfs_rq) \
> +#define for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos)     \
>                 for (cfs_rq = &rq->cfs; cfs_rq; cfs_rq = NULL)
>
>  static inline struct sched_entity *parent_entity(struct sched_entity *se)
> @@ -6984,7 +6985,7 @@ static void attach_tasks(struct lb_env *
>  static void update_blocked_averages(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> -       struct cfs_rq *cfs_rq;
> +       struct cfs_rq *cfs_rq, *pos;
>         struct rq_flags rf;
>
>         rq_lock_irqsave(rq, &rf);
> @@ -6994,7 +6995,7 @@ static void update_blocked_averages(int
>          * Iterates the task_group tree in a bottom up fashion, see
>          * list_add_leaf_cfs_rq() for details.
>          */
> -       for_each_leaf_cfs_rq(rq, cfs_rq) {
> +       for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
>                 struct sched_entity *se;
>
>                 /* throttled entities do not contribute to load */
> @@ -7008,6 +7009,14 @@ static void update_blocked_averages(int
>                 se = cfs_rq->tg->se[cpu];
>                 if (se && !skip_blocked_update(se))
>                         update_load_avg(se, 0);
> +
> +               /*
> +                * There can be a lot of idle CPU cgroups.  Don't let fully
> +                * decayed cfs_rqs linger on the list.
> +                */
> +               if (!cfs_rq->load.weight && !cfs_rq->avg.load_sum &&
> +                   !cfs_rq->avg.util_sum && !cfs_rq->runnable_load_sum)
> +                       list_del_leaf_cfs_rq(cfs_rq);

list_add_leaf_cfs_rq() assumes that we always enqueue cfs_rqs bottom-up.
By removing a cfs_rq, can't we break this assumption in some cases?
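
To make the ordering concrete: the walk is bottom-up, so each child cfs_rq is
expected to appear on the list before its parent. Below is a toy checker for
that invariant, userspace C with made-up names (toy_cfs_rq, list_is_bottom_up),
not the real list_add_leaf_cfs_rq() logic; it is only meant to pin down what
"bottom-up" means here, not to answer whether the deletion above can actually
produce such an inversion.

/* Toy model, not kernel code: the children-before-parents ordering that
 * the bottom-up walk in update_blocked_averages() relies on. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct toy_cfs_rq {
	const char	  *name;
	struct toy_cfs_rq *parent;	/* NULL for the root cfs_rq */
};

/*
 * The walk folds each group's contribution upwards, so it is only correct
 * if no entry is visited before one of its descendants.
 */
static bool list_is_bottom_up(struct toy_cfs_rq *list[], int n)
{
	for (int i = 0; i < n; i++)
		for (int j = 0; j < i; j++)
			for (struct toy_cfs_rq *p = list[i]->parent; p; p = p->parent)
				if (list[j] == p)
					return false;	/* ancestor sits before its descendant */
	return true;
}

int main(void)
{
	struct toy_cfs_rq root = { "root", NULL  };
	struct toy_cfs_rq a    = { "A",    &root };
	struct toy_cfs_rq b    = { "A/B",  &a    };

	/* Normal bottom-up enqueue order: leaf first, root last. */
	struct toy_cfs_rq *ok[]       = { &b, &a, &root };
	/* The kind of inversion the question above is about: the parent A
	 * ends up in front of its child B. */
	struct toy_cfs_rq *inverted[] = { &root, &a, &b };

	printf("ok:       %s\n", list_is_bottom_up(ok, 3)       ? "bottom-up" : "BROKEN");
	printf("inverted: %s\n", list_is_bottom_up(inverted, 3) ? "bottom-up" : "BROKEN");
	return 0;
}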

Regards,
Vincent

>         }
>         rq_unlock_irqrestore(rq, &rf);
>  }
