linux-kernel - Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtAJE_eDgR8dmScgVbOgS9sQSJg27Mw62Z1htMCmgN_Yew@mail.gmail.com>
Date:   Wed, 12 Feb 2020 16:03:28 +0100
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Mel Gorman <mgorman@...e.de>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        Phil Auld <pauld@...hat.com>, Parth Shah <parth@...ux.ibm.com>,
        Valentin Schneider <valentin.schneider@....com>
Subject: Re: [RFC 2/4] sched/numa: replace runnable_load_avg by load_avg

On Wed, 12 Feb 2020 at 14:37, Mel Gorman <mgorman@...e.de> wrote:
>
> On Tue, Feb 11, 2020 at 06:46:49PM +0100, Vincent Guittot wrote:
> > Similarly to what has been done for the normal load balance, we can
> > replace runnable_load_avg by load_avg in numa load balancing and track
> > other statistics like the utilization and the number of running tasks to
> > get to better view of the current state of a node.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@...aro.org>
> > ---
> >  kernel/sched/fair.c | 102 ++++++++++++++++++++++++++++++++------------
> >  1 file changed, 75 insertions(+), 27 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a1ea02b5362e..6e4c2b012c48 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1473,38 +1473,35 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
> >              group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
> >  }
> >
> > -static inline unsigned long cfs_rq_runnable_load_avg(struct cfs_rq *cfs_rq);
> > -
> > -static unsigned long cpu_runnable_load(struct rq *rq)
> > -{
> > -     return cfs_rq_runnable_load_avg(&rq->cfs);
> > -}
> > +/*
> > + * 'numa_type' describes the node at the moment of load balancing.
> > + */
> > +enum numa_type {
> > +     /* The node has spare capacity that can be used to run more tasks.  */
> > +     node_has_spare = 0,
> > +     /*
> > +      * The node is fully used and the tasks don't compete for more CPU
> > +      * cycles. Nevertheless, some tasks might wait before running.
> > +      */
> > +     node_fully_busy,
> > +     /*
> > +      * The node is overloaded and can't provide expected CPU cycles to all
> > +      * tasks.
> > +      */
> > +     node_overloaded
> > +};
> >
>
> Ok, to reconcile this with the load balancer, it would need to account
> for imbalanced but that's ok in the context of this patch.
>
> >  /* Cached statistics for all CPUs within a node */
> >  struct numa_stats {
> >       unsigned long load;
> > -
> > +     unsigned long util;
> >       /* Total compute capacity of CPUs on a node */
> >       unsigned long compute_capacity;
> > +     unsigned int nr_running;
> > +     unsigned int weight;
> > +     enum numa_type node_type;
> >  };
> >
> > -/*
> > - * XXX borrowed from update_sg_lb_stats
> > - */
> > -static void update_numa_stats(struct numa_stats *ns, int nid)
> > -{
> > -     int cpu;
> > -
> > -     memset(ns, 0, sizeof(*ns));
> > -     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > -             struct rq *rq = cpu_rq(cpu);
> > -
> > -             ns->load += cpu_runnable_load(rq);
> > -             ns->compute_capacity += capacity_of(cpu);
> > -     }
> > -
> > -}
> > -
> >  struct task_numa_env {
> >       struct task_struct *p;
> >
> > @@ -1521,6 +1518,47 @@ struct task_numa_env {
> >       int best_cpu;
> >  };
> >
> > +static unsigned long cpu_load(struct rq *rq);
> > +static unsigned long cpu_util(int cpu);
> > +
> > +static inline enum
> > +numa_type numa_classify(unsigned int imbalance_pct,
> > +                      struct numa_stats *ns)
> > +{
> > +     if ((ns->nr_running > ns->weight) &&
> > +         ((ns->compute_capacity * 100) < (ns->util * imbalance_pct)))
> > +             return node_overloaded;
> > +
>
> Ok, so this is essentially group_is_overloaded.
>
>
> > +     if ((ns->nr_running < ns->weight) ||
> > +         ((ns->compute_capacity * 100) > (ns->util * imbalance_pct)))
> > +             return node_has_spare;
> > +
>
> And this is group_has_capacity. What I did was have a common helper
> for both NUMA and normal load balancing and translated the fields from
> sg_lb_stats and numa_stats into a common helper. This is to prevent them
> getting out of sync. The conversion was incomplete in my case but in
> principle, both NUMA and CPU load balancing should use common helpers or
> they'll get out of sync.

I fact, I wanted to keep this patch simple and readable for the 1st
version in order to not afraid people from reviewing it. That's the
main reason I didn't merge it with load_balance but i agree that some
common helper function might be possible.

Also the struct sg_lb_stats has a lot more fields compared to struct numa_stats

Then, I wonder if we could end up with different rules for numa like
taking into account some NUMA specifics metrics to classify the node

>
>
> > +     return node_fully_busy;
> > +}
> > +
> > +/*
> > + * XXX borrowed from update_sg_lb_stats
> > + */
> > +static void update_numa_stats(struct task_numa_env *env,
> > +                           struct numa_stats *ns, int nid)
> > +{
> > +     int cpu;
> > +
> > +     memset(ns, 0, sizeof(*ns));
> > +     for_each_cpu(cpu, cpumask_of_node(nid)) {
> > +             struct rq *rq = cpu_rq(cpu);
> > +
> > +             ns->load += cpu_load(rq);
> > +             ns->util += cpu_util(cpu);
> > +             ns->nr_running += rq->cfs.h_nr_running;
> > +             ns->compute_capacity += capacity_of(cpu);
> > +     }
> > +
> > +     ns->weight = cpumask_weight(cpumask_of_node(nid));
> > +
> > +     ns->node_type = numa_classify(env->imbalance_pct, ns);
> > +}
> > +
>
> Ok, this is more or less what I had except I wedged other stuff in there
> too specific to NUMA balancing to avoid multiple walks of the cpumask.
>
> >  static void task_numa_assign(struct task_numa_env *env,
> >                            struct task_struct *p, long imp)
> >  {
> > @@ -1556,6 +1594,11 @@ static bool load_too_imbalanced(long src_load, long dst_load,
> >       long orig_src_load, orig_dst_load;
> >       long src_capacity, dst_capacity;
> >
> > +
> > +     /* If dst node has spare capacity, there is no real load imbalance */
> > +     if (env->dst_stats.node_type == node_has_spare)
> > +             return false;
> > +
>
> Not exactly what the load balancer thinks though, the load balancer
> may decide to balance the tasks between NUMA groups even when there is
> capacity. That said, what you did here is compatible with the current code.
>
> I'll want to test this further but in general
>
> Acked-by: Mel Gorman <mgorman@...hsingularity.net>
>
> However, I really would have preferred if numa_classify shared common
> code with group_[is_overloaded|has_capacity] instead of having
> equivalent but different implementations.
>
> >       /*
> >        * The load is corrected for the CPU capacity available on each node.
> >        *
> > @@ -1788,10 +1831,10 @@ static int task_numa_migrate(struct task_struct *p)
> >       dist = env.dist = node_distance(env.src_nid, env.dst_nid);
> >       taskweight = task_weight(p, env.src_nid, dist);
> >       groupweight = group_weight(p, env.src_nid, dist);
> > -     update_numa_stats(&env.src_stats, env.src_nid);
> > +     update_numa_stats(&env, &env.src_stats, env.src_nid);
> >       taskimp = task_weight(p, env.dst_nid, dist) - taskweight;
> >       groupimp = group_weight(p, env.dst_nid, dist) - groupweight;
> > -     update_numa_stats(&env.dst_stats, env.dst_nid);
> > +     update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >
> >       /* Try to find a spot on the preferred nid. */
> >       task_numa_find_cpu(&env, taskimp, groupimp);
> > @@ -1824,7 +1867,7 @@ static int task_numa_migrate(struct task_struct *p)
> >
> >                       env.dist = dist;
> >                       env.dst_nid = nid;
> > -                     update_numa_stats(&env.dst_stats, env.dst_nid);
> > +                     update_numa_stats(&env, &env.dst_stats, env.dst_nid);
> >                       task_numa_find_cpu(&env, taskimp, groupimp);
> >               }
> >       }
> > @@ -5446,6 +5489,11 @@ static unsigned long cpu_load_without(struct rq *rq, struct task_struct *p)
> >       return load;
> >  }
> >
> > +static unsigned long cpu_runnable_load(struct rq *rq)
> > +{
> > +     return cfs_rq_runnable_load_avg(&rq->cfs);
> > +}
> > +
> >  static unsigned long capacity_of(int cpu)
> >  {
> >       return cpu_rq(cpu)->cpu_capacity;
> > --
> > 2.17.1
> >
>
> --
> Mel Gorman
> SUSE Labs