Message-ID: <CABk29Ns3KBwLXBSwiSe7Pv2YE9iMg+A1kPpPESWG=KNJu9dz0w@mail.gmail.com>
Date:   Mon, 9 May 2022 18:14:22 -0700
From:   Josh Don <joshdon@...gle.com>
To:     Abel Wu <wuyun.abel@...edance.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Mel Gorman <mgorman@...e.de>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

Hi Abel,

Overall this looks good, just a couple of comments.

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..79b4ff24faee 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6323,7 +6323,9 @@ static inline int select_idle_smt(struct task_struct *p, struct sched_domain *sd
>  static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool has_idle_core, int target)
>  {
>         struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);
> -       int i, cpu, idle_cpu = -1, nr = INT_MAX;
> +       struct sched_domain_shared *sds = sd->shared;
> +       int nr, nro, weight = sd->span_weight;
> +       int i, cpu, idle_cpu = -1;
>         struct rq *this_rq = this_rq();
>         int this = smp_processor_id();
>         struct sched_domain *this_sd;
> @@ -6333,7 +6335,23 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>         if (!this_sd)
>                 return -1;
>
> +       nro = atomic_read(&sds->nr_overloaded_cpus);
> +       if (nro == weight)
> +               goto out;

This assumes that the sd we're operating on here is the LLC domain
(true for the current use). To catch future bugs if that assumption
ever changes, perhaps we could WARN_ON_ONCE(nro > weight).
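
For example, something like this (untested), on top of your patch:

        nro = atomic_read(&sds->nr_overloaded_cpus);
        /* sd is assumed to be the LLC domain; warn if that ever changes */
        WARN_ON_ONCE(nro > weight);
        if (nro == weight)
                goto out;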

> +
> +       nr = min_t(int, weight, p->nr_cpus_allowed);
> +
> +       /*
> +        * It's unlikely to find an idle cpu if the system is under
> +        * heavy pressure, so skip searching to save a few cycles
> +        * and relieve cache traffic.
> +        */
> +       if (weight - nro < (nr >> 4) && !has_idle_core)
> +               return -1;

nit: nr / 16 is easier to read and the compiler will do the shifting for you.

Was < intentional vs. <=? With <= you'd also be able to skip the search
when both sides evaluate to 0, which can happen frequently when there
are no idle cpus and the task has a small affinity mask.

This heuristic will also get a bit confused when the task has many
cpus allowed but almost all of them are on a different LLC than the
one we're considering here. Rather than caching a per-LLC
nr_cpus_allowed, we could instead use cpumask_weight(cpus) below (and
only do so in the !has_idle_core case, to reduce calls to
cpumask_weight()).
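
Roughly what I have in mind, folding in the other nits as well
(untested, just to illustrate):

        cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

        if (!has_idle_core) {
                /* only count allowed cpus that are actually in this LLC */
                nr = cpumask_weight(cpus);
                if (weight - nro <= nr / 16)
                        return -1;
        }

        if (nro)
                cpumask_andnot(cpus, cpus, sdo_mask(sds));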

> +
>         cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> +       if (nro > 1)
> +               cpumask_andnot(cpus, cpus, sdo_mask(sds));

Just
if (nro)
?

> @@ -6392,6 +6407,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>
>                 update_avg(&this_sd->avg_scan_cost, time);
>         }
> +out:
> +       if (has_idle_core)
> +               WRITE_ONCE(sds->has_idle_cores, 0);

nit: use set_idle_cores() instead (or, if you really want to avoid the
extra sds dereference, add a __set_idle_cores(sds, val) helper that you
can call directly).
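
Something like this (untested), keeping set_idle_cores() as the
cpu-based wrapper:

        static inline void __set_idle_cores(struct sched_domain_shared *sds, int val)
        {
                WRITE_ONCE(sds->has_idle_cores, val);
        }

        static inline void set_idle_cores(int cpu, int val)
        {
                struct sched_domain_shared *sds;

                sds = rcu_dereference(per_cpu(sd_llc_shared, cpu));
                if (sds)
                        __set_idle_cores(sds, val);
        }

and then the hunk above would just call __set_idle_cores(sds, 0).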

> @@ -7904,6 +7922,7 @@ static struct task_struct *detach_one_task(struct lb_env *env)
>                         continue;
>
>                 detach_task(p, env);
> +               update_overloaded_rq(env->src_rq);
>
>                 /*
>                  * Right now, this is only the second place where
> @@ -8047,6 +8066,9 @@ static int detach_tasks(struct lb_env *env)
>                 list_move(&p->se.group_node, tasks);
>         }
>
> +       if (detached)
> +               update_overloaded_rq(env->src_rq);
> +

Thinking about this more, I don't see an issue with moving the
update_overloaded_rq() calls to enqueue/dequeue_task, rather than doing
them here in the attach/detach_task paths. The overloaded state only
changes when we cross the boundary of 2 runnable non-idle tasks, so
thrashing of the overloaded mask is much less worrisome than it would
be if the state were updated at the boundary of 1 runnable task. The
attach/detach_task paths, by contrast, only run as part of load
balancing, which can be on a millisecond time scale.
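
i.e. the placement I'm thinking of is roughly this (untested;
update_overloaded_rq() is the helper your patch adds):

        /* in enqueue_task_fair(), once the task is fully enqueued: */
        update_overloaded_rq(rq);

        /* in dequeue_task_fair(), once the dequeue has taken effect: */
        update_overloaded_rq(rq);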

Best,
Josh
