Message-ID: <20210416135113.GA16445@vingu-book>
Date:   Fri, 16 Apr 2021 15:51:13 +0200
From:   Vincent Guittot <vincent.guittot@...aro.org>
To:     Valentin Schneider <valentin.schneider@....com>
Cc:     linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        Qais Yousef <qais.yousef@....com>,
        Quentin Perret <qperret@...gle.com>,
        Pavan Kondeti <pkondeti@...eaurora.org>,
        Rik van Riel <riel@...riel.com>,
        Lingutla Chandrasekhar <clingutla@...eaurora.org>
Subject: Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks

On Thursday 15 Apr 2021 at 18:58:46 (+0100), Valentin Schneider wrote:
> Consider the following topology:
> 
>   DIE [          ]
>   MC  [    ][    ]
>        0  1  2  3
> 
>   capacity_orig_of(x \in {0-1}) < capacity_orig_of(x \in {2-3})
> 
> w/ CPUs 2-3 idle and CPUs 0-1 running CPU hogs (util_avg=1024).
> 
> When CPU2 goes through load_balance() (via periodic / NOHZ balance), it
> should pull one CPU hog from either CPU0 or CPU1 (this is misfit task
> upmigration). However, should e.g. a pcpu kworker wake up on CPU0 just
> before this load_balance() happens and preempt the CPU hog running there,
> we would have, for the [0-1] group at CPU2's DIE level:
> 
> o sgs->sum_nr_running > sgs->group_weight
> o sgs->group_capacity * 100 < sgs->group_util * imbalance_pct
> 
> IOW, this group is group_overloaded.
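
(For concreteness, with made-up numbers: assume little CPUs of capacity
446 and imbalance_pct = 117 at DIE level. Then group_weight = 2 and
group_capacity ~= 892; the two hogs plus the kworker give
sum_nr_running = 3 > 2, and group_util ~= 2048, so
892 * 100 = 89200 < 2048 * 117 = 239616. Both conditions indeed hold.)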
> 
> Considering CPU0 is picked by find_busiest_queue(), we would then visit the
> preempted CPU hog in detach_tasks(). However, given it has just been
> preempted by this pcpu kworker, task_hot() will prevent it from being
> detached. We then leave load_balance() without having done anything.
> 
> Long story short, preempted misfit tasks are affected by task_hot(), while
> currently running misfit tasks are intentionally preempted by the stopper
> task to migrate them over to a higher-capacity CPU.
> 
> Align detach_tasks() with the active-balance logic and let it pick a
> cache-hot misfit task when the destination CPU can provide a capacity
> uplift.
> 
> Signed-off-by: Valentin Schneider <valentin.schneider@....com>
> ---
>  kernel/sched/fair.c | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d2d1a69d7aa7..43fc98d34276 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7493,6 +7493,7 @@ struct lb_env {
>  	enum fbq_type		fbq_type;
>  	enum migration_type	migration_type;
>  	enum group_type         src_grp_type;
> +	enum group_type         dst_grp_type;
>  	struct list_head	tasks;
>  };
>  
> @@ -7533,6 +7534,31 @@ static int task_hot(struct task_struct *p, struct lb_env *env)
>  	return delta < (s64)sysctl_sched_migration_cost;
>  }
>  
> +
> +/*
> + * What does migrating this task do to our capacity-aware scheduling criterion?
> + *
> + * Returns 1, if the task needs more capacity than the dst CPU can provide.
> + * Returns 0, if the task needs the extra capacity provided by the dst CPU.
> + * Returns -1, if the task isn't impacted by the migration wrt capacity.
> + */
> +static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
> +{
> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> +		return -1;
> +
> +	if (!task_fits_capacity(p, capacity_of(env->src_cpu))) {
> +		if (cpu_capacity_greater(env->dst_cpu, env->src_cpu))
> +			return 0;
> +		else if (cpu_capacity_greater(env->src_cpu, env->dst_cpu))
> +			return 1;
> +		else
> +			return -1;
> +	}

Being here means that the task fits src_cpu's capacity, so why test p against dst_cpu?

> +
> +	return task_fits_capacity(p, capacity_of(env->dst_cpu)) ? -1 : 1;
> +}

I prefer the below, which is easier to read because the same variables are used everywhere, and it lets you remove cpu_capacity_greater:

static int migrate_degrades_capacity(struct task_struct *p, struct lb_env *env)
{
    unsigned long src_capacity, dst_capacity;

    if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
        return -1;

    src_capacity = capacity_of(env->src_cpu);
    dst_capacity = capacity_of(env->dst_cpu);

    if (!task_fits_capacity(p, src_capacity)) {
        if (capacity_greater(dst_capacity, src_capacity))
            return 0;
        else if (capacity_greater(src_capacity, dst_capacity))
            return 1;
        else
            return -1;
    }

    return task_fits_capacity(p, dst_capacity) ? -1 : 1;
}


> +
>  #ifdef CONFIG_NUMA_BALANCING
>  /*
>   * Returns 1, if task migration degrades locality
> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (tsk_cache_hot == -1)
>  		tsk_cache_hot = task_hot(p, env);
>  
> +	/*
> +	 * On a (sane) asymmetric CPU capacity system, the increase in compute
> +	 * capacity should offset any potential performance hit caused by a
> +	 * migration.
> +	 */
> +	if ((env->dst_grp_type == group_has_spare) &&

Shouldn't it be env->src_grp_type == group_misfit_task, so that we only care
about the misfit-task case, as stated in $subject?
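
i.e. something like this (untested sketch, reusing the hunk above):

	if ((env->src_grp_type == group_misfit_task) &&
	    !migrate_degrades_capacity(p, env))
		tsk_cache_hot = 0;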


> +	    !migrate_degrades_capacity(p, env))
> +		tsk_cache_hot = 0;
> +
>  	if (tsk_cache_hot <= 0 ||
>  	    env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
>  		if (tsk_cache_hot == 1) {
> @@ -9310,6 +9345,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
>  	if (!sds.busiest)
>  		goto out_balanced;
>  
> +	env->dst_grp_type = local->group_type;
>  	env->src_grp_type = busiest->group_type;
>  
>  	/* Misfit tasks should be dealt with regardless of the avg load */
> -- 
> 2.25.1
> 
