linux-kernel - Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when updating misfit

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <213f94df-cc36-4281-805d-9f56cbfef796@arm.com>
Date: Mon, 22 Jan 2024 09:59:42 +0000
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Qais Yousef <qyousef@...alina.io>, Ingo Molnar <mingo@...nel.org>,
 Peter Zijlstra <peterz@...radead.org>,
 Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel@...r.kernel.org, Pierre Gondois <Pierre.Gondois@....com>
Subject: Re: [PATCH v4 1/2] sched/fair: Check a task has a fitting cpu when
 updating misfit

On 05/01/2024 23:20, Qais Yousef wrote:
> From: Qais Yousef <qais.yousef@....com>
> 
> If a misfit task is affined to a subset of the possible cpus, we need to
> verify that one of these cpus can fit it. Otherwise the load balancer
> code will continuously trigger needlessly leading the balance_interval
> to increase in return and eventually end up with a situation where real
> imbalances take a long time to address because of this impossible
> imbalance situation.
> 
> This can happen in Android world where it's common for background tasks
> to be restricted to little cores.
> 
> Similarly if we can't fit the biggest core, triggering misfit is
> pointless as it is the best we can ever get on this system.
> 
> To be able to detect that; we use asym_cap_list to iterate through
> capacities in the system to see if the task is able to run at a higher
> capacity level based on its p->cpus_ptr. To do so safely, we convert the
> list to be RCU protected.
> 
> To be able to iterate through capacity levels, export asym_cap_list to
> allow for fast traversal of all available capacity levels in the system.
> 
> Test:
> =====
> 
> Add
> 
> 	trace_printk("balance_interval = %lu\n", interval)
> 
> in get_sd_balance_interval().
> 
> run
> 	if [ "$MASK" != "0" ]; then
> 		adb shell "taskset -a $MASK cat /dev/zero > /dev/null"
> 	fi
> 	sleep 10
> 	// parse ftrace buffer counting the occurrence of each valaue
> 
> Where MASK is either:
> 
> 	* 0: no busy task running

.. no busy task stands for no misfit scenario?

> 	* 1: busy task is pinned to 1 cpu; handled today to not cause
> 	  misfit
> 	* f: busy task pinned to little cores, simulates busy background
> 	  task, demonstrates the problem to be fixed
> 

[...]

> +	/*
> +	 * If the task affinity is not set to default, make sure it is not
> +	 * restricted to a subset where no CPU can ever fit it. Triggering
> +	 * misfit in this case is pointless as it has no where better to move
> +	 * to. And it can lead to balance_interval to grow too high as we'll
> +	 * continuously fail to move it anywhere.
> +	 */
> +	if (!cpumask_equal(p->cpus_ptr, cpu_possible_mask)) {

Shouldn't this be cpu_active_mask ?

include/linux/cpumask.h

 * cpu_possible_mask- has bit 'cpu' set iff cpu is populatable
 * cpu_present_mask - has bit 'cpu' set iff cpu is populated
 * cpu_online_mask  - has bit 'cpu' set iff cpu available to scheduler
 * cpu_active_mask  - has bit 'cpu' set iff cpu available to migration


> +		unsigned long clamped_util = clamp(util, uclamp_min, uclamp_max);
> +		bool has_fitting_cpu = false;
> +		struct asym_cap_data *entry;
> +
> +		rcu_read_lock();
> +		list_for_each_entry_rcu(entry, &asym_cap_list, link) {
> +			if (entry->capacity > cpu_cap) {
> +				cpumask_t *cpumask;
> +
> +				if (clamped_util > entry->capacity)
> +					continue;
> +
> +				cpumask = cpu_capacity_span(entry);
> +				if (!cpumask_intersects(p->cpus_ptr, cpumask))
> +					continue;
> +
> +				has_fitting_cpu = true;
> +				break;
> +			}
> +		}

What happen when we hotplug out all CPUs of one CPU capacity value?
IMHO, we don't call asym_cpu_capacity_scan() with !new_topology
(partition_sched_domains_locked()).

> +		rcu_read_unlock();
> +
> +		if (!has_fitting_cpu)
> +			goto out;
>  	}
>  
>  	/*
> @@ -5083,6 +5127,9 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
>  	 * task_h_load() returns 0.
>  	 */
>  	rq->misfit_task_load = max_t(unsigned long, task_h_load(p), 1);
> +	return;
> +out:
> +	rq->misfit_task_load = 0;
>  }
>  
>  #else /* CONFIG_SMP */
> @@ -9583,9 +9630,7 @@ check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
>   */
>  static inline int check_misfit_status(struct rq *rq, struct sched_domain *sd)
>  {
> -	return rq->misfit_task_load &&
> -		(arch_scale_cpu_capacity(rq->cpu) < rq->rd->max_cpu_capacity ||
> -		 check_cpu_capacity(rq, sd));
> +	return rq->misfit_task_load && check_cpu_capacity(rq, sd);

You removed 'arch_scale_cpu_capacity(rq->cpu) <
rq->rd->max_cpu_capacity' here. Why? I can see that with the standard
setup (max CPU capacity equal 1024) which is what we probably use 100%
of the time now. It might get useful again when Vincent will introduce
his 'user space system pressure' implementation?

>  }

[...]

> @@ -1423,8 +1418,8 @@ static void asym_cpu_capacity_scan(void)
>  
>  	list_for_each_entry_safe(entry, next, &asym_cap_list, link) {
>  		if (cpumask_empty(cpu_capacity_span(entry))) {
> -			list_del(&entry->link);
> -			kfree(entry);
> +			list_del_rcu(&entry->link);
> +			call_rcu(&entry->rcu, free_asym_cap_entry);

Looks like there could be brief moments in which one CPU capacity group
of CPUs could be twice in asym_cap_list. I'm thinking about initial
startup + max CPU frequency related adjustment of CPU capacity
(init_cpu_capacity_callback()) for instance. Not sure if this is really
an issue?

[...]