Message-ID: <Zxqc7jI8USKFX9-p@slm.duckdns.org>
Date: Thu, 24 Oct 2024 09:15:58 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched_ext: Introduce NUMA awareness to the default idle
selection policy
Hello, Andrea.
On Thu, Oct 24, 2024 at 10:36:15AM +0200, Andrea Righi wrote:
> Similarly to commit dfa4ed29b18c ("sched_ext: Introduce LLC awareness to
> the default idle selection policy"), extend the built-in idle CPU
> selection policy to also prioritize CPUs within the same NUMA node.
>
> With this change applied, the built-in CPU idle selection policy follows
> this logic:
> - always prioritize CPUs from fully idle SMT cores,
> - select the same CPU if possible,
> - select a CPU within the same LLC domain,
> - select a CPU within the same NUMA node.
>
> Note that LLC and NUMA awareness optimizations are only applied when
> CONFIG_SCHED_MC is enabled.
The idea generally looks good to me but I have a couple quibbles about the
implementation.
...
> +enum scx_domain_type {
> + SCX_DOM_LLC, /* Use the same last-level cache (LLC) */
> + SCX_DOM_NUMA, /* Use the same NUMA node */
> +};
> +
> #ifdef CONFIG_SCHED_MC
> /*
> - * Return the cpumask of CPUs usable by task @p in the same LLC domain of @cpu,
> - * or NULL if the LLC domain cannot be determined.
> + * Return the cpumask of CPUs usable by task @p in the same domain of @cpu, or
> + * NULL if the domain cannot be determined.
> */
> -static const struct cpumask *llc_domain(const struct task_struct *p, s32 cpu)
> +static const struct cpumask *
> +scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
> {
> - struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, cpu));
> - const struct cpumask *llc_cpus = sd ? sched_domain_span(sd) : NULL;
> + struct sched_domain *sd;
>
> /*
> - * Return the LLC domain only if the task is allowed to run on all
> - * CPUs.
> - */
> - return p->nr_cpus_allowed == nr_cpu_ids ? llc_cpus : NULL;
> + * Determine the scheduling domain only if the task is allowed to run
> + * on all CPUs.
> + *
> + * This is done primarily for efficiency, as it avoids the overhead of
> + * updating a cpumask every time we need to select an idle CPU (which
> + * can be costly in large SMP systems), but it also aligns logically:
> + * if a task's scheduling domain is restricted by user-space (through
> + * CPU affinity), the task will simply use the flat scheduling domain
> + * defined by user-space.
> + */
> + if (p->nr_cpus_allowed < nr_cpu_ids)
> + return NULL;
> +
> + switch (type) {
> + case SCX_DOM_LLC:
> + sd = rcu_dereference(per_cpu(sd_llc, cpu));
> + break;
> + case SCX_DOM_NUMA:
> + sd = rcu_dereference(per_cpu(sd_numa, cpu));
> + break;
> + default:
> + WARN_ON_ONCE(1);
> + sd = NULL;
> + }
> + if (!sd)
> + return NULL;
> +
> + return sched_domain_span(sd);
> }
> #else /* CONFIG_SCHED_MC */
> -static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
> +static const struct cpumask *
> +scx_domain(const struct task_struct *p, s32 cpu, enum scx_domain_type type)
> {
> return NULL;
> }
> #endif /* CONFIG_SCHED_MC */
...
> @@ -3156,7 +3210,8 @@ static inline const struct cpumask *llc_domain(struct task_struct *p, s32 cpu)
> static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> u64 wake_flags, bool *found)
> {
> - const struct cpumask *llc_cpus = llc_domain(p, prev_cpu);
> + const struct cpumask *llc_cpus = scx_domain(p, prev_cpu, SCX_DOM_LLC);
> + const struct cpumask *numa_cpus = scx_domain(p, prev_cpu, SCX_DOM_NUMA);
This feels like a lot of code which can just be:

	const struct cpumask *llc_cpus = NULL, *numa_cpus = NULL;
#ifdef CONFIG_SCHED_MC
	struct sched_domain *llc_sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
	struct sched_domain *numa_sd = rcu_dereference(per_cpu(sd_numa, prev_cpu));

	/* sd_llc/sd_numa are sched_domain pointers, so take their spans */
	if (llc_sd)
		llc_cpus = sched_domain_span(llc_sd);
	if (numa_sd)
		numa_cpus = sched_domain_span(numa_sd);
#endif
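(The p->nr_cpus_allowed check from the patch would still need to sit in
front of this, and the NULL checks matter because the sd_llc/sd_numa
levels may not exist on a given machine.)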
> s32 cpu;
>
> *found = false;
> @@ -3226,6 +3281,15 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> goto cpu_found;
> }
>
> + /*
> + * Search for any fully idle core in the same NUMA node.
> + */
> + if (numa_cpus) {
> + cpu = scx_pick_idle_cpu(numa_cpus, SCX_PICK_IDLE_CORE);
> + if (cpu >= 0)
> + goto cpu_found;
> + }
I'm not convinced by the argument that always doing the extra pick is
beneficial. Sure, the overhead is minimal, but isn't it also trivial to
avoid by just testing whether llc_cpus and numa_cpus span the same CPUs
(they resolve to the same set when there's only one node, right)?
Taking a step further, the topology information is really static and
can be determined during boot. Wouldn't it make more sense to just skip
the unnecessary steps depending on topology? I'm not sure the
difference would be measurable, but if you care we can make these
checks static_keys too.
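For example, the runtime version could be as simple as the following
(untested sketch; note cpumask_equal() rather than a raw pointer
compare, since the two spans come from different sched_domain levels):

	/*
	 * If the NUMA span covers exactly the same CPUs as the LLC span,
	 * the NUMA pass can never find anything the LLC pass missed, so
	 * drop it up front.
	 */
	if (llc_cpus && numa_cpus && cpumask_equal(llc_cpus, numa_cpus))
		numa_cpus = NULL;

The static_key variant would hoist the same decision to a one-time
topology check at boot, so the wakeup path wouldn't need to look at the
masks at all.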
Thanks.
--
tejun