Message-ID: <Z0aFZJZ85LUwapZE@yury-ThinkPad>
Date: Tue, 26 Nov 2024 18:35:16 -0800
From: Yury Norov <yury.norov@...il.com>
To: Andrea Righi <arighi@...dia.com>
Cc: Tejun Heo <tj@...nel.org>, David Vernet <void@...ifault.com>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] sched_ext: Introduce per-NUMA idle cpumasks
On Tue, Nov 26, 2024 at 10:56:40AM +0100, Andrea Righi wrote:
> Using a single global idle mask can lead to inefficiencies and a lot of
> stress on the cache coherency protocol on large systems with multiple
> NUMA nodes, since all the CPUs can create a really intense read/write
> activity on the single global cpumask.
>
> Therefore, split the global cpumask into multiple per-NUMA node cpumasks
> to improve scalability and performance on large systems.
>
> The concept is that each cpumask will track only the idle CPUs within
> its corresponding NUMA node, treating CPUs in other NUMA nodes as busy.
> In this way concurrent access to the idle cpumask will be restricted
> within each NUMA node.
This is definitely a complication of the code. Do you have any numbers
to justify it?
> [Open issue]
>
> The scx_bpf_get_idle_cpu/smtmask() kfunc's, that are supposed to return
> a single cpumask for all the CPUs, have been changed to report only the
> cpumask of the current NUMA node (using the current CPU); this breaks
> the old behavior, so it can potentially introduce regressions in some
> scx schedulers.
>
> An alternative approach could be to construct a global cpumask
> on-the-fly, but this could add significant overhead to ops.select_cpu()
> for schedulers relying on these kfunc's. Additionally, it would be less
> reliable than accessing the actual cpumasks, as the copy could quickly
> become out of sync and not represent the actual idle state very well.
>
> Probably a better way to solve this issue is to introduce new kfunc's to
> explicitly select specific per-NUMA cpumask and modify the scx
> schedulers to transition to this new API, for example:
>
> const struct cpumask *scx_bpf_get_idle_numa_cpumask(int node)
> const struct cpumask *scx_bpf_get_idle_numa_smtmask(int node)
>
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---
> kernel/sched/ext.c | 110 ++++++++++++++++++++++++++++++++-------------
> 1 file changed, 78 insertions(+), 32 deletions(-)
>
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> index 3c4a94e4258f..ea2fae2db10e 100644
> --- a/kernel/sched/ext.c
> +++ b/kernel/sched/ext.c
> @@ -933,7 +933,37 @@ static struct delayed_work scx_watchdog_work;
> static struct {
> cpumask_var_t cpu;
> cpumask_var_t smt;
> -} idle_masks CL_ALIGNED_IF_ONSTACK;
> +} **idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +static struct cpumask *get_idle_cpumask(int cpu)
> +{
> + int node = cpu_to_node(cpu);
> +
> + return idle_masks[node]->cpu;
> +}
> +
> +static struct cpumask *get_idle_smtmask(int cpu)
> +{
> + int node = cpu_to_node(cpu);
> +
> + return idle_masks[node]->smt;
> +}
> +
> +static void idle_masks_init(void)
> +{
> + int node;
> +
> + idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);
> + BUG_ON(!idle_masks);
> +
> + for_each_node_state(node, N_POSSIBLE) {
> + idle_masks[node] = kzalloc_node(sizeof(**idle_masks), GFP_KERNEL, node);
> + BUG_ON(!idle_masks[node]);
> +
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->cpu, GFP_KERNEL, node));
> + BUG_ON(!alloc_cpumask_var_node(&idle_masks[node]->smt, GFP_KERNEL, node));
> + }
> +}
>
> #endif /* CONFIG_SMP */
>
> @@ -3156,39 +3186,48 @@ static bool test_and_clear_cpu_idle(int cpu)
> */
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask(cpu);
>
> /*
> * If offline, @cpu is not its own sibling and
> * scx_pick_idle_cpu() can get caught in an infinite loop as
> - * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> - * is eventually cleared.
> + * @cpu is never cleared from the idle SMT mask. Ensure that
> + * @cpu is eventually cleared.
> */
> - if (cpumask_intersects(smt, idle_masks.smt))
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> - else if (cpumask_test_cpu(cpu, idle_masks.smt))
> - __cpumask_clear_cpu(cpu, idle_masks.smt);
> + if (cpumask_intersects(smt, idle_smt))
> + cpumask_andnot(idle_smt, idle_smt, smt);
cpumask_andnot() is a no-op if the masks don't intersect.
cpumask_intersects() is O(N), so you can safely drop it and
save some cycles.
> + else if (cpumask_test_cpu(cpu, idle_smt))
> + __cpumask_clear_cpu(cpu, idle_smt);
Same here. If the CPU is already clear, __cpumask_clear_cpu() is a no-op.
I feel like you can just clear all those CPUs unconditionally. In
the worst case, you'll clear them twice, which is harmless. Or am I
misunderstanding something?
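Something like this, maybe (untested, just to illustrate the idea):

	if (sched_smt_active()) {
		const struct cpumask *smt = cpu_smt_mask(cpu);
		struct cpumask *idle_smt = get_idle_smtmask(cpu);

		/*
		 * Clearing bits that are already clear is harmless, so
		 * clear the whole sibling group plus @cpu itself
		 * unconditionally. The extra __cpumask_clear_cpu() also
		 * covers the offline case where @cpu is not part of @smt.
		 */
		cpumask_andnot(idle_smt, idle_smt, smt);
		__cpumask_clear_cpu(cpu, idle_smt);
	}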
> }
> #endif
> - return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> + return cpumask_test_and_clear_cpu(cpu, get_idle_cpumask(cpu));
> }
>
> static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> {
> - int cpu;
> + int start = cpu_to_node(smp_processor_id());
> + int node, cnt, cpu;
>
> retry:
> if (sched_smt_active()) {
> - cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> - if (cpu < nr_cpu_ids)
> - goto found;
> + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> + cpu = cpumask_any_and_distribute(idle_masks[node]->smt,
> + cpus_allowed);
Nit: no need to break the line here.
> + if (cpu < nr_cpu_ids)
> + goto found;
> + }
>
> if (flags & SCX_PICK_IDLE_CORE)
> return -EBUSY;
> }
>
> - cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> - if (cpu >= nr_cpu_ids)
> - return -EBUSY;
> + for_each_node_state_from(node, N_ONLINE, start, cnt) {
> + cpu = cpumask_any_and_distribute(idle_masks[node]->cpu,
> + cpus_allowed);
> + if (cpu < nr_cpu_ids)
> + goto found;
> + }
> + return -EBUSY;
>
> found:
> if (test_and_clear_cpu_idle(cpu))
> @@ -3401,7 +3440,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> * piled up on it even if there is an idle core elsewhere on
> * the system.
> */
> - if (!cpumask_empty(idle_masks.cpu) &&
> + if (!cpumask_empty(get_idle_cpumask(cpu)) &&
> !(current->flags & PF_EXITING) &&
> cpu_rq(cpu)->scx.local_dsq.nr == 0) {
> if (cpumask_test_cpu(cpu, p->cpus_ptr))
cpumask_empty() is O(N), while the other checks are O(1). If you reorder
them so that cpumask_empty() is the last check, you increase the
probability that it will not be evaluated at all.
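I.e. keep the cheap O(1) tests first, something like (untested):

	/* Evaluate the O(N) cpumask_empty() only if the cheap checks pass */
	if (!(current->flags & PF_EXITING) &&
	    cpu_rq(cpu)->scx.local_dsq.nr == 0 &&
	    !cpumask_empty(get_idle_cpumask(cpu))) {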
> @@ -3417,7 +3456,7 @@ static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu,
> /*
> * Keep using @prev_cpu if it's part of a fully idle core.
> */
> - if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> + if (cpumask_test_cpu(prev_cpu, get_idle_smtmask(prev_cpu)) &&
> test_and_clear_cpu_idle(prev_cpu)) {
> cpu = prev_cpu;
> goto cpu_found;
> @@ -3560,12 +3599,18 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>
> static void reset_idle_masks(void)
> {
> + int node;
> +
> /*
> * Consider all online cpus idle. Should converge to the actual state
> * quickly.
> */
> - cpumask_copy(idle_masks.cpu, cpu_online_mask);
> - cpumask_copy(idle_masks.smt, cpu_online_mask);
> + for_each_node_state(node, N_POSSIBLE) {
> + const struct cpumask *node_mask = cpumask_of_node(node);
> +
> + cpumask_and(idle_masks[node]->cpu, cpu_online_mask, node_mask);
> + cpumask_and(idle_masks[node]->smt, cpu_online_mask, node_mask);
cpumask_copy(idle_masks[node]->smt, idle_masks[node]->cpu)
> + }
> }
>
> void __scx_update_idle(struct rq *rq, bool idle)
> @@ -3579,13 +3624,15 @@ void __scx_update_idle(struct rq *rq, bool idle)
> }
>
> if (idle)
> - cpumask_set_cpu(cpu, idle_masks.cpu);
> + cpumask_set_cpu(cpu, get_idle_cpumask(cpu));
> else
> - cpumask_clear_cpu(cpu, idle_masks.cpu);
> + cpumask_clear_cpu(cpu, get_idle_cpumask(cpu));
assign_cpu(cpu, get_idle_cpumask(cpu), idle)
Thanks,
Yury
>
> #ifdef CONFIG_SCHED_SMT
> if (sched_smt_active()) {
> const struct cpumask *smt = cpu_smt_mask(cpu);
> + struct cpumask *idle_cpu = get_idle_cpumask(cpu);
> + struct cpumask *idle_smt = get_idle_smtmask(cpu);
>
> if (idle) {
> /*
> @@ -3593,12 +3640,12 @@ void __scx_update_idle(struct rq *rq, bool idle)
> * it's only for optimization and self-correcting.
> */
> for_each_cpu(cpu, smt) {
> - if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> + if (!cpumask_test_cpu(cpu, idle_cpu))
> return;
> }
> - cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_or(idle_smt, idle_smt, smt);
> } else {
> - cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> + cpumask_andnot(idle_smt, idle_smt, smt);
> }
> }
> #endif
> @@ -6174,8 +6221,7 @@ void __init init_sched_ext_class(void)
>
> BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> #ifdef CONFIG_SMP
> - BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> - BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> + idle_masks_init();
> #endif
> scx_kick_cpus_pnt_seqs =
> __alloc_percpu(sizeof(scx_kick_cpus_pnt_seqs[0]) * nr_cpu_ids,
> @@ -7321,7 +7367,7 @@ __bpf_kfunc void scx_bpf_put_cpumask(const struct cpumask *cpumask)
>
> /**
> * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> @@ -7333,7 +7379,7 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
> }
>
> #ifdef CONFIG_SMP
> - return idle_masks.cpu;
> + return get_idle_cpumask(smp_processor_id());
> #else
> return cpu_none_mask;
> #endif
> @@ -7341,8 +7387,8 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_cpumask(void)
>
> /**
> * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> - * per-physical-core cpumask. Can be used to determine if an entire physical
> - * core is free.
> + * per-physical-core cpumask of the current NUMA node. Can be used to determine
> + * if an entire physical core is free.
> *
> * Returns NULL if idle tracking is not enabled, or running on a UP kernel.
> */
> @@ -7355,9 +7401,9 @@ __bpf_kfunc const struct cpumask *scx_bpf_get_idle_smtmask(void)
>
> #ifdef CONFIG_SMP
> if (sched_smt_active())
> - return idle_masks.smt;
> + return get_idle_smtmask(smp_processor_id());
> else
> - return idle_masks.cpu;
> + return get_idle_cpumask(smp_processor_id());
> #else
> return cpu_none_mask;
> #endif
> --
> 2.47.0