linux-kernel - Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks

Message-ID: <Z2IHsuzeW5e7MAr6@slm.duckdns.org>
Date: Tue, 17 Dec 2024 13:22:26 -1000
From: Tejun Heo <tj@...nel.org>
To: Andrea Righi <arighi@...dia.com>
Cc: David Vernet <void@...ifault.com>, Changwoo Min <changwoo@...lia.com>,
	Yury Norov <yury.norov@...il.com>, Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 3/6] sched_ext: Introduce per-node idle cpumasks

On Tue, Dec 17, 2024 at 10:32:28AM +0100, Andrea Righi wrote:
> +static int validate_node(int node)
> +{
> +	/* If no node is specified, return the current one */
> +	if (node == NUMA_NO_NODE)
> +		return numa_node_id();
> +
> +	/* Make sure node is in the range of possible nodes */
> +	if (node < 0 || node >= num_possible_nodes())
> +		return -EINVAL;

Are node IDs guaranteed to be consecutive? Shouldn't it be `node >=
nr_node_ids`? Also, we should probably check node_possible(node) too.
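
To illustrate the concern, here is a minimal userspace sketch, not kernel
code. It assumes a machine whose possible nodes are 0 and 2 (a sparse ID
space). All model_* names are hypothetical stand-ins for
num_possible_nodes(), nr_node_ids and node_possible().

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model: bit N set means node N is possible (nodes 0 and 2). */
static unsigned long model_possible_nodes = 0x5UL;

/* Models num_possible_nodes(): how many nodes are possible. */
static int model_num_possible_nodes(void)
{
	unsigned long m = model_possible_nodes;
	int count = 0;

	for (; m; m >>= 1)
		count += m & 1;
	return count;
}

/* Models nr_node_ids: highest possible node ID plus one. */
static int model_nr_node_ids(void)
{
	unsigned long m = model_possible_nodes;
	int ids = 0;

	for (; m; m >>= 1)
		ids++;
	return ids;
}

/* Models node_possible(): is this particular ID a possible node? */
static bool model_node_possible(int node)
{
	return (model_possible_nodes >> node) & 1;
}

/* Range check as in the quoted patch: bounds the ID by the node count. */
int model_validate_by_count(int node)
{
	if (node < 0 || node >= model_num_possible_nodes())
		return -EINVAL;
	return node;
}

/* Range check bounded by the ID space, plus a per-node possibility test. */
int model_validate_by_id(int node)
{
	if (node < 0 || node >= model_nr_node_ids() || !model_node_possible(node))
		return -EINVAL;
	return node;
}
```

In this model the count-based check rejects node 2, which is valid, and
accepts node 1, which is a hole. The ID-based check handles both cases.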

> +/*
> + * cpumasks to track idle CPUs within each NUMA node.
> + *
> + * If SCX_OPS_BUILTIN_IDLE_PER_NODE is not specified, a single flat cpumask
> + * from node 0 is used to track all idle CPUs system-wide.
> + */
> +static struct idle_cpumask **idle_masks CL_ALIGNED_IF_ONSTACK;

As the masks are allocated separately anyway, the aligned attribute can be
dropped. There's no reason to align the index array.

> +static struct cpumask *get_idle_mask_node(int node, bool smt)
> +{
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node))
> +		return smt ? idle_masks[0]->smt : idle_masks[0]->cpu;
> +
> +	node = validate_node(node);

It's odd to validate input node in an internal function. If node is being
passed from BPF side, we should validate it and trigger scx_ops_error() if
invalid, but once the node number is inside the kernel, we should be able to
trust it.

> +static struct cpumask *get_idle_cpumask_node(int node)
> +{
> +	return get_idle_mask_node(node, false);

Maybe make the inner function return `struct idle_cpumasks *` so that the
caller can pick between cpu and smt?
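
One possible shape for this, sketched in userspace C with plain 64-bit words
standing in for cpumasks. The names (model_node_masks, model_idle_masks() and
the two accessors) and the fixed node count are assumptions for illustration
only, not what the patch or the kernel defines.

```c
#include <stdint.h>

#define MODEL_NR_NODES 2	/* hypothetical node count for the sketch */

/* Hypothetical per-node pair; one bit per CPU. */
struct model_node_masks {
	uint64_t cpu;	/* idle CPUs on this node */
	uint64_t smt;	/* fully idle SMT cores on this node */
};

static struct model_node_masks model_scx_idle_masks[MODEL_NR_NODES];

/* Single inner lookup: hands back the whole per-node pair. */
struct model_node_masks *model_idle_masks(int node)
{
	return &model_scx_idle_masks[node];
}

/* Thin accessors; the caller decides which member it wants. */
uint64_t *model_idle_cpumask(int node)
{
	return &model_idle_masks(node)->cpu;
}

uint64_t *model_idle_smtmask(int node)
{
	return &model_idle_masks(node)->smt;
}
```

With this layout the bool selector disappears from the inner function, and
callers that need both masks do one lookup instead of two.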

> +static void idle_masks_init(void)
> +{
> +	int node;
> +
> +	idle_masks = kcalloc(num_possible_nodes(), sizeof(*idle_masks), GFP_KERNEL);

We probably want to use a variable name which is more qualified for a global
variable - scx_idle_masks?

> @@ -3173,6 +3245,9 @@ bool scx_prio_less(const struct task_struct *a, const struct task_struct *b,
>  
>  static bool test_and_clear_cpu_idle(int cpu)
>  {
> +	int node = cpu_to_node(cpu);
> +	struct cpumask *idle_cpu = get_idle_cpumask_node(node);

Can we use plurals for cpumask variables - idle_cpus here?

> -static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +static s32 scx_pick_idle_cpu_from_node(int node, const struct cpumask *cpus_allowed, u64 flags)

Do we need the "_from_node" suffix?

>  {
>  	int cpu;
>  
>  retry:
>  	if (sched_smt_active()) {
> -		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> +		cpu = cpumask_any_and_distribute(get_idle_smtmask_node(node), cpus_allowed);

This too, would s/get_idle_smtmask_node(node)/idle_smtmask(node)/ work?
There are no node-unaware counterparts to these functions, right?

> +static s32
> +scx_pick_idle_cpu_numa(const struct cpumask *cpus_allowed, s32 prev_cpu, u64 flags)
> +{
> +	nodemask_t hop_nodes = NODE_MASK_NONE;
> +	int start_node = cpu_to_node(prev_cpu);
> +	s32 cpu = -EBUSY;
> +
> +	/*
> +	 * Traverse all online nodes in order of increasing distance,
> +	 * starting from prev_cpu's node.
> +	 */
> +	rcu_read_lock();

Is rcu_read_lock() necessary? Does lockdep warn if the explicit
rcu_read_lock() is dropped?

> @@ -3643,17 +3776,33 @@ static void set_cpus_allowed_scx(struct task_struct *p,
>  
>  static void reset_idle_masks(void)
>  {
> +	int node;
> +
> +	if (!static_branch_maybe(CONFIG_NUMA, &scx_builtin_idle_per_node)) {
> +		cpumask_copy(get_idle_cpumask_node(0), cpu_online_mask);
> +		cpumask_copy(get_idle_smtmask_node(0), cpu_online_mask);
> +		return;
> +	}
> +
>  	/*
>  	 * Consider all online cpus idle. Should converge to the actual state
>  	 * quickly.
>  	 */
> -	cpumask_copy(idle_masks.cpu, cpu_online_mask);
> -	cpumask_copy(idle_masks.smt, cpu_online_mask);
> +	for_each_node_state(node, N_POSSIBLE) {
> +		const struct cpumask *node_mask = cpumask_of_node(node);
> +		struct cpumask *idle_cpu = get_idle_cpumask_node(node);
> +		struct cpumask *idle_smt = get_idle_smtmask_node(node);
> +
> +		cpumask_and(idle_cpu, cpu_online_mask, node_mask);
> +		cpumask_copy(idle_smt, idle_cpu);

Can you do the same cpumask_and() for the smt mask here instead of copying?
I don't think it'll cause practical problems but idle_cpus can be updated in
between and e.g. we can end up with idle_smts that have different idle states
between siblings.
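
A single-threaded userspace model of that interleaving, as a rough sketch.
It assumes CPUs 0 and 1 are SMT siblings and uses a racing_clear argument to
stand in for another CPU clearing its idle bit between the two steps. All
names here are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-node pair; one bit per CPU. */
struct model_idle_pair {
	uint64_t cpu;
	uint64_t smt;
};

/* Variant from the quoted patch: smt is copied from the live cpu mask. */
void model_reset_copy(struct model_idle_pair *m, uint64_t online,
		      uint64_t node_cpus, uint64_t racing_clear)
{
	m->cpu = online & node_cpus;
	m->cpu &= ~racing_clear;	/* models a CPU going busy in between */
	m->smt = m->cpu;		/* inherits the half-updated state */
}

/* Suggested variant: both masks are derived from the same stable inputs. */
void model_reset_and(struct model_idle_pair *m, uint64_t online,
		     uint64_t node_cpus, uint64_t racing_clear)
{
	m->cpu = online & node_cpus;
	m->cpu &= ~racing_clear;	/* same interleaving as above */
	m->smt = online & node_cpus;	/* independent of the cpu mask */
}
```

With CPU 1 going busy in between, the copy variant leaves the smt mask with
CPU 0 set and its sibling CPU 1 clear, while the and variant keeps both
sibling bits in the same state.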

>  /**
>   * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> - * per-CPU cpumask.
> + * per-CPU cpumask of the current NUMA node.

This is a bit misleading as it can be system-wide too.

It's a bit confusing for scx_bpf_get_idle_cpu/smtmask() to return per-node
mask while scx_bpf_pick_idle_cpu() and friends are not scoped to the node.
Also, scx_bpf_pick_idle_cpu() picking the local node as the origin probably
doesn't make sense for most use cases as it's usually called from
ops.select_cpu() and the waker won't necessarily run on the same node as the
wakee.

Maybe disallow scx_bpf_get_idle_cpu/smtmask() if idle_per_node is enabled
and add scx_bpf_get_idle_cpu/smtmask_node()? Ditto for
scx_bpf_pick_idle_cpu() and we can add a PICK_IDLE flag to allow/inhibit
CPUs outside the specified node.

Thanks.

-- 
tejun