Date:   Fri, 08 Apr 2022 17:52:08 +0100
From:   Valentin Schneider <vschneid@...hat.com>
To:     Huang Ying <ying.huang@...el.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     linux-kernel@...r.kernel.org, Huang Ying <ying.huang@...el.com>,
        Valentin Schneider <valentin.schneider@....com>,
        Ingo Molnar <mingo@...hat.com>, Mel Gorman <mgorman@...e.de>,
        Rik van Riel <riel@...riel.com>,
        Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
Subject: Re: [PATCH] sched,topology: Update sched topology atomically


Hi,

On 06/04/22 15:01, Huang Ying wrote:
> When Peter Zijlstra reviewed commit 0fb3978b0aac ("sched/numa: Fix
> NUMA topology for systems with CPU-less nodes") [1], he pointed out
> that sched_domains_numa_distance and sched_domains_numa_masks are made
> separate RCU variables.  That could go sideways if there were a
> function using both, although there isn't for now.
>
> So we update sched_domains_numa_distance and sched_domains_numa_masks
> and some other related sched topology parameters atomically to address
> the potential issues.
>
> [1] https://lkml.kernel.org/r/20220214121553.582248-1-ying.huang@intel.com
>

I had to re-acquaint myself with [1], but AFAICT this looks mostly
OK. I have some nits/comments below; I haven't actually tested this yet as
I'm still setting stuff up over here...

> Signed-off-by: "Huang, Ying" <ying.huang@...el.com>
> Suggested-by: Peter Zijlstra <peterz@...radead.org>
> Cc: Valentin Schneider <valentin.schneider@....com>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Rik van Riel <riel@...riel.com>
> Cc: Srikar Dronamraju <srikar@...ux.vnet.ibm.com>
> ---
>  kernel/sched/fair.c     |  47 ++++++---
>  kernel/sched/sched.h    |  14 ++-
>  kernel/sched/topology.c | 223 +++++++++++++++++++---------------------
>  3 files changed, 151 insertions(+), 133 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d4bd299d67ab..546e7347fafc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1294,17 +1294,22 @@ static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
>                                       int lim_dist, bool task)
>  {
>       unsigned long score = 0;
> -	int node, max_dist;
> +	int node, max_dist, type;
> +	struct sched_numa_topology *topology;
> +
> +	rcu_read_lock();
> +	topology = rcu_dereference(sched_numa_topology);
> +	type = topology->type;
> +	max_dist = topology->max_distance;
> +	rcu_read_unlock();
>

One thing worth pointing out: the NUMA topology type *wasn't* RCU-protected
previously, which I think we've been wanting since

  0083242c9375 ("sched/topology: Skip updating masks for non-online nodes")

(before then the topology was init'd once-and-for-all).

Another thing: this pattern is now repeated a handful of times in fair.c;
what about bundling it up? As a bonus, it lets us keep the definition of
struct sched_numa_topology internal to topology.c:

void sched_fetch_numa_topology(int *type, int *max_distance)
{
        struct sched_numa_topology *topology;

        rcu_read_lock();
        topology = rcu_dereference(sched_numa_topology);

        if (type)
                *type = topology->type;

        if (max_distance)
                *max_distance = topology->max_distance;

        rcu_read_unlock();
}
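
Call sites in fair.c would then boil down to something like this (just a
sketch of how e.g. score_nearby_nodes() could use the helper):

        int type, max_dist;

        /* replaces the open-coded rcu_read_lock()/rcu_dereference() block */
        sched_fetch_numa_topology(&type, &max_dist);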

> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 810750e62118..19523b23034f 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1483,14 +1483,10 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
>  }
>
>  #ifdef CONFIG_NUMA
> -enum numa_topology_type sched_numa_topology_type;
> +struct sched_numa_topology	default_numa_topology;

Shouldn't this be static?

> +struct sched_numa_topology	*sched_numa_topology = &default_numa_topology;

This should also be __rcu (sparse should complain about this)
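
Something like the below should keep sparse happy for the declaration too
(a sketch, assuming RCU_INITIALIZER() is acceptable for the static
initialisation here):

struct sched_numa_topology __rcu *sched_numa_topology =
        RCU_INITIALIZER(&default_numa_topology);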

>
> -static int			sched_domains_numa_levels;
>  static int			sched_domains_curr_level;
> -
> -int				sched_max_numa_distance;
> -static int			*sched_domains_numa_distance;
> -static struct cpumask		***sched_domains_numa_masks;
>  #endif
>
>  /*

> +static void sched_numa_topology_switch(struct sched_numa_topology *topology,
> +				       struct sched_domain_topology_level *tl)
> +{
> +	struct sched_numa_topology *old_topology = sched_numa_topology;
> +	struct sched_domain_topology_level *old_tl = sched_domain_topology;
> +
> +	rcu_assign_pointer(sched_numa_topology, topology);
> +	sched_domain_topology = tl;
> +
> +	if (old_topology == &default_numa_topology)
> +		return;
> +
> +	init_rcu_head(&old_topology->rcu);

I'm not familiar with this and can only find a handful of users; should we
be doing that for struct sched_domain as well?

> +	call_rcu(&old_topology->rcu, sched_numa_topology_free_rcu);
> +	kfree(old_tl);
> +}
>
>  #define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
>
>  void sched_init_numa(int offline_node)
>  {
>       struct sched_domain_topology_level *tl;
> -	unsigned long *distance_map;
> +	unsigned long *distance_map = NULL;
>       int nr_levels = 0;
>       int i, j;
> -	int *distances;
> -	struct cpumask ***masks;
> +	struct sched_numa_topology *topology;
> +
> +	topology = kzalloc(sizeof(*topology), GFP_KERNEL);
> +	if (!topology)
> +		goto err;
>
>       /*
>        * O(nr_nodes^2) deduplicating selection sort -- in order to find the

> +	return;
> +err:

I was expecting the "usual" multi-label setup, but this seems to be working
too - though not in the most obvious way ;)
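
For reference, the "usual" setup I had in mind would be something along
these lines (hypothetical label names, just to illustrate; each label frees
one resource and falls through to the next):

        /* ... */
        return;

err_free_map:
        bitmap_free(distance_map);
err_free_topology:
        sched_numa_topology_free(topology);
err:
        sched_numa_topology_switch(&default_numa_topology,
                                   sched_domain_topology_default);
}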

> +	bitmap_free(distance_map);
> +	sched_numa_topology_free(topology);
> +	sched_numa_topology_switch(&default_numa_topology,
> +				   sched_domain_topology_default);
>  }
>
>  /*
> @@ -1981,7 +1977,6 @@ void sched_update_numa(int cpu, bool online)
>       if (cpumask_weight(cpumask_of_node(node)) != 1)
>               return;
>
> -	sched_reset_numa();
>       sched_init_numa(online ? NUMA_NO_NODE : node);
>  }
>
> @@ -1990,14 +1985,15 @@ void sched_domains_numa_masks_set(unsigned int cpu)
>       int node = cpu_to_node(cpu);
>       int i, j;
>
> -	for (i = 0; i < sched_domains_numa_levels; i++) {
> +	for (i = 0; i < sched_numa_topology->nr_levels; i++) {
>               for (j = 0; j < nr_node_ids; j++) {
>                       if (!node_state(j, N_CPU))
>                               continue;
>
>                       /* Set ourselves in the remote node's masks */
> -			if (node_distance(j, node) <= sched_domains_numa_distance[i])
> -				cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
> +			if (node_distance(j, node) <=
> +			    sched_numa_topology->distances[i])
> +				cpumask_set_cpu(cpu, sched_numa_topology->masks[i][j]);

I'm guessing the assumption here is that this is done in
sched_cpu_{de,}activate(), so we have the hotplug lock and so the masks are
stable... Should we sprinkle in a lockdep_assert_cpus_held() for good
measure?

Having a second look at this, I think the same applies to your change to
sd_numa_mask() - those end up being used in the domain rebuild, which itself
might be deferred to cpuset_hotplug_work, but that should still hold the
hotplug lock...
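
i.e. something like this, as a sketch (the same assertion would arguably
also fit in sched_domains_numa_masks_clear() and sd_numa_mask()):

void sched_domains_numa_masks_set(unsigned int cpu)
{
        int node = cpu_to_node(cpu);
        int i, j;

        /* Masks are only stable because the caller holds the hotplug lock */
        lockdep_assert_cpus_held();
        /* ... */
}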

>               }
>       }
>  }
> @@ -2006,10 +2002,10 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
>  {
>       int i, j;
>
> -	for (i = 0; i < sched_domains_numa_levels; i++) {
> +	for (i = 0; i < sched_numa_topology->nr_levels; i++) {
>               for (j = 0; j < nr_node_ids; j++) {
> -			if (sched_domains_numa_masks[i][j])
> -				cpumask_clear_cpu(cpu, sched_domains_numa_masks[i][j]);
> +			if (sched_numa_topology->masks[i][j])
> +				cpumask_clear_cpu(cpu, sched_numa_topology->masks[i][j]);
>               }
>       }
>  }
> @@ -2025,22 +2021,19 @@ void sched_domains_numa_masks_clear(unsigned int cpu)
>  int sched_numa_find_closest(const struct cpumask *cpus, int cpu)
>  {
>       int i, j = cpu_to_node(cpu), found = nr_cpu_ids;
> -	struct cpumask ***masks;
> +	struct sched_numa_topology *topology;
>
>       rcu_read_lock();
> -	masks = rcu_dereference(sched_domains_numa_masks);
> -	if (!masks)
> -		goto unlock;
> -	for (i = 0; i < sched_domains_numa_levels; i++) {
> -		if (!masks[i][j])
> +	topology = rcu_dereference(sched_numa_topology);
> +	for (i = 0; i < topology->nr_levels; i++) {
> +		if (!topology->masks[i][j])

If I got default_numa_topology right and it's meant to be static,
sched_numa_topology->masks can still be NULL, but here this would be gated
by nr_levels being 0... Did I get that right?

>                       break;
> -		cpu = cpumask_any_and(cpus, masks[i][j]);
> +		cpu = cpumask_any_and(cpus, topology->masks[i][j]);
>               if (cpu < nr_cpu_ids) {
>                       found = cpu;
>                       break;
>               }
>       }
> -unlock:
>       rcu_read_unlock();
>
>       return found;
> --
> 2.30.2
