linux-kernel - Re: [PATCH v3] sched/fair: Cache NUMA node statistics to avoid O(N) scanning

Message-ID: <a269c33c-eaa3-4a06-aa27-062273e2e1c4@amd.com>
Date: Tue, 27 Jan 2026 08:55:07 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Qiliang Yuan <realwujing@...il.com>
CC: <bsegall@...gle.com>, <dietmar.eggemann@....com>, <juri.lelli@...hat.com>,
	<linux-kernel@...r.kernel.org>, <mgorman@...e.de>, <mingo@...hat.com>,
	<peterz@...radead.org>, <rostedt@...dmis.org>, <vincent.guittot@...aro.org>,
	<vschneid@...hat.com>, <yuanql9@...natelecom.cn>
Subject: Re: [PATCH v3] sched/fair: Cache NUMA node statistics to avoid O(N)
 scanning

Hello Qiliang,

On 1/26/2026 4:32 PM, Qiliang Yuan wrote:
> Optimize update_numa_stats() by leveraging pre-calculated node
> statistics cached during the load balancing process. This reduces the
> complexity of NUMA balancing overhead from O(CPUs_per_node) to O(1)
> when statistics for the source node are fresh.
> 
> Signed-off-by: Qiliang Yuan <realwujing@...il.com>
> Signed-off-by: Qiliang Yuan <yuanql9@...natelecom.cn>
> ---

Missing a changelog and the performance numbers that justify this
change.

>  kernel/sched/fair.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 44 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e71302282671..070b61f65b6d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2094,6 +2094,17 @@ static inline int numa_idle_core(int idle_core, int cpu)
>   * borrows code and logic from update_sg_lb_stats but sharing a
>   * common implementation is impractical.
>   */
> +struct numa_stats_cache {
> +	unsigned long load;
> +	unsigned long runnable;
> +	unsigned long util;
> +	unsigned long nr_running;
> +	unsigned long capacity;
> +	unsigned long last_update;
> +};
> +
> +static struct numa_stats_cache node_stats_cache[MAX_NUMNODES];

MAX_NUMNODES is a very large value. Why do you need to have this
all up front instead of dynamically allocating it during sched domain
build?
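
To make the size concern concrete, here is a minimal userspace sketch,
not kernel code. EXAMPLE_MAX_NUMNODES (1024) is an assumed value for a
config with a large node shift, and alloc_node_cache() is a
hypothetical helper; in the kernel the equivalent would be an
allocation sized by the number of possible nodes, done when the sched
domains are built.

```c
#include <stdlib.h>

/* Same layout as the struct in the quoted patch. */
struct numa_stats_cache {
	unsigned long load;
	unsigned long runnable;
	unsigned long util;
	unsigned long nr_running;
	unsigned long capacity;
	unsigned long last_update;
};

/* Assumption for illustration: stand-in for a large MAX_NUMNODES. */
#define EXAMPLE_MAX_NUMNODES 1024

/* Bytes reserved by a static array, regardless of the real node count. */
size_t static_footprint(void)
{
	return EXAMPLE_MAX_NUMNODES * sizeof(struct numa_stats_cache);
}

/*
 * Hypothetical helper: allocate one zeroed entry per node actually
 * present. Returns NULL if nr_nodes is 0 or the allocation fails.
 */
struct numa_stats_cache *alloc_node_cache(unsigned int nr_nodes)
{
	if (!nr_nodes)
		return NULL;

	return calloc(nr_nodes, sizeof(struct numa_stats_cache));
}
```

With these assumed numbers a two-node machine needs two entries, while
the static array reserves 1024 of them.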

Speaking of sched domains, partitioning the system can split the
NUMA domain across multiple partitions, which makes these numbers
partition specific. Tasks running in one partition cannot use the
cached values from another partition.

If there is really a noticeable benefit, I would suggest using
the previous method to cache it somewhere in the sched domain
hierarchy - but only if there is a noticeable benefit.

> +
>  static void update_numa_stats(struct task_numa_env *env,
>  			      struct numa_stats *ns, int nid,
>  			      bool find_idle)
> @@ -2104,6 +2115,24 @@ static void update_numa_stats(struct task_numa_env *env,
>  	ns->idle_cpu = -1;
>  
>  	rcu_read_lock();
> +	/*
> +	 * Algorithmic Optimization: Avoid O(N) scan by using cached stats.
> +	 * Only applicable for the source node where we don't need to find
> +	 * an idle CPU.
> +	 */
> +	if (!find_idle && nid == env->src_nid) {
> +		struct numa_stats_cache *cache = &node_stats_cache[nid];
> +
> +		if (time_before(jiffies, cache->last_update + msecs_to_jiffies(10))) {
> +			ns->load = READ_ONCE(cache->load);
> +			ns->runnable = READ_ONCE(cache->runnable);
> +			ns->util = READ_ONCE(cache->util);
> +			ns->nr_running = READ_ONCE(cache->nr_running);
> +			ns->compute_capacity = READ_ONCE(cache->capacity);

So READ_ONCE()/WRITE_ONCE() don't solve the issue I was highlighting
in the last version. Say the following happens:

    CPU0                                            CPU1
    ====                                            ====

  update_numa_stats()
    /* Working on current numa_stats_cache */
    ns->load = READ_ONCE(cache->load);
    ns->runnable = READ_ONCE(cache->runnable);
    ... interrupted                               update_sg_lb_stats()
    ...                                           ... updates the entire numa_stats_cache
    ...
    ns->util = READ_ONCE(cache->util); /* Sees new data. */


Can this cause an issue? If not, please highlight in the commit log why
it is not an issue. There can be cases where we see util > capacity,
util > runnable, etc. which might lead to incorrect calculations later
on.
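
For illustration, one possible way to get a consistent snapshot is a
sequence-counter scheme, similar in spirit to the kernel's seqcount
primitives (read_seqcount_begin() / read_seqcount_retry()). Below is a
minimal userspace sketch using C11 atomics, not a proposed patch. It
assumes a single writer at a time, and the names node_stats,
node_stats_cache, cache_write() and cache_try_read() are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Plain snapshot handed to the caller. */
struct node_stats {
	unsigned long load;
	unsigned long runnable;
	unsigned long util;
	unsigned long capacity;
};

/* Shared cache; seq is odd while a writer is in the middle of an update. */
struct node_stats_cache {
	atomic_uint seq;
	_Atomic unsigned long load;
	_Atomic unsigned long runnable;
	_Atomic unsigned long util;
	_Atomic unsigned long capacity;
};

/* Writer side. Assumption: callers serialize writers externally. */
void cache_write(struct node_stats_cache *c, const struct node_stats *s)
{
	unsigned int seq = atomic_load_explicit(&c->seq, memory_order_relaxed);

	/* Mark the update as in progress (odd value). */
	atomic_store_explicit(&c->seq, seq + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	atomic_store_explicit(&c->load, s->load, memory_order_relaxed);
	atomic_store_explicit(&c->runnable, s->runnable, memory_order_relaxed);
	atomic_store_explicit(&c->util, s->util, memory_order_relaxed);
	atomic_store_explicit(&c->capacity, s->capacity, memory_order_relaxed);

	/* Publish: even value, ordered after the data stores. */
	atomic_store_explicit(&c->seq, seq + 2, memory_order_release);
}

/*
 * Reader side. Returns true and fills *out only if all fields belong
 * to the same update; on false the caller falls back to a full scan.
 */
bool cache_try_read(struct node_stats_cache *c, struct node_stats *out)
{
	struct node_stats tmp;
	unsigned int begin;

	begin = atomic_load_explicit(&c->seq, memory_order_acquire);
	if (begin & 1)
		return false;	/* writer is mid-update */

	tmp.load = atomic_load_explicit(&c->load, memory_order_relaxed);
	tmp.runnable = atomic_load_explicit(&c->runnable, memory_order_relaxed);
	tmp.util = atomic_load_explicit(&c->util, memory_order_relaxed);
	tmp.capacity = atomic_load_explicit(&c->capacity, memory_order_relaxed);

	/* Order the data loads before re-checking the counter. */
	atomic_thread_fence(memory_order_acquire);
	if (atomic_load_explicit(&c->seq, memory_order_relaxed) != begin)
		return false;	/* a writer ran in between: torn snapshot */

	*out = tmp;
	return true;
}
```

In the interleaving above, the reader on CPU0 would see the counter
change after CPU1's update and discard the mixed values instead of
using them.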

> +			goto skip_scan;
> +		}
> +	}
> +
>  	for_each_cpu(cpu, cpumask_of_node(nid)) {
>  		struct rq *rq = cpu_rq(cpu);
>  
> @@ -2124,6 +2153,8 @@ static void update_numa_stats(struct task_numa_env *env,
>  			idle_core = numa_idle_core(idle_core, cpu);
>  		}
>  	}
> +
> +skip_scan:
>  	rcu_read_unlock();
>  
>  	ns->weight = cpumask_weight(cpumask_of_node(nid));
> @@ -10488,6 +10519,19 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  	if (sgs->group_type == group_overloaded)
>  		sgs->avg_load = (sgs->group_load * SCHED_CAPACITY_SCALE) /
>  				sgs->group_capacity;
> +
> +	/* Algorithmic Optimization: Cache node stats for O(1) NUMA lookups */
> +	if (env->sd->flags & SD_NUMA) {

Also you'll need to think about partitions.

-- 
Thanks and Regards,
Prateek