Message-ID: <244cb537-7d43-4795-9cb6-fc10234c68a1@intel.com>
Date: Sat, 19 Apr 2025 19:16:38 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Libo Chen <libo.chen@...cle.com>
CC: <kprateek.nayak@....com>, <raghavendra.kt@....com>,
<tim.c.chen@...el.com>, <vineethr@...ux.ibm.com>, <chris.hyser@...cle.com>,
<daniel.m.jordan@...cle.com>, <lorenzo.stoakes@...cle.com>,
<mkoutny@...e.com>, <Dhaval.Giani@....com>, <cgroups@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <mingo@...hat.com>, <mgorman@...e.de>,
<vincent.guittot@...aro.org>, <rostedt@...dmis.org>, <llong@...hat.com>,
<akpm@...ux-foundation.org>, <tj@...nel.org>, <juri.lelli@...hat.com>,
<peterz@...radead.org>, <yu.chen.surf@...mail.com>
Subject: Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems
Hi Libo,
On 4/18/2025 3:15 AM, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
>
> We have seen up to a 6x improvement on a typical Java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
> platform, we have seen a 20% improvement in a microbenchmark that creates a
> 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
>
> Signed-off-by: Libo Chen <libo.chen@...cle.com>
I think this is a promising change: it enables fine-grained NUMA
balancing control on a per-cgroup basis, rather than the costly
system-wide NUMA balancing for every task.
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807..c9903b1b39487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>  	if (p->flags & PF_EXITING)
>  		return;
>  
> +	/*
> +	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
> +	 * no page can be migrated.
> +	 */
> +	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
> +		return;
> +
I found that you had a proposal in V1 to address Peter's concern[1]:
allow the task to be migrated to its preferred node even if the task's
memory policy is restricted to one node. In that proposal, the NUMA
balance scanning is skipped only when the task's cpumask is bound to
the same node as its memory policy node, because a cgroup usually binds
its tasks and its memory allocation policy to the same node. I wonder
if that could be turned into:
If the CPU mask of the task's memory policy node is a subset of the
task's cpumask, the NUMA balance scan is allowed.
For example, suppose p's memory is only allocated on node0, which
contains CPU2 and CPU3:
1. If p's CPU affinity is CPU0 and CPU1, there is no need to do NUMA
balancing scanning, because node0's CPUs (CPU2 and CPU3) are not in
p's cpumask, so p can never run on the node holding its memory.
2. If p's CPU affinity is CPU3 only, there is also no need to do NUMA
balancing scanning: p is already on its preferred node.
3. But if p's CPU affinity is CPU2, CPU3 and CPU6, the NUMA balancing
scan should be allowed, because it is possible to migrate p from CPU6
to either CPU2 or CPU3.
What I'm thinking of is something as follows (untested):

	if (cpusets_enabled() &&
	    nodes_weight(cpuset_current_mems_allowed) == 1 &&
	    !cpumask_subset(cpumask_of_node(first_node(cpuset_current_mems_allowed)),
			    p->cpus_ptr))
		return;

Since cpumask_of_node() takes a node id rather than a nodemask, and
mems_allowed has exactly one node here, first_node() is used to pick
that node.
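If it reads better, the check could also be factored into a small
helper (equally untested; the helper name is just illustrative):

	static bool cpuset_numa_scan_pointless(struct task_struct *p)
	{
		int node;

		/* Without cpusets, mems_allowed restricts nothing. */
		if (!cpusets_enabled())
			return false;

		/* Only relevant when cpuset.mems pins memory to one node. */
		if (nodes_weight(cpuset_current_mems_allowed) != 1)
			return false;

		node = first_node(cpuset_current_mems_allowed);

		/*
		 * If all CPUs of the memory node are allowed for p, the
		 * scan can still help migrate p toward that node (example
		 * 3 above); otherwise the scan is pure overhead.
		 */
		return !cpumask_subset(cpumask_of_node(node), p->cpus_ptr);
	}

so that task_numa_work() only needs:

	if (cpuset_numa_scan_pointless(p))
		return;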
I tested your patch on top of the latest sched/core, binding the
task's CPU affinity to Node1 and its memory allocation to Node1:
echo "8-15" > /sys/fs/cgroup/mytest/cpuset.cpus
echo "1" > /sys/fs/cgroup/mytest/cpuset.mems
cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor --config
config-numa skip_scan
And it works as expected:
# bpftrace numa_trace.bt
@sched_skip_cpuset_numa: 133
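For reference, numa_trace.bt here is essentially a one-line counter (a
sketch; it assumes the sched_skip_cpuset_numa tracepoint introduced in
patch 2/2 of this series):

	tracepoint:sched:sched_skip_cpuset_numa
	{
		@sched_skip_cpuset_numa = count();
	}

bpftrace prints the map on exit, which gives the count above.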
thanks,
Chenyu
[1] https://lore.kernel.org/lkml/cde7af54-5481-499e-8a42-0111f555f2b1@oracle.com/
>  	if (!mm->numa_next_scan) {
>  		mm->numa_next_scan = now +
>  			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);