Message-ID: <244cb537-7d43-4795-9cb6-fc10234c68a1@intel.com>
Date: Sat, 19 Apr 2025 19:16:38 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Libo Chen <libo.chen@...cle.com>
CC: <kprateek.nayak@....com>, <raghavendra.kt@....com>,
<tim.c.chen@...el.com>, <vineethr@...ux.ibm.com>, <chris.hyser@...cle.com>,
<daniel.m.jordan@...cle.com>, <lorenzo.stoakes@...cle.com>,
<mkoutny@...e.com>, <Dhaval.Giani@....com>, <cgroups@...r.kernel.org>,
<linux-kernel@...r.kernel.org>, <mingo@...hat.com>, <mgorman@...e.de>,
<vincent.guittot@...aro.org>, <rostedt@...dmis.org>, <llong@...hat.com>,
<akpm@...ux-foundation.org>, <tj@...nel.org>, <juri.lelli@...hat.com>,
<peterz@...radead.org>, <yu.chen.surf@...mail.com>
Subject: Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to
one NUMA node via cpuset.mems
Hi Libo,
On 4/18/2025 3:15 AM, Libo Chen wrote:
> When the memory of the current task is pinned to one NUMA node by cgroup,
> there is no point in continuing the rest of VMA scanning and hinting page
> faults as they will just be overhead. With this change, there will be no
> more unnecessary PTE updates or page faults in this scenario.
>
> We have seen up to a 6x improvement on a typical Java workload running on
> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
> platform, we have seen a 20% improvement in a microbenchmark that creates a
> 30-vCPU selftest KVM guest with 4GB of memory, where each vCPU reads 4KB
> pages in a fixed number of loops.
>
> Signed-off-by: Libo Chen <libo.chen@...cle.com>
I think this is a promising change: it enables fine-grained NUMA
balancing control on a per-cgroup basis, rather than the costly
system-wide NUMA balancing for every task.
> ---
> kernel/sched/fair.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e5807..c9903b1b39487 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>  	if (p->flags & PF_EXITING)
>  		return;
>  
> +	/*
> +	 * Memory is pinned to only one NUMA node via cpuset.mems, naturally
> +	 * no page can be migrated.
> +	 */
> +	if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
> +		return;
> +
I found that you had a proposal in V1 to address Peter's concern[1]:
allow the task to be migrated to its preferred node even if the task's
memory policy is restricted to one node. In that proposal, the NUMA
balance scanning is skipped only when the task's cpumask is bound to
the same node as its memory policy node, because a cgroup usually binds
its tasks and its memory allocation policy to the same node. I wonder
if that could be turned into:
If the CPU mask of the task's memory policy node is a subset of the
task's cpumask, the NUMA balance scan is allowed.
For example, suppose p's memory is only allocated on node0, which
contains CPU2 and CPU3:
1. If p's CPU affinity is CPU0 and CPU1, there is no need to do NUMA
balancing scanning, because node0's CPUs (CPU2 and CPU3) are not in
p's cpumask, so p can never run on the node holding its memory.
2. If p's CPU affinity is CPU3 only, there is also no need to do NUMA
balancing scanning: p is already on its preferred node.
3. But if p's CPU affinity is CPU2, CPU3 and CPU6, the NUMA balancing
scan should be allowed, because it is possible to migrate p from CPU6
to either CPU2 or CPU3.
What I'm thinking of is something as follows (untested):

	if (cpusets_enabled() &&
	    nodes_weight(cpuset_current_mems_allowed) == 1 &&
	    !cpumask_subset(cpumask_of_node(first_node(cpuset_current_mems_allowed)),
			    p->cpus_ptr))
		return;

Since cpumask_of_node() takes a node id rather than a nodemask, and
mems_allowed has exactly one node here, first_node() is used to pick
that node.
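If it reads better, the check could also be factored into a small
helper (equally untested; the helper name is just illustrative):

	static bool cpuset_numa_scan_pointless(struct task_struct *p)
	{
		int node;

		/* Without cpusets, mems_allowed restricts nothing. */
		if (!cpusets_enabled())
			return false;

		/* Only relevant when cpuset.mems pins memory to one node. */
		if (nodes_weight(cpuset_current_mems_allowed) != 1)
			return false;

		node = first_node(cpuset_current_mems_allowed);

		/*
		 * If all CPUs of the memory node are allowed for p, the
		 * scan can still help migrate p toward that node (example
		 * 3 above); otherwise the scan is pure overhead.
		 */
		return !cpumask_subset(cpumask_of_node(node), p->cpus_ptr);
	}

so that task_numa_work() only needs:

	if (cpuset_numa_scan_pointless(p))
		return;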
I tested your patch on top of the latest sched/core, binding the
task's CPU affinity to Node1 and its memory allocation to Node1:
echo "8-15" > /sys/fs/cgroup/mytest/cpuset.cpus
echo "1" > /sys/fs/cgroup/mytest/cpuset.mems
cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor --config
config-numa skip_scan
And it works as expected:
# bpftrace numa_trace.bt
@sched_skip_cpuset_numa: 133
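For reference, numa_trace.bt here is essentially a one-line counter (a
sketch; it assumes the sched_skip_cpuset_numa tracepoint introduced in
patch 2/2 of this series):

	tracepoint:sched:sched_skip_cpuset_numa
	{
		@sched_skip_cpuset_numa = count();
	}

bpftrace prints the map on exit, which gives the count above.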
thanks,
Chenyu
[1] https://lore.kernel.org/lkml/cde7af54-5481-499e-8a42-0111f555f2b1@oracle.com/
>  	if (!mm->numa_next_scan) {
>  		mm->numa_next_scan = now +
>  			msecs_to_jiffies(sysctl_numa_balancing_scan_delay);