Message-ID: <2ce24cae-12fc-4a28-8396-c5b46a5f76d3@oracle.com>
Date: Tue, 22 Apr 2025 15:20:41 -0700
From: Libo Chen <libo.chen@...cle.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: kprateek.nayak@....com, raghavendra.kt@....com, tim.c.chen@...el.com,
        vineethr@...ux.ibm.com, chris.hyser@...cle.com,
        daniel.m.jordan@...cle.com, lorenzo.stoakes@...cle.com,
        mkoutny@...e.com, Dhaval.Giani@....com, cgroups@...r.kernel.org,
        linux-kernel@...r.kernel.org, mingo@...hat.com, mgorman@...e.de,
        vincent.guittot@...aro.org, rostedt@...dmis.org, llong@...hat.com,
        akpm@...ux-foundation.org, tj@...nel.org, juri.lelli@...hat.com,
        peterz@...radead.org, yu.chen.surf@...mail.com
Subject: Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to
 one NUMA node via cpuset.mems

Hi Yu,

On 4/19/25 04:16, Chen, Yu C wrote:
> Hi Libo,
> 
> On 4/18/2025 3:15 AM, Libo Chen wrote:
>> When the memory of the current task is pinned to one NUMA node via
>> cpuset, there is no point in continuing VMA scanning and NUMA-hinting
>> page faults, as they will just be overhead. With this change, there
>> will be no more unnecessary PTE updates or page faults in this
>> scenario.
>>
>> We have seen up to a 6x improvement on a typical Java workload running
>> on VMs with memory and CPU pinned to one NUMA node via cpuset in a
>> two-socket AARCH64 system. With the same pinning, on an
>> 18-cores-per-socket Intel platform, we have seen a 20% improvement in a
>> microbenchmark that creates a 30-vCPU selftest KVM guest with 4GB of
>> memory, where each vCPU reads 4KB pages in a fixed number of loops.
>>
>> Signed-off-by: Libo Chen <libo.chen@...cle.com>
> 
> I think this is a promising change: it lets us perform fine-grained
> NUMA balancing control on a per-cgroup basis rather than costly
> system-wide NUMA balancing for every task.
> 

Yes indeed, the cost, from what we have seen, can be quite astonishing.

>> ---
>>   kernel/sched/fair.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e43993a4e5807..c9903b1b39487 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>>       if (p->flags & PF_EXITING)
>>           return;
>>   +    /*
>> +     * Memory is pinned to only one NUMA node via cpuset.mems; naturally,
>> +     * no page can be migrated.
>> +     */
>> +    if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
>> +        return;
>> +
> 
> I found that you had a proposal in V1 to address Peter's concern[1]:
> allow the task to be migrated to its preferred node even if the task's
> memory policy is restricted to one node. In your previous proposal, the
> NUMA balance scanning was skipped only if the task's cpumask was bound
> to the same node as its memory policy node, because a cgroup usually
> binds its tasks and its memory allocation policy to the same node. Not
> sure if that could be turned into:
> 
> If the CPU mask of the task's memory policy node is a subset of the
> task's cpumask, the NUMA balance scan is allowed.
> 

I guess the fundamental question is: is this really worth it? Do the
benefits of NUMA task migrations alone outweigh the overheads of VMA
scanning, PTE updates, page faults, etc.? I suppose this is
workload-dependent, but what about the best-case scenario? I think we
probably need more data. Also, if we do that, we would need to do the
same for the other VMA-skipping scenarios (see the sketch below).
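
For reference, here is a rough sketch of the kind of per-VMA skip checks
already in task_numa_work(), paraphrased from mainline
kernel/sched/fair.c; the exact conditions and trace reasons vary by
kernel version:

    /* Paraphrased sketch, not the exact mainline code. */
    for_each_vma(vmi, vma) {
            /*
             * Skip VMAs unsuitable for NUMA hinting faults: not
             * migratable, no migrate-on-fault policy, hugetlb, or
             * VM_MIXEDMAP mappings.
             */
            if (!vma_migratable(vma) || !vma_policy_mof(vma) ||
                is_vm_hugetlb_page(vma) || (vma->vm_flags & VM_MIXEDMAP)) {
                    trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_UNSUITABLE);
                    continue;
            }

            /*
             * Skip read-only file mappings (e.g. shared library text),
             * which are expected to be cache-replicated rather than
             * migrated.
             */
            if (!vma->vm_mm ||
                (vma->vm_file &&
                 (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ))) {
                    trace_sched_skip_vma_numa(mm, vma, NUMAB_SKIP_SHARED_RO);
                    continue;
            }
            ...
    }

Any "scan anyway if migration is still possible" special case would have
to be reconciled with each of those checks as well.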

Thanks,
Libo 

> For example,
> Suppose p's memory is only allocated on node0, which contains CPU2 and
> CPU3.
> 1. If p's CPU affinity is CPU0, CPU1, there is no need to do NUMA
>    balancing scanning, because node0's CPUs (CPU2, CPU3) are not in p's
>    cpumask, so p can never run on its memory node.
> 2. If p's CPU affinity is CPU3, there is no need to do NUMA balancing
>    scanning: p is already pinned to its preferred node.
> 3. But if p's CPU affinity is CPU2, CPU3, CPU6, the NUMA balancing scan
>    should be allowed, because it is possible to migrate p from CPU6 to
>    either CPU2 or CPU3.
> 
> What I'm thinking of is something as follows (untested):
> 
> if (cpusets_enabled() &&
>     nodes_weight(cpuset_current_mems_allowed) == 1 &&
>     !cpumask_subset(cpumask_of_node(first_node(cpuset_current_mems_allowed)),
>                     p->cpus_ptr))
>     return;
> 
> 
> I tested your patch on top of the latest sched/core, binding the
> task's CPU affinity to Node1 and its memory allocation to Node1:
> echo "8-15" > /sys/fs/cgroup/mytest/cpuset.cpus
> echo "1" > /sys/fs/cgroup/mytest/cpuset.mems
> cgexec -g cpuset:mytest ./run-mmtests.sh --no-monitor --config config-numa skip_scan
> 
> And it works as expected:
> # bpftrace numa_trace.bt
> 
> @sched_skip_cpuset_numa: 133
> 
> 
> thanks,
> Chenyu
> 
> [1] https://lore.kernel.org/lkml/cde7af54-5481-499e-8a42-0111f555f2b1@oracle.com/
> 
>>       if (!mm->numa_next_scan) {
>>           mm->numa_next_scan = now +
>>               msecs_to_jiffies(sysctl_numa_balancing_scan_delay);

