linux-kernel - Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to one NUMA node via cpuset.mems


Message-ID: <b3661d40-536e-4a07-8872-7c5ae5e1166e@intel.com>
Date: Wed, 23 Apr 2025 07:27:48 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Libo Chen <libo.chen@...cle.com>
CC: <kprateek.nayak@....com>, <raghavendra.kt@....com>,
	<tim.c.chen@...el.com>, <vineethr@...ux.ibm.com>, <chris.hyser@...cle.com>,
	<daniel.m.jordan@...cle.com>, <lorenzo.stoakes@...cle.com>,
	<mkoutny@...e.com>, <Dhaval.Giani@....com>, <cgroups@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <mingo@...hat.com>, <mgorman@...e.de>,
	<vincent.guittot@...aro.org>, <rostedt@...dmis.org>, <llong@...hat.com>,
	<akpm@...ux-foundation.org>, <tj@...nel.org>, <juri.lelli@...hat.com>,
	<peterz@...radead.org>, <yu.chen.surf@...mail.com>
Subject: Re: [PATCH v3 1/2] sched/numa: Skip VMA scanning on memory pinned to
 one NUMA node via cpuset.mems

On 4/23/2025 6:20 AM, Libo Chen wrote:
> Hi Yu
> 
> On 4/19/25 04:16, Chen, Yu C wrote:
>> Hi Libo,
>>
>> On 4/18/2025 3:15 AM, Libo Chen wrote:
>>> When the memory of the current task is pinned to one NUMA node by cgroup,
>>> there is no point in continuing the rest of VMA scanning and hinting page
>>> faults as they will just be overhead. With this change, there will be no
>>> more unnecessary PTE updates or page faults in this scenario.
>>>
>>> We have seen up to a 6x improvement on a typical java workload running on
>>> VMs with memory and CPU pinned to one NUMA node via cpuset in a two-socket
>>> AARCH64 system. With the same pinning, on an 18-cores-per-socket Intel
>>> platform, we have seen a 20% improvement in a microbench that creates a
>>> 30-vCPU selftest KVM guest with 4GB memory, where each vCPU reads 4KB
>>> pages in a fixed number of loops.
>>>
>>> Signed-off-by: Libo Chen <libo.chen@...cle.com>
>>
>> I think this is a promising change: it lets us perform fine-grained NUMA
>> balancing control on a per-cgroup basis rather than system-wide NUMA
>> balancing for every task, which is costly.
>>
> 
> Yes indeed, the cost, from what we have seen, can be quite astonishing.
> 
>>> ---
>>>    kernel/sched/fair.c | 7 +++++++
>>>    1 file changed, 7 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e43993a4e5807..c9903b1b39487 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3329,6 +3329,13 @@ static void task_numa_work(struct callback_head *work)
>>>        if (p->flags & PF_EXITING)
>>>            return;
>>>    +    /*
>>> +     * Memory is pinned to only one NUMA node via cpuset.mems, naturally
>>> +     * no page can be migrated.
>>> +     */
>>> +    if (cpusets_enabled() && nodes_weight(cpuset_current_mems_allowed) == 1)
>>> +        return;
>>> +
>>
>> I found that you had a proposal in V1 to address Peter's concern[1]:
>> Allow the task to be migrated to its preferred Node, even if the task's
>> memory policy is restricted to 1 Node. In your previous proposal, the
>> NUMA balance scanning is skipped only if the task's cpumask is bound to
>> the same Node as its memory policy node, because a cgroup usually binds
>> its tasks and memory allocation policy to the same node. Not sure if
>> that could be turned into:
>>
>> If the task's memory policy node's CPU mask is a subset of the task's cpumask, the NUMA balance scan is allowed.
>>
> 
> I guess the fundamental question is: is this really worth it? Do the
> benefits of NUMA task migration alone outweigh the overhead of VMA
> scanning, PTE updates, page faults, etc.? I suppose this is
> workload-dependent, but what about the best-case scenario? I think we
> probably need more data. Also, if we do that, we need to do the same for
> the other VMA-skipping scenarios.
> 
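
For reference, the two skip conditions discussed above can be modelled in
plain userspace C. This is only a sketch under stated assumptions, not
kernel code: mask_t, mask_weight(), mask_subset(), skip_scan_v3() and
skip_scan_v1_style() are hypothetical names, masks are simple 64-bit
bitmaps (one bit per node or per CPU), and the V1-style variant follows
only the description quoted above, not the actual V1 patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per NUMA node (or per CPU). */
typedef uint64_t mask_t;

/* Number of set bits, standing in for nodes_weight(). */
static int mask_weight(mask_t m)
{
    int n = 0;

    while (m) {
        m &= m - 1;   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* True if every bit set in a is also set in b. */
static bool mask_subset(mask_t a, mask_t b)
{
    return (a & ~b) == 0;
}

/*
 * Condition in this patch: skip scanning when cpusets are enabled and
 * the allowed-memory mask contains exactly one node.
 */
static bool skip_scan_v3(bool cpusets_on, mask_t mems_allowed)
{
    return cpusets_on && mask_weight(mems_allowed) == 1;
}

/*
 * V1-style condition as described in the thread (assumption): skip only
 * when memory is pinned to one node AND the task's CPUs are all on that
 * node, so task migration could not help either.
 * node_cpus is the CPU mask of the single allowed memory node.
 */
static bool skip_scan_v1_style(bool cpusets_on, mask_t mems_allowed,
                               mask_t task_cpus, mask_t node_cpus)
{
    return skip_scan_v3(cpusets_on, mems_allowed) &&
           mask_subset(task_cpus, node_cpus);
}
```

With memory pinned to node 0 (mems_allowed = 0x1, node_cpus = 0x0f), a
task allowed on CPUs 0x03 is skipped by both checks, while a task allowed
on CPUs 0xff is skipped by the first check only.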

Overall, that can be future work. I agree that for now this patch is
simple enough, so feel free to add:

Tested-by: Chen Yu <yu.c.chen@...el.com>

thanks,
Chenyu