Message-ID: <790f44b4-2cad-469f-adcb-aa1cb31ff802@bytedance.com>
Date: Wed, 24 Jul 2024 20:11:52 +0800
From: Abel Wu <wuyun.abel@...edance.com>
To: Zhongkun He <hezhongkun.hzk@...edance.com>
Cc: peterz@...radead.org, mgorman@...e.de, ying.huang@...el.com,
 linux-mm@...ck.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH v1] mm/numa_balancing: Fix the memory thrashing problem in
 the single-threaded process

On 7/24/24 11:55 AM, Zhongkun He Wrote:
> On Tue, Jul 23, 2024 at 9:39 PM Abel Wu <wuyun.abel@...edance.com> wrote:
>>
>> Hi Zhongkun,
>>
>> On 7/23/24 1:32 PM, Zhongkun He Wrote:
>>> I found a problem in my test machine that the memory of a process is
>>> repeatedly migrated between two nodes and does not stop.
>>>
>>> 1. Test steps and the machine.
>>> ------------
>>> VM machine: 4 NUMA nodes with 10GB per node.
>>>
>>> stress --vm 1 --vm-bytes 12g --vm-keep
>>>
>>> The info of numa stat:
>>> while :;do cat memory.numa_stat | grep -w anon;sleep 5;done
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>
>> I am curious what exactly made the worker's memory migrate to N3?
>> And later...
> 
> The maximum capacity of each node is 10GB, but the process requires
> 12GB, so there are always about 2GB on another node. With the commit
> mentioned below, page faults are only generated on remote nodes, never
> on the local one, so we keep migrating pages to the remote node because
> p->numa_preferred_nid is always that node.

Ahh sorry, I didn't notice the size of the node...

> 
>>
>>> anon N0=98304 N1=0 N2=10250747904 N3=2634334208
>>> anon N0=98304 N1=0 N2=9937256448 N3=2947825664
>>> anon N0=98304 N1=0 N2=8863514624 N3=4021567488
>>> anon N0=98304 N1=0 N2=7789772800 N3=5095309312
>>> anon N0=98304 N1=0 N2=6716030976 N3=6169051136
>>> anon N0=98304 N1=0 N2=5642289152 N3=7242792960
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=4837007360 N3=8048074752
>>> anon N0=98304 N1=0 N2=3763265536 N3=9121816576
>>> anon N0=98304 N1=0 N2=2689523712 N3=10195558400
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>> anon N0=98304 N1=0 N2=2515148800 N3=10369933312
>>
>> ... why was it moved back to N2?
> 
> The private page faults on N2 outnumber those on N3.
> 
>>
>>> anon N0=98304 N1=0 N2=3320455168 N3=9564626944
>>> anon N0=98304 N1=0 N2=4394196992 N3=8490885120
>>> anon N0=98304 N1=0 N2=5105442816 N3=7779639296
>>> anon N0=98304 N1=0 N2=6174195712 N3=6710886400
>>> anon N0=98304 N1=0 N2=7247937536 N3=5637144576
>>> anon N0=98304 N1=0 N2=8321679360 N3=4563402752
>>> anon N0=98304 N1=0 N2=9395421184 N3=3489660928
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>> anon N0=98304 N1=0 N2=10247872512 N3=2637209600
>>>
>>> 2. Root cause:
>>> Since commit 3e32158767b0 ("mm/mprotect.c: don't touch single threaded
>>> PTEs which are on the right node"), the PTEs of local pages are no
>>> longer changed in change_pte_range() for a single-threaded process, so
>>> no page-fault information is generated in do_numa_page(). If a
>>> single-threaded process has memory on another node, it will
>>> unconditionally migrate all of its local memory to that node,
>>> even if the remote node holds only one page.
>>
>> IIUC the remote pages will be moved to the node where the worker
>> is running since local (private) PTEs are not set to protnone and
>> won't be faulted on.
>>
> 
> Yes.
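
For reference, the logic that commit added to change_pte_range() looks
roughly like the sketch below (paraphrased, not the exact upstream code;
names and surrounding structure are approximate). When the protection
scan runs for NUMA balancing on a single-threaded private mapping, pages
already on the local node are skipped, so they never become PROT_NONE and
never produce hint faults:

	int target_node = NUMA_NO_NODE;

	/* Remember the local node for single-threaded private mappings. */
	if (prot_numa && !(vma->vm_flags & VM_SHARED) &&
	    atomic_read(&vma->vm_mm->mm_users) == 1)
		target_node = numa_node_id();
	...
	if (prot_numa) {
		struct page *page = vm_normal_page(vma, addr, oldpte);

		if (!page || PageKsm(page))
			continue;

		/* Already PROT_NONE: avoid a needless TLB flush. */
		if (pte_protnone(oldpte))
			continue;

		/*
		 * Don't touch PTEs of pages already sitting on the node a
		 * single-threaded process runs on, so local pages never
		 * fault and never show up in p->numa_faults[].
		 */
		if (target_node == page_to_nid(page))
			continue;
	}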
> 
>>>
>>> So, let's fix it. The memory of a single-threaded process should follow
>>> the CPU, not the NUMA faults info, in order to avoid memory thrashing.
>>
>> Don't forget the 'Fixes' tag for bugfix patches :)
> 
> OK, thanks.
> 
>>
>>>
>>> ...
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 24dda708b699..d7cbbda568fb 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2898,6 +2898,12 @@ static void task_numa_placement(struct task_struct *p)
>>>                numa_group_count_active_nodes(ng);
>>>                spin_unlock_irq(group_lock);
>>>                max_nid = preferred_group_nid(p, max_nid);
>>> +     } else if (atomic_read(&p->mm->mm_users) == 1) {
>>> +             /*
>>> +              * The memory of a single-threaded process should
>>> +              * follow the CPU in order to avoid memory thrashing.
>>> +              */
>>> +             max_nid = numa_node_id();
>>>        }
>>>
>>>        if (max_faults) {
>>
>> Since you don't want to respect the faults info, can we simply
>> skip task placement?
> 
> This is a good suggestion. It would be even better if there were some
> feedback from others.

Although it is still a pity that, if a remote node holds more hot pages
than p->numa_preferred_nid does, we have to migrate them to the local
node rather than simply migrating the task to the hot node.
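
To make the earlier suggestion a bit more concrete, a minimal sketch of
what "simply skip task placement" could look like, against the same hunk
the patch touches (hypothetical and untested, just one possible reading
of the idea):

	} else if (atomic_read(&p->mm->mm_users) == 1) {
		/*
		 * Single-threaded task outside a numa_group: local pages
		 * generate no hint faults, so the fault statistics are
		 * one-sided. Leave p->numa_preferred_nid as it is instead
		 * of letting remote-only faults choose a new node.
		 */
		return;
	}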
