linux-kernel - Re: [PATCH v3] sched/numa: add statistics of numa balance task migration

Message-ID: <c1ec0cc1-8a1e-4db6-927e-5a1422f2c191@oracle.com>
Date: Tue, 6 May 2025 00:03:59 -0700
From: Libo Chen <libo.chen@...cle.com>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: "Jain, Ayush" <ayushjai@....com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>, Jonathan Corbet <corbet@....net>,
        Mel Gorman <mgorman@...e.de>, Michal Hocko <mhocko@...nel.org>,
        Muchun Song <muchun.song@...ux.dev>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Shakeel Butt <shakeel.butt@...ux.dev>,
        "Chen, Tim C" <tim.c.chen@...el.com>, Aubrey Li <aubrey.li@...el.com>,
        cgroups@...r.kernel.org, linux-doc@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, K Prateek Nayak <kprateek.nayak@....com>,
        Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, Neeraj.Upadhyay@....com,
        Peter Zijlstra <peterz@...radead.org>,
        Michal Koutný
 <mkoutny@...e.com>
Subject: Re: [PATCH v3] sched/numa: add statistics of numa balance task
 migration



On 5/5/25 22:36, Chen, Yu C wrote:
> On 5/6/2025 5:57 AM, Libo Chen wrote:
>>
>>
>> On 5/5/25 14:32, Libo Chen wrote:
>>>
>>>
>>> On 5/5/25 11:49, Libo Chen wrote:
>>>>
>>>>
>>>> On 5/5/25 11:27, Chen, Yu C wrote:
>>>>> Hi Michal,
>>>>>
>>>>> On 5/6/2025 1:46 AM, Michal Koutný wrote:
>>>>>> On Mon, May 05, 2025 at 11:03:10PM +0800, "Chen, Yu C" <yu.c.chen@...el.com> wrote:
>>>>>>> According to this address,
>>>>>>>      4c 8b af 50 09 00 00    mov    0x950(%rdi),%r13  <--- r13 = p->mm;
>>>>>>>      49 8b bd 98 04 00 00    mov    0x498(%r13),%rdi  <--- p->mm->owner
>>>>>>> It seems that this task to be swapped has NULL mm_struct.
>>>>>>
>>>>>> So it's likely a kernel thread. Does it make sense to NUMA balance
>>>>>> those? (I naïvely think it doesn't, please correct me.) ...
>>>>>>
>>>>>
>>>>> I agree kernel threads are not supposed to be covered by
>>>>> NUMA balance, because currently NUMA balance only considers
>>>>> user pages via VMAs, and one question below:
>>>>>
>>>>>>>    static void __migrate_swap_task(struct task_struct *p, int cpu)
>>>>>>>    {
>>>>>>>           __schedstat_inc(p->stats.numa_task_swapped);
>>>>>>> -       count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>>> +       if (p->mm)
>>>>>>> +               count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>>
>>>>>> ... proper fix should likely guard this earlier, like the guard in
>>>>>> task_numa_fault() but for the other swapped task.
>>>>> I see. For task swapping in task_numa_compare(),
>>>>> it is triggered when there are no idle CPUs in task A's
>>>>> preferred node.
>>>>> In this case, we choose a task B on A's preferred node,
>>>>> and swap B with A. This helps improve A's NUMA locality
>>>>> without introducing load imbalance between nodes.
>>>>>
>>> Hi Chenyu
>>>
>>> There are two problems here:
>>> 1. Many kthreads are pinned, so even with all the effort in task_numa_compare()
>>> and task_numa_find_cpu(), the swap may not end up happening. I only see a
>>> check on the source task, cpumask_test_cpu(cpu, env->p->cpus_ptr), but not on the dst task.
>>
>> NVM I was blind. There is a check on dst task in task_numa_compare()
>>
>>> 2. Assuming B is migratable, that can potentially make B worse, right? I think
>>> some kthreads are quite cache-sensitive, and we swap like their locality doesn't
>>> matter.
> 
> This makes sense. I wonder if it could be extended beyond kthreads.
> We don't want to swap task B that has no explicit NUMA preference,
> do we?
> 

I agree, at least that should be the default behavior.

>>>
>>> Ideally we probably just want to stay off kthreads; if we cannot find any other
>>> tasks with p->mm, just don't swap (?). That sounds like a brand new patch though.
>>>
>>
>> A change as simple as that should work:
>>
>> @@ -2492,7 +2492,7 @@ static bool task_numa_compare(struct task_numa_env *env,
>>
>>          rcu_read_lock();
>>          cur = rcu_dereference(dst_rq->curr);
>> -       if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
>> +       if (cur && ((cur->flags & PF_EXITING) || !cur->mm || is_idle_task(cur)))
> 
> something like
> if (cur && ((cur->flags & PF_EXITING) ||
>     cur->numa_preferred_nid == NUMA_NO_NODE ||
>     !cur->numa_faults || is_idle_task(cur)))
> 

This implicitly skips kthreads, so it probably needs a comment. Otherwise LGTM.
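
To make the implicit part explicit, here is a rough user-space model of the
proposed check with the kind of comment I have in mind. It is only a sketch,
not kernel code: struct mock_task, its is_idle field (standing in for
is_idle_task()) and filter_swap_candidate() are hypothetical names, and it
assumes that a kthread never gets numa_faults allocated because it has no mm.

```c
#include <stdbool.h>
#include <stddef.h>

/* Values mirror the kernel definitions; redefined here for a standalone build. */
#define PF_EXITING   0x00000004
#define NUMA_NO_NODE (-1)

/* Hypothetical stand-in for the few task_struct fields the check reads. */
struct mock_task {
	unsigned int flags;
	int numa_preferred_nid;
	unsigned long *numa_faults;	/* assumed NULL for kthreads (no mm) */
	bool is_idle;			/* stands in for is_idle_task(cur) */
};

/*
 * Return cur if it is an acceptable swap candidate, NULL otherwise.
 *
 * Skip tasks that are exiting or idle, and tasks without NUMA state:
 * no preferred node or no fault statistics. The last two conditions
 * also skip kernel threads implicitly, since they never take NUMA
 * hinting faults.
 */
static const struct mock_task *
filter_swap_candidate(const struct mock_task *cur)
{
	if (cur && ((cur->flags & PF_EXITING) ||
		    cur->numa_preferred_nid == NUMA_NO_NODE ||
		    !cur->numa_faults || cur->is_idle))
		return NULL;

	return cur;
}
```

A NULL result corresponds to the "cur = NULL" path in task_numa_compare(),
i.e. the destination CPU is treated as having no task to swap with.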

> But overall it looks good to me, would you like to post this as a
> formal patch, or do you want me to fold your change into a patch set?
> 

You can fold it into one set.

Thanks,
Libo

> thanks,
> Chenyu
> 
>>                  cur = NULL;
>>
> 
> 
>  
> 
>>>
>>>
>>> Libo
>>>>> But B's NUMA node preference is not mandatory in
>>>>> the current implementation IIUC, because B's load is mainly
>>>>
>>>> hmm, that doesn't seem right; can we choose a B that
>>>> is not a kthread from A's preferred node?
>>>>
>>>>> considered. That is to say, is it legit to swap a
>>>>> NUMA-sensitive task A with a non-NUMA-sensitive kernel
>>>>> thread B? If not, I think we can add a kernel thread
>>>>> check in task swap like the guard in
>>>>> task_tick_numa()/task_numa_fault().
>>>>>
>>>>
>>>>
>>>>> thanks,
>>>>> Chenyu
>>>>>
>>>>>>
>>>>>> Michal
>>>>>
>>>>
>>>
>>