Message-ID: <bc93c650-ba55-4434-98f6-3b7f556ae44b@intel.com>
Date: Tue, 6 May 2025 13:36:54 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Libo Chen <libo.chen@...cle.com>
CC: "Jain, Ayush" <ayushjai@....com>, Andrew Morton
<akpm@...ux-foundation.org>, Ingo Molnar <mingo@...hat.com>, Tejun Heo
<tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>, Jonathan Corbet
<corbet@....net>, Mel Gorman <mgormanmgorman@...e.de>, Michal Hocko
<mhocko@...nel.org>, Muchun Song <muchun.song@...ux.dev>, Roman Gushchin
<roman.gushchin@...ux.dev>, Shakeel Butt <shakeel.butt@...ux.dev>, "Chen, Tim
C" <tim.c.chen@...el.com>, Aubrey Li <aubrey.li@...el.com>,
<cgroups@...r.kernel.org>, <linux-doc@...r.kernel.org>, <linux-mm@...ck.org>,
<linux-kernel@...r.kernel.org>, K Prateek Nayak <kprateek.nayak@....com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>, <Neeraj.Upadhyay@....com>,
Peter Zijlstra <peterz@...radead.org>, Michal Koutný
<mkoutny@...e.com>
Subject: Re: [PATCH v3] sched/numa: add statistics of numa balance task
migration
On 5/6/2025 5:57 AM, Libo Chen wrote:
>
>
> On 5/5/25 14:32, Libo Chen wrote:
>>
>>
>> On 5/5/25 11:49, Libo Chen wrote:
>>>
>>>
>>> On 5/5/25 11:27, Chen, Yu C wrote:
>>>> Hi Michal,
>>>>
>>>> On 5/6/2025 1:46 AM, Michal Koutný wrote:
>>>>> On Mon, May 05, 2025 at 11:03:10PM +0800, "Chen, Yu C" <yu.c.chen@...el.com> wrote:
>>>>>> According to this address,
>>>>>> 4c 8b af 50 09 00 00 mov 0x950(%rdi),%r13 <--- r13 = p->mm;
>>>>>> 49 8b bd 98 04 00 00 mov 0x498(%r13),%rdi <--- p->mm->owner
>>>>>> It seems that the task to be swapped has a NULL mm_struct.
>>>>>
>>>>> So it's likely a kernel thread. Does it make sense to NUMA balance
>>>>> those? (I naïvely think it doesn't, please correct me.) ...
>>>>>
>>>>
>>>> I agree kernel threads are not supposed to be covered by
>>>> NUMA balance, because currently NUMA balance only considers
>>>> user pages via VMAs, and one question below:
>>>>
>>>>>> static void __migrate_swap_task(struct task_struct *p, int cpu)
>>>>>> {
>>>>>> __schedstat_inc(p->stats.numa_task_swapped);
>>>>>> - count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>> + if (p->mm)
>>>>>> + count_memcg_event_mm(p->mm, NUMA_TASK_SWAP);
>>>>>
>>>>> ... proper fix should likely guard this earlier, like the guard in
>>>>> task_numa_fault() but for the other swapped task.
>>>> I see. For task swapping in task_numa_compare(),
>>>> it is triggered when there are no idle CPUs in task A's
>>>> preferred node.
>>>> In this case, we choose a task B on A's preferred node,
>>>> and swap B with A. This helps improve A's NUMA locality
>>>> without introducing a load imbalance between nodes.
>>>>
>> Hi Chenyu
>>
>> There are two problems here:
>> 1. Many kthreads are pinned, so despite all the effort in task_numa_compare()
>> and task_numa_find_cpu(), the swap may not end up happening. I only see a
>> check on the source task: cpumask_test_cpu(cpu, env->p->cpus_ptr), but not on the dst task.
>
> NVM I was blind. There is a check on dst task in task_numa_compare()
>
>> 2. Assuming B is migratable, that can potentially make B worse, right? I think
>> some kthreads are quite cache-sensitive, and we swap like their locality doesn't
>> matter.

This makes sense. I wonder if it could be extended beyond kthreads.
We don't want to swap a task B that has no explicit NUMA preference,
do we?

>>
>> Ideally we probably just want to stay off kthreads; if we cannot find any other
>> p->mm tasks, just don't swap (?). That sounds like a brand new patch though.
>>
>
> A change as simple as that should work:
>
> @@ -2492,7 +2492,7 @@ static bool task_numa_compare(struct task_numa_env *env,
>
> rcu_read_lock();
> cur = rcu_dereference(dst_rq->curr);
> - if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
> + if (cur && ((cur->flags & PF_EXITING) || !cur->mm || is_idle_task(cur)))

Or perhaps something like:

	if (cur && ((cur->flags & PF_EXITING) ||
		    cur->numa_preferred_nid == NUMA_NO_NODE ||
		    !cur->numa_faults || is_idle_task(cur)))

But overall it looks good to me. Would you like to post this as a
formal patch, or do you want me to fold your change into a patch set?

thanks,
Chenyu

> cur = NULL;
>
>>
>>
>> Libo
>>>> But B's NUMA node preference is not mandatory in the
>>>> current implementation IIUC, because B's load is mainly
>>>
>>> hmm, that doesn't seem to be right, can we choose a B that
>>> is not a kthread from A's preferred node?
>>>
>>>> considered. That is to say, is it legit to swap a
>>>> NUMA-sensitive task A with a non-NUMA-sensitive kernel
>>>> thread B? If not, I think we can add a kernel thread
>>>> check in task swap like the guard in
>>>> task_tick_numa()/task_numa_fault().
>>>>
>>>
>>>
>>>> thanks,
>>>> Chenyu
>>>>
>>>>>
>>>>> Michal
>>>>
>>>
>>
>
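
The two swap-candidate filters discussed in this thread can be compared side by side in a small user-space model. This is only a sketch, not kernel code: struct model_task, the MODEL_* constants, and both function names are hypothetical stand-ins for task_struct, PF_EXITING, NUMA_NO_NODE, and the condition inside task_numa_compare(). It shows how the !cur->mm check and the numa_preferred_nid / numa_faults check differ for a user task that has no NUMA preference yet.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's PF_EXITING and NUMA_NO_NODE. */
#define MODEL_PF_EXITING   0x4u
#define MODEL_NUMA_NO_NODE (-1)

/* Hypothetical, heavily reduced model of the task_struct fields involved. */
struct model_task {
	unsigned int flags;               /* task flags, e.g. MODEL_PF_EXITING */
	const void *mm;                   /* NULL models a kernel thread */
	int numa_preferred_nid;           /* MODEL_NUMA_NO_NODE if unset */
	const unsigned long *numa_faults; /* NULL if no fault stats exist */
	bool is_idle;                     /* models is_idle_task() */
};

/*
 * Variant 1: models the "!cur->mm" check from the quoted diff.
 * Returns true when cur should not be used as a swap candidate.
 */
static bool skip_swap_mm_check(const struct model_task *cur)
{
	return cur && ((cur->flags & MODEL_PF_EXITING) ||
		       !cur->mm || cur->is_idle);
}

/*
 * Variant 2: models the check on NUMA preference and fault statistics.
 * It also skips user tasks that have no NUMA preference recorded.
 */
static bool skip_swap_numa_check(const struct model_task *cur)
{
	return cur && ((cur->flags & MODEL_PF_EXITING) ||
		       cur->numa_preferred_nid == MODEL_NUMA_NO_NODE ||
		       !cur->numa_faults || cur->is_idle);
}
```

In this model a kernel thread (mm == NULL, no fault stats) is skipped by both variants, while a user task with an mm but no preferred node is skipped only by the second one. As in the quoted diff, a NULL cur is not "skipped" by either predicate; in the real function that case is handled separately.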