Message-ID: <9a6402ca-ce63-430b-b60b-1a36971e37e4@oracle.com>
Date: Thu, 1 May 2025 00:00:15 -0700
From: Libo Chen <libo.chen@...cle.com>
To: Chen Yu <yu.c.chen@...el.com>, Peter Zijlstra <peterz@...radead.org>,
Andrew Morton <akpm@...ux-foundation.org>
Cc: Ingo Molnar <mingo@...hat.com>, Tejun Heo <tj@...nel.org>,
Johannes Weiner <hannes@...xchg.org>, Jonathan Corbet <corbet@....net>,
Mel Gorman <mgormanmgorman@...e.de>, Michal Hocko <mhocko@...nel.org>,
Michal Koutny <mkoutny@...e.com>, Muchun Song <muchun.song@...ux.dev>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>,
"Chen, Tim C" <tim.c.chen@...el.com>, Aubrey Li <aubrey.li@...el.com>,
cgroups@...r.kernel.org, linux-doc@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, K Prateek Nayak <kprateek.nayak@....com>,
Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
Subject: Re: [PATCH v3] sched/numa: add statistics of numa balance task
migration
Hi Chen Yu
On 4/30/25 03:36, Chen Yu wrote:
> On systems with NUMA balancing enabled, it is found that tracking
> the task activities due to NUMA balancing is helpful. NUMA balancing
> has two mechanisms for task migration: one is to migrate the task to
> an idle CPU in its preferred node, the other is to swap tasks on
> different nodes if they are on each other's preferred node.
>
> The kernel already has NUMA page migration statistics in
> /sys/fs/cgroup/mytest/memory.stat and /proc/{PID}/sched,
> but does not have statistics for task migration/swap.
> Add the task migration and swap count accordingly.
>
> The following two new fields:
>
> numa_task_migrated
> numa_task_swapped
>
> will be displayed in both
> /sys/fs/cgroup/{GROUP}/memory.stat and /proc/{PID}/sched
>
Both stats show up in the expected places, but I noticed they also appear in
/proc/vmstat, where they are always 0.
I think you may need to add count_vm_numa_event() calls in migrate_task_to()
and __migrate_swap_task(), unless there is a way to keep both stats out of
/proc/vmstat.
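For anyone who wants to reproduce the observation above from user space, here is a minimal sketch in C. It assumes the file uses the plain "name value" layout of /proc/vmstat and the cgroup memory.stat file; it does not parse /proc/{PID}/sched, which uses a different layout. The helper name read_stat_counter() is hypothetical and is not part of the patch.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up a counter in a "name value" stat file such
 * as /proc/vmstat or /sys/fs/cgroup/{GROUP}/memory.stat.
 *
 * Returns 0 and stores the counter in *value on success, or -1 if the file
 * cannot be opened or the key is not present.
 */
static int read_stat_counter(const char *path, const char *key,
			     unsigned long long *value)
{
	char name[128];
	unsigned long long v;
	int ret = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;

	/* Each line is expected to be "<name> <unsigned integer>". */
	while (fscanf(f, "%127s %llu", name, &v) == 2) {
		if (strcmp(name, key) == 0) {
			*value = v;
			ret = 0;
			break;
		}
	}

	fclose(f);
	return ret;
}
```

Calling read_stat_counter("/proc/vmstat", "numa_task_migrated", &v) and comparing the result with the same key in the cgroup's memory.stat should show whether the system-wide counter stays at 0 while the per-memcg counter advances.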
> Introducing both pertask and permemcg NUMA balancing statistics helps
> to quickly evaluate the performance and resource usage of the target
> workload. For example, the user can first identify the container which
> has high NUMA balance activity and then narrow down to a specific task
> within that group, and tune the memory policy of that task.
> In summary, it is plausible to iterate the /proc/$pid/sched to find the
> offending task, but the introduction of per memcg tasks' Numa balancing
> aggregated activity can further help users identify the task in a
> divide-and-conquer way.
>
> Tested-by: K Prateek Nayak <kprateek.nayak@....com>
> Tested-by: Madadi Vineeth Reddy <vineethr@...ux.ibm.com>
> Acked-by: Peter Zijlstra (Intel) <peterz@...radead.org>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> ---
> v2->v3:
> Remove unnecessary p->mm check because kernel threads are
> not supported by Numa Balancing. (Libo Chen)
> v1->v2:
> Update the Documentation/admin-guide/cgroup-v2.rst. (Michal)
> ---
> Documentation/admin-guide/cgroup-v2.rst | 6 ++++++
> include/linux/sched.h | 4 ++++
> include/linux/vm_event_item.h | 2 ++
> kernel/sched/core.c | 7 +++++--
> kernel/sched/debug.c | 4 ++++
> mm/memcontrol.c | 2 ++
> mm/vmstat.c | 2 ++
> 7 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 1a16ce68a4d7..d346f3235945 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1670,6 +1670,12 @@ The following nested keys are defined.
>  	  numa_hint_faults (npn)
>  		Number of NUMA hinting faults.
>
> +	  numa_task_migrated (npn)
> +		Number of task migration by NUMA balancing.
> +
> +	  numa_task_swapped (npn)
> +		Number of task swap by NUMA balancing.
> +
>  	  pgdemote_kswapd
>  		Number of pages demoted by kswapd.
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..1c50e30b5c01 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -549,6 +549,10 @@ struct sched_statistics {
>  	u64 nr_failed_migrations_running;
>  	u64 nr_failed_migrations_hot;
>  	u64 nr_forced_migrations;
> +#ifdef CONFIG_NUMA_BALANCING
> +	u64 numa_task_migrated;
> +	u64 numa_task_swapped;
> +#endif
>
This one is more of a personal preference. I understand they show up only if
schedstats is turned on, but would it be better to put them in sched_show_numa()
so they are printed next to the other NUMA stats such as numa_pages_migrated?
@@ -1153,6 +1153,10 @@ static void sched_show_numa(struct task_struct *p, struct seq_file *m)
 	if (p->mm)
 		P(mm->numa_scan_seq);
 
+	if (schedstat_enabled()) {
+		P_SCHEDSTAT(numa_task_migrated);
+		P_SCHEDSTAT(numa_task_swapped);
+	}
 	P(numa_pages_migrated);
 	P(numa_preferred_nid);
 	P(total_numa_faults);
Thanks,
Libo
>  	u64 nr_wakeups;
>  	u64 nr_wakeups_sync;
> diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
> index 9e15a088ba38..91a3ce9a2687 100644
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -66,6 +66,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		NUMA_HINT_FAULTS,
>  		NUMA_HINT_FAULTS_LOCAL,
>  		NUMA_PAGE_MIGRATE,
> +		NUMA_TASK_MIGRATE,
> +		NUMA_TASK_SWAP,
>  #endif
>  #ifdef CONFIG_MIGRATION
>  		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba0..25a92f2abda4 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3352,6 +3352,9 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
>  #ifdef CONFIG_NUMA_BALANCING
>  static void __migrate_swap_task(struct task_struct *p, int cpu)
>  {
> +	__schedstat_inc(p->stats.numa_task_swapped);
> +	count_memcg_events_mm(p->mm, NUMA_TASK_SWAP, 1);
> +
>  	if (task_on_rq_queued(p)) {
>  		struct rq *src_rq, *dst_rq;
>  		struct rq_flags srf, drf;
> @@ -7953,8 +7956,8 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
>  	if (!cpumask_test_cpu(target_cpu, p->cpus_ptr))
>  		return -EINVAL;
>
> -	/* TODO: This is not properly updating schedstats */
> -
> +	__schedstat_inc(p->stats.numa_task_migrated);
> +	count_memcg_events_mm(p->mm, NUMA_TASK_MIGRATE, 1);
>  	trace_sched_move_numa(p, curr_cpu, target_cpu);
>  	return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>  }
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 56ae54e0ce6a..f971c2af7912 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -1206,6 +1206,10 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>  	P_SCHEDSTAT(nr_failed_migrations_running);
>  	P_SCHEDSTAT(nr_failed_migrations_hot);
>  	P_SCHEDSTAT(nr_forced_migrations);
> +#ifdef CONFIG_NUMA_BALANCING
> +	P_SCHEDSTAT(numa_task_migrated);
> +	P_SCHEDSTAT(numa_task_swapped);
> +#endif
>  	P_SCHEDSTAT(nr_wakeups);
>  	P_SCHEDSTAT(nr_wakeups_sync);
>  	P_SCHEDSTAT(nr_wakeups_migrate);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c96c1f2b9cf5..cdaab8a957f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -463,6 +463,8 @@ static const unsigned int memcg_vm_event_stat[] = {
>  	NUMA_PAGE_MIGRATE,
>  	NUMA_PTE_UPDATES,
>  	NUMA_HINT_FAULTS,
> +	NUMA_TASK_MIGRATE,
> +	NUMA_TASK_SWAP,
>  #endif
>  };
>
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 4c268ce39ff2..ed08bb384ae4 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1347,6 +1347,8 @@ const char * const vmstat_text[] = {
>  	"numa_hint_faults",
>  	"numa_hint_faults_local",
>  	"numa_pages_migrated",
> +	"numa_task_migrated",
> +	"numa_task_swapped",
>  #endif
>  #ifdef CONFIG_MIGRATION
>  	"pgmigrate_success",