[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fpa42ohp54ewxxymaclnmiafdlfs7lbddnqhtv7haksdd5jq6z@mb6jxk3pl2m2>
Date: Tue, 27 May 2025 11:15:33 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: "Chen, Yu C" <yu.c.chen@...el.com>
Cc: Michal Koutný <mkoutny@...e.com>,
peterz@...radead.org, akpm@...ux-foundation.org, mingo@...hat.com, tj@...nel.org,
hannes@...xchg.org, corbet@....net, mgorman@...e.de, mhocko@...nel.org,
muchun.song@...ux.dev, roman.gushchin@...ux.dev, tim.c.chen@...el.com,
aubrey.li@...el.com, libo.chen@...cle.com, kprateek.nayak@....com,
vineethr@...ux.ibm.com, venkat88@...ux.ibm.com, ayushjai@....com,
cgroups@...r.kernel.org, linux-doc@...r.kernel.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, yu.chen.surf@...mail.com
Subject: Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
On Tue, May 27, 2025 at 05:20:54PM +0800, Chen, Yu C wrote:
> On 5/26/2025 9:35 PM, Michal Koutný wrote:
> > On Fri, May 23, 2025 at 04:42:50PM -0700, Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > Hmm these are scheduler events, how are these relevant to memory cgroup
> > > or vmstat? Any reason to not expose these in cpu.stat?
> >
> > Good point. If I take it further -- this functionality needs neither
> > memory controller (CONFIG_MEMCG) nor CPU controller
> > (CONFIG_CGROUP_SCHED), so it might be technically calculated and exposed
> > in _any_ cgroup (which would be same technical solution how cpu time is
> > counted in cpu.stat regardless of CPU controller, cpu_stat_show()).
> >
>
> Yes, we can add it to cpu.stat. However, this might make it more difficult
> for users to locate related events. Some statistics about NUMA page
> migrations/faults are recorded in memory.stat, while others about NUMA task
> migrations (triggered by NUMA faults periodicly) are stored in cpu.stat.
>
> Do you recommend extending the struct cgroup_base_stat to include counters
> for task_migrate/task_swap? Additionally, should we enhance
> cgroup_base_stat_cputime_show() to parse task_migrate/task_swap in a manner
> similar to cputime?
>
> Alternatively, as Shakeel previously mentioned, could we reuse
> "count_memcg_event_mm()" and related infrastructure while exposing these
> statistics/events in cpu.stat? I assume Shakeel was referring to the
> following
> approach:
>
> 1. Skip task migration/swap in memory.stat:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cdaab8a957f3..b8eea3eca46f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1529,6 +1529,11 @@ static void memcg_stat_format(struct mem_cgroup
> *memcg, struct seq_buf *s)
> if (memcg_vm_event_stat[i] == PGPGIN ||
> memcg_vm_event_stat[i] == PGPGOUT)
> continue;
> +#endif
> +#ifdef CONFIG_NUMA_BALANCING
> + if (memcg_vm_event_stat[i] == NUMA_TASK_MIGRATE ||
> + memcg_vm_event_stat[i] == NUMA_TASK_SWAP)
> + continue;
> #endif
>
> 2.Skip task migration/swap in /proc/vmstat
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index ed08bb384ae4..ea8a8ae1cdac 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1912,6 +1912,10 @@ static void *vmstat_next(struct seq_file *m, void
> *arg, loff_t *pos)
> (*pos)++;
> if (*pos >= NR_VMSTAT_ITEMS)
> return NULL;
> +#ifdef CONFIG_NUMA_BALANCING
> + if (*pos == NUMA_TASK_MIGRATE || *pos == NUMA_TASK_SWAP)
> + return NULL;
> +#endif
>
> 3. Display task migration/swap events in cpu.stat:
> seq_buf_printf(&s, "%s %lu\n",
> + vm_event_name(memcg_vm_event_stat[NUMA_TASK_MIGRATE]),
> + memcg_events(memcg,
> memcg_vm_event_stat[NUMA_TASK_MIGRATE]));
>
You would need to use memcg_events() and you will need to flush the
memcg rstat trees as well
>
> It looks like more code is needed. Michal, Shakeel, could you please advise
> which strategy is preferred, or should we keep the current version?
I am now more inclined to keep these new stats in memory.stat as the
current version is doing because:
1. Relevant stats are exposed through the same interface and we already
have numa balancing stats in memory.stat.
2. There is no single good home for these new stats and exposing them in
cpu.stat would require more code and even if we reuse memcg infra, we
would still need to flush the memcg stats, so why not just expose in
the memory.stat.
3. Though a bit far fetched, I think we may add more stats which sit at
the boundary of sched and mm in future. Numa balancing is one
concrete example of such stats. I am envisioning for reliable memory
reclaim or overcommit, there might be some useful events as well.
Anyways it is still unbaked atm.
Michal, let me know your thought on this.
Powered by blists - more mailing lists