Message-ID: <c6bfa201-ed88-47df-9402-ead65d7be475@intel.com>
Date: Tue, 3 Jun 2025 22:46:06 +0800
From: "Chen, Yu C" <yu.c.chen@...el.com>
To: Michal Koutný <mkoutny@...e.com>, Shakeel Butt
<shakeel.butt@...ux.dev>
CC: <peterz@...radead.org>, <akpm@...ux-foundation.org>, <mingo@...hat.com>,
<tj@...nel.org>, <hannes@...xchg.org>, <corbet@....net>, <mgorman@...e.de>,
<mhocko@...nel.org>, <muchun.song@...ux.dev>, <roman.gushchin@...ux.dev>,
<tim.c.chen@...el.com>, <aubrey.li@...el.com>, <libo.chen@...cle.com>,
<kprateek.nayak@....com>, <vineethr@...ux.ibm.com>, <venkat88@...ux.ibm.com>,
<ayushjai@....com>, <cgroups@...r.kernel.org>, <linux-doc@...r.kernel.org>,
<linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>,
<yu.chen.surf@...mail.com>
Subject: Re: [PATCH v5 2/2] sched/numa: add statistics of numa balance task
Hi Michal,
On 6/3/2025 12:53 AM, Michal Koutný wrote:
> On Tue, May 27, 2025 at 11:15:33AM -0700, Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>> I am now more inclined to keep these new stats in memory.stat as the
>> current version is doing because:
>>
>> 1. Relevant stats are exposed through the same interface and we already
>> have numa balancing stats in memory.stat.
>>
>> 2. There is no single good home for these new stats, and exposing them
>> in cpu.stat would require more code; even if we reuse the memcg infra,
>> we would still need to flush the memcg stats, so why not just expose
>> them in memory.stat.
>>
>> 3. Though a bit far-fetched, I think we may add more stats which sit at
>> the boundary of sched and mm in the future. NUMA balancing is one
>> concrete example of such stats. I am envisioning that for reliable
>> memory reclaim or overcommit, there might be some useful events as
>> well. Anyways it is still unbaked atm.
>>
>>
>> Michal, let me know your thought on this.
>
> I reckon users may be a little bit more likely to look for that info
> in memory.stat.
>
> Which would be OK unless threaded subtrees are considered (e.g. cpuset
> (NUMA affinity) has thread granularity) and these migration stats are
> potentially per-thread relevant.
>
>
> I was also pondering why a misplaced container cannot be found by the
> existing NUMA stats. Chen has explained task vs page migration in NUMA
> balancing. I guess the mere page migration number (especially when
> stagnating) may not point to the misplaced container. OK.
>
> The second thing is what the "misplaced" container is. Is it because of
> a wrong set_mempolicy(2) or cpuset configuration?
> If it's the former (i.e.
> it requires an enabled cpuset controller), it'd justify exposing this
> info in cpuset.stat; if it's the latter, the cgroup aggregation is not
> that relevant (hence /proc/<PID>/sched is sufficient). Or is there
> another meaning of a misplaced container? Chen, could you please clarify?
My understanding is that the "misplaced" container is not strictly tied
to set_mempolicy or cpuset configuration, but is mainly caused by the
scheduler's generic load balancer. The generic load balancer spreads
tasks across different nodes to fully utilize idle CPUs, while NUMA
balancing tries to pull misplaced tasks/pages back to honor NUMA locality.
Regarding the threaded subtrees mode, I was previously unfamiliar with
it and have been trying to understand it better. If I understand
correctly, when threads within a single process are placed in different
cgroups via cpuset, we might need to scan /proc/<PID>/sched to collect
the NUMA task migration/swap statistics; if threaded subtrees are
disabled for that process, we can query memory.stat.
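
For example, a rough sketch of that per-thread fallback could look like
the snippet below (just an illustration, not tested; it assumes the
per-task counters added by this patch show up in
/proc/<PID>/task/<TID>/sched as "name : value" lines called
numa_task_migrated and numa_task_swapped):

#!/usr/bin/env python3
# Sketch: aggregate per-thread NUMA-balancing task migration/swap
# counters for one process by scanning /proc/<PID>/task/<TID>/sched.
# The field names below are assumptions based on this patch series.
import os
import sys

FIELDS = ("numa_task_migrated", "numa_task_swapped")

def read_sched_counters(tid_dir):
    """Parse 'name : value' lines from a thread's sched file."""
    counters = {}
    try:
        with open(os.path.join(tid_dir, "sched")) as f:
            for line in f:
                parts = line.split(":")
                if len(parts) == 2 and parts[0].strip() in FIELDS:
                    counters[parts[0].strip()] = int(parts[1].strip())
    except (FileNotFoundError, ProcessLookupError):
        pass  # the thread exited while we were scanning
    return counters

def sum_process(pid):
    """Sum the counters across all threads of the given process."""
    total = dict.fromkeys(FIELDS, 0)
    task_root = f"/proc/{pid}/task"
    for tid in os.listdir(task_root):
        for name, val in read_sched_counters(os.path.join(task_root, tid)).items():
            total[name] += val
    return total

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else str(os.getpid())
    for name, val in sum_process(pid).items():
        print(f"{name}: {val}")

That is more work than reading a single memory.stat file, which is part
of why the cgroup-level aggregation is convenient when threaded
subtrees are not used.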
I agree with your prior point that NUMA balancing task activity is not
directly associated with either the Memory controller or the CPU
controller. Although showing this data in cpu.stat might seem more
appropriate, we expose it in memory.stat due to the following
trade-offs (or as an exception for NUMA balancing):
1. It aligns with existing NUMA-related metrics already present in
   memory.stat.
2. It simplifies the code implementation.
thanks,
Chenyu
>
> Because the memory controller doesn't control NUMA, it needn't be
> enabled to have these statistics, and it cannot be enabled in threaded
> groups, so I'm having some doubts whether memory.stat is a good home
> for this field.
>
> Regards,
> Michal