Message-ID: <2573b108-7885-5c4f-a0ae-2b245d663250@linux.alibaba.com>
Date: Fri, 1 Nov 2019 19:52:15 +0800
From: 王贇 <yun.wang@...ux.alibaba.com>
To: Mel Gorman <mgorman@...e.de>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
On 2019/11/1 5:13 PM, Mel Gorman wrote:
[snip]
>> For example, in our case we could have hundreds of cgroups, each
>> containing hundreds of tasks, and these worker threads may live and die
>> at any moment. To gather the data we have to cat the list of tasks and
>> then read their proc files one by one, which traps into the kernel
>> rapidly and may even need to take some locks. This introduces a big
>> latency impact and gives inaccurate output, since some tasks may have
>> already died before we read their data.
>>
>> Then, before the next sample window, the info of tasks that died during
>> the window can't be acquired anymore.
>>
>> We need the kernel's help to preserve the data, since the tool can't
>> catch it in time before it is lost. We also have to avoid rapid proc
>> reading, which is really costly and, furthermore, introduces big latency
>> in each sample window.
>>
>
> There is somewhat of a disconnect here. You say that the information must
> be accurate and historical yet are relying on NUMA hinting faults to build
> the picture which may not be accurate at all given that faults are not
> guaranteed to happen. For short-lived tasks, it is also potentially skewed
> information if short-lived tasks dominated remote accesses for whatever
> reason even though it does not matter -- the tasks were short-lived and
> their performance is probably irrelevant. Short-lived tasks may not even
> show up if they do not run longer than sysctl_numa_balancing_scan_delay
> so the data gathered already has holes in it.
>
> While it's a bit more of a stretch, even this could still be done from
> userspace if numa_hint_fault was probed and the event handled (eBPF,
> systemtap etc) to build the picture, or by adding a tracepoint. That
> would give a much higher degree of flexibility on what information is
> tracked.
>
> So, overall I think this can be done outside the kernel but recognise
> that it may not be suitable in all cases. If you feel it must be done
> inside the kernel, split out the patch that adds information on failed
> page migrations as it stands apart. Put it behind its own kconfig entry
> that is disabled by default -- do not tie it directly to NUMA balancing
> because of the data structure changes. When enabled, it should still be
> disabled by default at runtime and only activated via kernel command line
> parameter so that the only people who pay the cost are those that take
> deliberate action to enable it.
Agreed, we could have the per-task faults info there, which gives the
possibility to implement a practical userland tool, while keeping the
kernel numa stats disabled by default; folks who have no such tool but
want to do easy monitoring can just turn on the switch :-)
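
For reference, such a userland tool could start from a small eBPF program
that probes task_numa_fault() and aggregates hinting faults per cgroup,
roughly like below (just a sketch, assuming task_numa_fault() is not
inlined and libbpf CO-RE headers are available; the map layout and names
are made up for illustration):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hinting fault pages, keyed by cgroup id */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u64);
	__type(value, __u64);
} faults_per_cgroup SEC(".maps");

SEC("kprobe/task_numa_fault")
int BPF_KPROBE(count_numa_fault, int last_cpupid, int mem_node, int pages)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u64 *val = bpf_map_lookup_elem(&faults_per_cgroup, &cgid);

	if (val) {
		__sync_fetch_and_add(val, pages);
	} else {
		__u64 init = pages;

		bpf_map_update_elem(&faults_per_cgroup, &cgid, &init,
				    BPF_NOEXIST);
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

A loader would then dump the map once per sample window instead of
walking /proc, so data of tasks which died inside the window is not lost.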
Will have these in the next version:
* a separate patch for showing per-task faults info
* a new CONFIG for the numa stat (disabled by default)
* a dynamic runtime switch for the numa stat (disabled by default, rough
  sketch below)
* a doc to explain the numa stat and give hints on how to handle it
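
For the runtime switch, one possible shape (just a sketch; the
"numa_stat" parameter and key names are made up) is a default-off static
key flipped from the kernel command line:

#include <linux/jump_label.h>
#include <linux/init.h>

/* off by default, nobody pays the cost unless they ask for it */
DEFINE_STATIC_KEY_FALSE(sched_numa_stat);

static int __init numa_stat_setup(char *str)
{
	static_branch_enable(&sched_numa_stat);
	return 1;
}
__setup("numa_stat", numa_stat_setup);

/* hot paths only touch the new counters when enabled */
static inline bool numa_stat_enabled(void)
{
	return static_branch_unlikely(&sched_numa_stat);
}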
Best Regards,
Michael Wang
>