Message-ID: <2573b108-7885-5c4f-a0ae-2b245d663250@linux.alibaba.com>
Date: Fri, 1 Nov 2019 19:52:15 +0800
From: 王贇 <yun.wang@...ux.alibaba.com>
To: Mel Gorman <mgorman@...e.de>
Cc: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
On 2019/11/1 5:13 PM, Mel Gorman wrote:
[snip]
>> For example, in our case we could have hundreds of cgroups, each
>> containing hundreds of tasks, and these worker threads may live and die
>> at any moment. To gather the data we have to cat the list of tasks and
>> then read their proc files one by one, which traps into the kernel
>> rapidly and may even need to take some locks. This introduces a big
>> latency impact and gives inaccurate output, since some tasks may have
>> already died before we read their data.
>>
>> Then, before the next sample window, the info of tasks that died during
>> the window can't be acquired anymore.
>>
>> We need the kernel's help to preserve the data, since the tool can't
>> catch it in time before it is lost. We also have to avoid rapid proc
>> reading, which is really costly and, furthermore, introduces big latency
>> in each sample window.
>>
>
> There is somewhat of a disconnect here. You say that the information must
> be accurate and historical yet are relying on NUMA hinting faults to build
> the picture which may not be accurate at all given that faults are not
> guaranteed to happen. For short-lived tasks, it is also potentially skewed
> information if short-lived tasks dominated remote accesses for whatever
> reason even though it does not matter -- the tasks were short-lived and
> their performance is probably irrelevant. Short-lived tasks may not even
> show up if they do not run longer than sysctl_numa_balancing_scan_delay
> so the data gathered already has holes in it.
>
> While it's a bit more of a stretch, even this could still be done from
> userspace if numa_hint_fault was probed and the event handled (eBPF,
> systemtap etc) to build the picture, or by adding a tracepoint. That
> would give a much higher degree of flexibility on what information is
> tracked.
>
> So, overall I think this can be done outside the kernel but recognise
> that it may not be suitable in all cases. If you feel it must be done
> inside the kernel, split out the patch that adds information on failed
> page migrations as it stands apart. Put it behind its own kconfig entry
> that is disabled by default -- do not tie it directly to NUMA balancing
> because of the data structure changes. When enabled, it should still be
> disabled by default at runtime and only activated via kernel command line
> parameter so that the only people who pay the cost are those that take
> deliberate action to enable it.
Agreed, we could have the per-task faults info there, which gives the
possibility to implement a practical userland tool, while keeping the
kernel numa stats disabled by default; folks who have no such tool but
want to do easy monitoring can just turn on the switch :-)
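
For reference, such a userland tool could start from a small eBPF program
that probes task_numa_fault() and aggregates hinting faults per cgroup,
roughly like below (just a sketch, assuming task_numa_fault() is not
inlined and libbpf CO-RE headers are available; the map layout and names
are made up for illustration):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hinting fault pages, keyed by cgroup id */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 10240);
	__type(key, __u64);
	__type(value, __u64);
} faults_per_cgroup SEC(".maps");

SEC("kprobe/task_numa_fault")
int BPF_KPROBE(count_numa_fault, int last_cpupid, int mem_node, int pages)
{
	__u64 cgid = bpf_get_current_cgroup_id();
	__u64 *val = bpf_map_lookup_elem(&faults_per_cgroup, &cgid);

	if (val) {
		__sync_fetch_and_add(val, pages);
	} else {
		__u64 init = pages;

		bpf_map_update_elem(&faults_per_cgroup, &cgid, &init,
				    BPF_NOEXIST);
	}
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

A loader would then dump the map once per sample window instead of
walking /proc, so data of tasks which died inside the window is not lost.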
Will have these in the next version:
* a separate patch for showing per-task faults info
* a new CONFIG for the numa stat (disabled by default)
* a dynamic runtime switch for the numa stat (disabled by default, rough
  sketch below)
* a doc to explain the numa stat and give hints on how to handle it
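
For the runtime switch, one possible shape (just a sketch; the
"numa_stat" parameter and key names are made up) is a default-off static
key flipped from the kernel command line:

#include <linux/jump_label.h>
#include <linux/init.h>

/* off by default, nobody pays the cost unless they ask for it */
DEFINE_STATIC_KEY_FALSE(sched_numa_stat);

static int __init numa_stat_setup(char *str)
{
	static_branch_enable(&sched_numa_stat);
	return 1;
}
__setup("numa_stat", numa_stat_setup);

/* hot paths only touch the new counters when enabled */
static inline bool numa_stat_enabled(void)
{
	return static_branch_unlikely(&sched_numa_stat);
}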
Best Regards,
Michael Wang
>