Date:   Sat, 4 Jan 2020 12:51:37 +0800
From:   王贇 <yun.wang@...ux.alibaba.com>
To:     Michal Koutný <mkoutny@...e.com>
Cc:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Iurii Zaikin <yzaikin@...gle.com>,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        linux-doc@...r.kernel.org,
        "Paul E. McKenney" <paulmck@...ux.ibm.com>,
        Randy Dunlap <rdunlap@...radead.org>,
        Jonathan Corbet <corbet@....net>
Subject: Re: [PATCH v6 1/2] sched/numa: introduce per-cgroup NUMA locality
 info



On 2020/1/3 11:14 PM, Michal Koutný wrote:
> Hi.
> 
> On Fri, Dec 13, 2019 at 09:47:36AM +0800, 王贇 <yun.wang@...ux.alibaba.com> wrote:
>> By monitoring the increments, we will be able to locate the per-cgroup
>> workload which NUMA Balancing can't help with (usually caused by wrong
>> CPU and memory node bindings), so we get a chance to fix that in time.
> I just wonder whether the data based on increments match those you
> obtained previously?

They have different meanings: what we get now is just the accumulation of
the local/remote page access counters, so we have to increase the sample
period to the maximum NUMA balancing scan period, which on my system is
1 minute.

We still get useful information from the increments, for example:
  local 100 remote 1000 <-- bad locality in last period
  local 0 remote 0 <-- no scan or NUMA PF happened in last period
  local 100 remote 0 <-- good locality, but not many PFs happened in last period

So I wouldn't say they match; they tell the story in a different way :-P
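
To make that concrete, the ratio we look at is simply
local / (local + remote) over the last period. Just as an illustration
(locality_pct() is a made-up helper, not part of the patch):

	/*
	 * Illustration only, not in the patch: turn the per-period
	 * increments of the local/remote counters into a percentage.
	 * div64_u64() is from <linux/math64.h>.
	 */
	static int locality_pct(u64 local_inc, u64 remote_inc)
	{
		u64 total = local_inc + remote_inc;

		if (!total)
			return -1;	/* no scan or NUMA PF in last period */

		return div64_u64(local_inc * 100, total);
	}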

> 
>> +static inline void
>> +update_task_locality(struct task_struct *p, int pnid, int cnid, int pages)
>> +{
>> +	if (!static_branch_unlikely(&sched_numa_locality))
>> +		return;
>> +
>> +	/*
>> +	 * pnid != cnid --> remote idx 0
>> +	 * pnid == cnid --> local idx 1
>> +	 */
>> +	p->numa_page_access[!!(pnid == cnid)] += pages;
> If the per-task information isn't used anywhere, why not accumulate
> directly into task's cfs_rq->{local,remote}_page_access?
> 

This is to avoid a hierarchy update on each PF; accumulating the counters
in the task and updating the hierarchy together later should cost less.

Besides, since they are no longer reset, maybe we could expose them too.
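
Rough sketch of the accumulate-then-fold idea (fold_task_locality() and
numa_page_folded[] are made-up names for illustration, not the patch's
exact code):

	/*
	 * PF path: only bump p->numa_page_access[], no hierarchy walk.
	 * Tick path: fold the delta since the last fold into the cfs_rq,
	 * so the hierarchy is touched once per tick instead of per PF.
	 */
	static void fold_task_locality(struct cfs_rq *cfs_rq, struct task_struct *p)
	{
		/* idx 0 is remote, idx 1 is local, as in update_task_locality() */
		cfs_rq->remote_page_access += p->numa_page_access[0] - p->numa_page_folded[0];
		cfs_rq->local_page_access  += p->numa_page_access[1] - p->numa_page_folded[1];

		p->numa_page_folded[0] = p->numa_page_access[0];
		p->numa_page_folded[1] = p->numa_page_access[1];
	}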

>> @@ -4298,6 +4359,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>>  	 */
>>  	update_load_avg(cfs_rq, curr, UPDATE_TG);
>>  	update_cfs_group(curr);
>> +	update_group_locality(cfs_rq);
> With the per-NUMA node time tracked separately, isn't it unnecessary
> to do group updates inside entity_tick?

The hierarchy update can't be avoided, and this is a good place for it:
we are already holding the rq lock and iterating the cfs_rq hierarchy
for the current task.
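
For reference, the tick path in mainline already looks roughly like this
(trimmed), so update_group_locality() just piggybacks on the existing
per-cfs_rq walk:

	static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
	{
		struct cfs_rq *cfs_rq;
		struct sched_entity *se = &curr->se;

		/* walks every cfs_rq on curr's hierarchy with rq lock held */
		for_each_sched_entity(se) {
			cfs_rq = cfs_rq_of(se);
			entity_tick(cfs_rq, se, queued);
		}

		/* ... task_tick_numa() etc. trimmed ... */
	}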

Regards,
Michael Wang

> 
> 
> Regards,
> Michal
> 
