Message-ID: <20191028130222.GM4131@hirez.programming.kicks-ass.net>
Date: Mon, 28 Oct 2019 14:02:22 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: 王贇 <yun.wang@...ux.alibaba.com>
Cc: Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] sched/numa: advanced per-cgroup numa statistic
On Thu, Oct 24, 2019 at 11:08:01AM +0800, 王贇 wrote:
> Currently there is no good approach to monitoring per-cgroup
> numa efficiency. This is troublesome especially when groups are
> sharing CPUs: reading a hardware counter cannot tell which group
> caused the remote-memory accesses, since multiple workloads may
> share the same CPU, which makes it painful to find the root cause
> and fix the issue.
>
> In order to address this, we introduce new per-cgroup statistics
> for numa:
> * the numa locality, to indicate the numa balancing efficiency
> * the numa execution time on each node
>
> The task locality is the local page-access ratio traced at numa
> balancing page faults, and the group locality is a histogram of
> task execution time, sectioned by locality into 8 regions.
>
> For example the new entry 'cpu.numa_stat' shows:
> locality 15393 21259 13023 44461 21247 17012 28496 145402
> exectime 311900 407166
>
> Here we know the workloads executed for 311900ms on node_0 and
> 407166ms on node_1; tasks with locality around 0~12% executed for
> 15393ms, and tasks with locality around 88~100% executed for
> 145402ms, which implies that most of the memory accesses are local
> for the workloads of this group.
>
> By monitoring the new statistics, we will be able to know the numa
> efficiency of each cgroup's workloads on the machine, whether or
> not they share CPUs, and to find out which one introduced most of
> the remote accesses.
>
> Besides, the per-node memory topology from 'memory.numa_stat'
> becomes more useful when we have the per-node execution time; a
> workload that always executes on node_0 while its memory is all on
> node_1 is usually a bad case.
>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Michal Koutný <mkoutny@...e.com>
> Signed-off-by: Michael Wang <yun.wang@...ux.alibaba.com>
Mel, can you have a peek at this too?
So this is the part I like least:
> +DEFINE_PER_CPU(struct numa_stat, root_numa_stat);
> +
> +int alloc_tg_numa_stat(struct task_group *tg)
> +{
> + tg->numa_stat = alloc_percpu(struct numa_stat);
> + if (!tg->numa_stat)
> + return 0;
> +
> + return 1;
> +}
> +
> +void free_tg_numa_stat(struct task_group *tg)
> +{
> + free_percpu(tg->numa_stat);
> +}
> +
> +static void update_tg_numa_stat(struct task_struct *p)
> +{
> + struct task_group *tg;
> + unsigned long remote = p->numa_faults_locality[3];
> + unsigned long local = p->numa_faults_locality[4];
> + int idx = -1;
> +
> + /* To be scaled? */
> + if (remote || local)
> + idx = NR_NL_INTERVAL * local / (remote + local + 1);
> +
> + rcu_read_lock();
> +
> + tg = task_group(p);
> + while (tg) {
> + /* skip accounting when there are no fault records */
> + if (idx != -1)
> + this_cpu_inc(tg->numa_stat->locality[idx]);
> +
> + this_cpu_inc(tg->numa_stat->jiffies);
> +
> + tg = tg->parent;
> + }
> +
> + rcu_read_unlock();
> +}
Thing is, we already have a cgroup hierarchy walk in the tick; see
task_tick_fair().
On top of that, you're walking an entirely different set of pointers,
instead of cfs_rq, you're walking tg->parent, which pretty much
guarantees you're adding even more cache misses.
How about you stick those numa_stats in cfs_rq (with cacheline
alignment) and see if you can frob your update loop into the cgroup walk
we already do.