Message-Id: <146965484711.23765.5878825588596955069.stgit@hbathini.in.ibm.com>
Date:	Thu, 28 Jul 2016 02:57:27 +0530
From:	Hari Bathini <hbathini@...ux.vnet.ibm.com>
To:	daniel@...earbox.net, peterz@...radead.org,
	linux-kernel@...r.kernel.org, acme@...nel.org,
	alexander.shishkin@...ux.intel.com, mingo@...hat.com,
	paulus@...ba.org, ebiederm@...ssion.com, kernel@...p.com,
	rostedt@...dmis.org, viro@...iv.linux.org.uk
Cc:	aravinda@...ux.vnet.ibm.com, ananth@...ibm.com
Subject: [RFC PATCH v2 1/3] perf: filter container events based on cgroup
 namespace

From: Aravinda Prasad <aravinda@...ux.vnet.ibm.com>

This patch adds support for filtering container-specific events in the
perf utility, without any change to the user interface, when perf is
invoked from within a container.

Our earlier patch [1] required the container to be created with a PID
namespace. However, during the discussion at Plumbers [2] it was pointed
out that requiring a PID namespace is insufficient for containers that
need access to the host PID namespace [3]. Now that the kernel supports
cgroup namespaces, this patch looks for a cgroup namespace instead of a
PID namespace to filter events, keeping the basic idea of approach [1]
the same while addressing [3].

The patch assumes that tracefs is available within the container and
that all processes running inside the container are grouped under a
single cgroup of the perf_event subsystem.
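
As a quick sanity check of this assumption (purely illustrative, not
part of this patch), a process inside the container could print the
perf_event cgroup it belongs to by parsing /proc/self/cgroup; the
sketch below assumes the cgroup v1 layout with perf_event mounted as
its own named controller:

/*
 * Illustrative helper, not part of the patch: print the perf_event
 * cgroup of the current process.  Assumes the cgroup v1 format
 * "<id>:perf_event:<path>" with perf_event not co-mounted with
 * another controller.
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[512];
	FILE *fp = fopen("/proc/self/cgroup", "r");

	if (!fp) {
		perror("/proc/self/cgroup");
		return 1;
	}

	while (fgets(line, sizeof(line), fp)) {
		char *p = strstr(line, ":perf_event:");

		if (p)
			/* the cgroup path follows the controller name */
			printf("perf_event cgroup: %s",
			       p + strlen(":perf_event:"));
	}

	fclose(fp);
	return 0;
}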


When the command below is run inside a container that uses the global
cgroup namespace

  $ perf record -e kmem:kmalloc -aR

the perf report looks as follows (with a lot of noise):

  $ perf report --sort pid,symbol -n
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 8K of event 'kmem:kmalloc'
  # Event count (approx.): 8487
  #
  # Overhead       Samples    Pid:Command        Symbol
  # ........  ............  ...................  ..........................
  #
      71.56%          6073      0:kworker/dying  [k] __kmalloc
      26.82%          2276      0:kworker/dying  [k] kmem_cache_alloc_trace
       1.48%           126      0:kworker/dying  [k] __kmalloc_track_caller
       0.07%             6      0:curl           [k] kmalloc_order_trace
       0.05%             4    186:perf           [k] __kmalloc
       0.02%             2     61:java           [k] __kmalloc


  $

When the same perf record command is run inside a container with a new
cgroup namespace, only the samples that belong to this container are
listed (a quick way to check which case applies is sketched after the
report):

  $ perf report --sort pid,dso,symbol -n
  #
  #
  # Total Lost Samples: 0
  #
  # Samples: 3  of event 'kmem:kmalloc'
  # Event count (approx.): 3
  #
  # Overhead       Samples    Pid:Command  Symbol
  # ........  ............  .............  .............
  #
     100.00%             3     61:java     [k] __kmalloc


  $
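
Whether a given container falls into the first case (global cgroup
namespace) or the second (new cgroup namespace) can be verified by
comparing the cgroup namespace identifier of a process inside the
container with that of a process on the host, e.g. via
/proc/<pid>/ns/cgroup. A minimal C sketch, purely illustrative and not
part of this patch:

/*
 * Illustrative helper, not part of the patch: print the cgroup
 * namespace identifier of the current process.  If the value differs
 * between a process in the container and one on the host, the
 * container runs in its own cgroup namespace.
 */
#include <stdio.h>
#include <sys/stat.h>

int main(void)
{
	struct stat ns;

	if (stat("/proc/self/ns/cgroup", &ns)) {
		perror("/proc/self/ns/cgroup");
		return 1;
	}

	printf("cgroup namespace: cgroup:[%lu]\n",
	       (unsigned long)ns.st_ino);
	return 0;
}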

In order to filter events specific to a container, this patch assumes the
container is created with a new cgroup namespace.
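
For experimentation, a process tree with a new cgroup namespace can be
created with unshare(2) and CLONE_NEWCGROUP (Linux 4.6+, CAP_SYS_ADMIN
required). A minimal launcher sketch, not part of this patch, that runs
a command such as perf record inside a fresh cgroup namespace:

/*
 * Illustrative launcher, not part of the patch: run a command (for
 * example "perf record -e kmem:kmalloc -aR") in a new cgroup
 * namespace.  Requires Linux >= 4.6 and CAP_SYS_ADMIN.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

int main(int argc, char *argv[])
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
		return 1;
	}

	/* detach from the caller's cgroup namespace */
	if (unshare(CLONE_NEWCGROUP)) {
		perror("unshare(CLONE_NEWCGROUP)");
		return 1;
	}

	execvp(argv[1], &argv[1]);
	perror("execvp");
	return 1;
}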

[1] https://lkml.org/lkml/2015/7/15/192
[2] http://linuxplumbersconf.org/2015/ocw/sessions/2667.html
[3] Notes for container-aware tracing:
	https://etherpad.openstack.org/p/LPC2015_Containers

Signed-off-by: Aravinda Prasad <aravinda@...ux.vnet.ibm.com>
Signed-off-by: Hari Bathini <hbathini@...ux.vnet.ibm.com>
---
 kernel/events/core.c |   51 +++++++++++++++++++++++++++++++++++---------------
 1 file changed, 36 insertions(+), 15 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 43d43a2d..d7ef1e1 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -764,17 +764,38 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 {
 	struct perf_cgroup *cgrp;
 	struct cgroup_subsys_state *css;
-	struct fd f = fdget(fd);
+	struct fd f;
 	int ret = 0;
 
-	if (!f.file)
-		return -EBADF;
+	if (fd != -1) {
+		f = fdget(fd);
+		if (!f.file)
+			return -EBADF;
 
-	css = css_tryget_online_from_dir(f.file->f_path.dentry,
-					 &perf_event_cgrp_subsys);
-	if (IS_ERR(css)) {
-		ret = PTR_ERR(css);
-		goto out;
+		css = css_tryget_online_from_dir(f.file->f_path.dentry,
+						 &perf_event_cgrp_subsys);
+		if (IS_ERR(css)) {
+			ret = PTR_ERR(css);
+			fdput(f);
+			return ret;
+		}
+	} else if (event->attach_state == PERF_ATTACH_TASK) {
+		/* Tracing on a PID. No need to set event->cgrp */
+		return ret;
+	} else if (current->nsproxy->cgroup_ns != &init_cgroup_ns) {
+		/* Don't set event->cgrp if task belongs to root cgroup */
+		if (task_css_is_root(current, perf_event_cgrp_id))
+			return ret;
+
+		css = task_css(current, perf_event_cgrp_id);
+		if (!css || !css_tryget_online(css))
+			return -ENOENT;
+	} else {
+		/*
+		 * perf invoked from global context and hence don't set
+		 * event->cgrp as all the events should be included
+		 */
+		return ret;
 	}
 
 	cgrp = container_of(css, struct perf_cgroup, css);
@@ -789,8 +810,10 @@ static inline int perf_cgroup_connect(int fd, struct perf_event *event,
 		perf_detach_cgroup(event);
 		ret = -EINVAL;
 	}
-out:
-	fdput(f);
+
+	if (fd != -1)
+		fdput(f);
+
 	return ret;
 }
 
@@ -8864,11 +8887,9 @@ perf_event_alloc(struct perf_event_attr *attr, int cpu,
 	if (!has_branch_stack(event))
 		event->attr.branch_sample_type = 0;
 
-	if (cgroup_fd != -1) {
-		err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
-		if (err)
-			goto err_ns;
-	}
+	err = perf_cgroup_connect(cgroup_fd, event, attr, group_leader);
+	if (err)
+		goto err_ns;
 
 	pmu = perf_init_event(event);
 	if (!pmu)
