linux-kernel - [RFC Patch] perf_event: fix a cgroup switch warning

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <20190514002747.7047-1-xiyou.wangcong@gmail.com>
Date:   Mon, 13 May 2019 17:27:47 -0700
From:   Cong Wang <xiyou.wangcong@...il.com>
To:     linux-kernel@...r.kernel.org
Cc:     Cong Wang <xiyou.wangcong@...il.com>,
        Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        Arnaldo Carvalho de Melo <acme@...nel.org>,
        Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
        Jiri Olsa <jolsa@...hat.com>,
        Namhyung Kim <namhyung@...nel.org>
Subject: [RFC Patch] perf_event: fix a cgroup switch warning

We have been consistently triggering the warning
WARN_ON_ONCE(cpuctx->cgrp) in perf_cgroup_switch() for a rather
long time, although we still have no clue on how to reproduce it.

Looking into the code, it seems the only possibility here is that
the process calling perf_event_open() with a cgroup target exits
before the process in the target cgroup exits but after it gains
CPU to run. This is because we use the atomic counter
perf_cgroup_events as an indication of whether cgroup perf event
has enabled or not, which is inaccurate, illustrated as below:

CPU 0					CPU 1
// open perf events with a cgroup
// target for all CPU's
perf_event_open():
  account_event_cpu()
  // perf_cgroup_events == 1
				// Schedule in a process in the target cgroup
				perf_cgroup_switch()
perf_event_release_kernel():
  unaccount_event_cpu()
  // perf_cgroup_events == 0
				// schedule out
				// but perf_cgroup_sched_out() is skipped
				// cpuctx->cgrp left as non-NULL

				// schedule in another process
				perf_cgroup_switch() // WARN triggerred

The proposed fix is kinda ugly, as it adds a flag in each process to
indicate whether this process has to go through perf_cgroup_sched_out()
when perf_cgroup_events is false negative. The other possible fix is
to force a reschedule on each target CPU before decreasing the counter
perf_cgroup_events, but this is expensive.

Suggestions? Thoughts?

Cc: Ingo Molnar <mingo@...hat.com>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: Alexander Shishkin <alexander.shishkin@...ux.intel.com>
Cc: Jiri Olsa <jolsa@...hat.com>
Cc: Namhyung Kim <namhyung@...nel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@...il.com>
---
 include/linux/sched.h | 3 +++
 kernel/events/core.c  | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2cd15855bad..835bdf15f92c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -733,6 +733,9 @@ struct task_struct {
 	/* to be used once the psi infrastructure lands upstream. */
 	unsigned			use_memdelay:1;
 #endif
+#ifdef CONFIG_PERF_EVENTS
+	unsigned			perf_cgroup_sched_in:1;
+#endif
 
 	unsigned long			atomic_flags; /* Flags requiring atomic access. */
 
diff --git a/kernel/events/core.c b/kernel/events/core.c
index abbd4b3b96c2..9b86b043018e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -817,6 +817,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			 * to event_filter_match() in event_sched_out()
 			 */
 			cpuctx->cgrp = NULL;
+			task->perf_cgroup_sched_in = 0;
 		}
 
 		if (mode & PERF_CGROUP_SWIN) {
@@ -831,6 +832,7 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
 			cpuctx->cgrp = perf_cgroup_from_task(task,
 							     &cpuctx->ctx);
 			cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
+			task->perf_cgroup_sched_in = 1;
 		}
 		perf_pmu_enable(cpuctx->ctx.pmu);
 		perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3233,7 +3235,8 @@ void __perf_event_task_sched_out(struct task_struct *task,
 	 * to check if we have to switch out PMU state.
 	 * cgroup event are system-wide mode only
 	 */
-	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)))
+	if (atomic_read(this_cpu_ptr(&perf_cgroup_events)) ||
+	    task->perf_cgroup_sched_in)
 		perf_cgroup_sched_out(task, next);
 }
 
-- 
2.21.0