linux-kernel - [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250529065537.529937-1-howardchu95@gmail.com>
Date: Wed, 28 May 2025 23:55:36 -0700
From: Howard Chu <howardchu95@...il.com>
To: acme@...nel.org
Cc: mingo@...hat.com,
	namhyung@...nel.org,
	mark.rutland@....com,
	alexander.shishkin@...ux.intel.com,
	jolsa@...nel.org,
	irogers@...gle.com,
	adrian.hunter@...el.com,
	peterz@...radead.org,
	kan.liang@...ux.intel.com,
	linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Howard Chu <howardchu95@...il.com>,
	Song Liu <song@...nel.org>
Subject: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances

perf trace utilizes the tracepoint utility, the only filter in perf
trace is a filter on syscall type. For example, if perf traces only
openat, then it filters all the other syscalls, such as readlinkat,
readv, etc.

This filtering is flawed. Consider this case: two perf trace
instances are running at the same time, trace instance A tracing
readlinkat, trace instance B tracing openat. When an openat syscall
enters, it triggers both BPF programs (sys_enter) in both perf trace
instances, these kernel functions will be executed:

perf_syscall_enter
  perf_call_bpf_enter
    trace_call_bpf
      bpf_prog_run_array

In bpf_prog_run_array:
~~~
while ((prog = READ_ONCE(item->prog))) {
	run_ctx.bpf_cookie = item->bpf_cookie;
	ret &= run_prog(prog, ctx);
	item++;
}
~~~

I'm not a BPF expert, but by tinkering I found that if one of the BPF
programs returns 0, there will be no tracepoint sample. That is,

(Is there a sample?) = ProgRetA | ProgRetB | ProgRetC

Where ProgRetA is the return value of one of the BPF programs in the BPF
program array.

Go back to the case, when two perf trace instances are tracing two
different syscalls, again, A is tracing readlinkat, B is tracing openat,
when an openat syscall enters, it triggers the sys_enter program in
instance A, call it ProgA, and the sys_enter program in instance B,
ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) |
ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
when there really should be one.

I also want to point out that openat and readlinkat have augmented
output, so my example might not be accurate, but it does explain the
current perf-trace-in-parallel dilemma.

Now for augmented output, it is different. When it calls
bpf_perf_event_output, there is a sample. There won't be no ProgRetA |
ProgRetB... thing. So I will send another RFC patch to enable
parallelism using this feature. Also, augmented_output creates a sample
on it's own, so returning 1 will create a duplicated sample, when
augmented, just return 0 instead.

Is this approach perfect? Absolutely not, there will likely be some
performance overhead on the kernel side. It is just a quick dirty fix
that makes perf trace run in parallel without failing. This patch is an
explanation on the reason of failures and possibly, a link used in a
nack comment.

Cc: Song Liu <song@...nel.org>
Signed-off-by: Howard Chu <howardchu95@...il.com>
---
 .../util/bpf_skel/augmented_raw_syscalls.bpf.c   | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
index e4352881e3fa..315fadf01321 100644
--- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
+++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
@@ -546,13 +546,14 @@ int sys_enter(struct syscall_enter_args *args)
 	/*
 	 * Jump to syscall specific augmenter, even if the default one,
 	 * "!raw_syscalls:unaugmented" that will just return 1 to return the
-	 * unaugmented tracepoint payload.
+	 * unaugmented tracepoint payload. If augmented, return 0 to reduce a
+	 * duplicated tracepoint sample.
 	 */
-	if (augment_sys_enter(args, &augmented_args->args))
-		bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
+	if (!augment_sys_enter(args, &augmented_args->args))
+		return 0;

-	// If not found on the PROG_ARRAY syscalls map, then we're filtering it:
-	return 0;
+	bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
+	return 1;
 }

 SEC("tp/raw_syscalls/sys_exit")
@@ -570,10 +571,7 @@ int sys_exit(struct syscall_exit_args *args)
 	 * unaugmented tracepoint payload.
 	 */
 	bpf_tail_call(args, &syscalls_sys_exit, exit_args.syscall_nr);
-	/*
-	 * If not found on the PROG_ARRAY syscalls map, then we're filtering it:
-	 */
-	return 0;
+	return 1;
 }

 char _license[] SEC("license") = "GPL";
-- 
2.45.2