linux-kernel - possible bpf overflow/output bug introduced in 6.10?


Message-ID: <Zo64cpho2cFQiOeE@LQ3V64L9R2>
Date: Wed, 10 Jul 2024 09:36:02 -0700
From: Joe Damato <jdamato@...tly.com>
To: me@...ehuey.com
Cc: acme@...nel.org, andrii.nakryiko@...il.com, bpf@...r.kernel.org,
	elver@...gle.com, jolsa@...nel.org, khuey@...ehuey.com,
	linux-kernel@...r.kernel.org, mingo@...nel.org, namhyung@...nel.org,
	peterz@...radead.org, robert@...llahan.org, yonghong.song@...ux.dev,
	mkarsten@...terloo.ca
Subject: possible bpf overflow/output bug introduced in 6.10?

Greetings:

While testing some unrelated networking code with Martin Karsten (cc'd on
this email) we discovered what appears to be some sort of overflow bug in
bpf.

git bisect suggests that commit f11f10bfa1ca ("perf/bpf: Call BPF handler
directly, not through overflow machinery") is the first commit where the
(I assume) buggy behavior appears.

Running the following on my machine as of the commit mentioned above:

  bpftrace -e 'tracepoint:napi:napi_poll { @[args->work] = count(); }'

while simultaneously transferring data to the target machine (in my case, I
scp'd a 100MiB file of zeros in a loop) results in very strange output
(snipped):

  @[11]: 5
  @[18]: 5
  @[-30590]: 6
  @[10]: 7
  @[14]: 9
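
For anyone who wants to reproduce this, the traffic side can be sketched
roughly as below. The function names, file path, and scp destination are
placeholders for illustration, not the exact commands I ran:

```shell
#!/bin/sh
# Sketch of the traffic generator; all names and paths are placeholders.

# Create a file of zeros. $1 = path, $2 = size in MiB (default 100).
make_zero_file() {
    dd if=/dev/zero of="$1" bs=1048576 count="${2:-100}" 2>/dev/null
}

# Copy the file to the destination repeatedly. The loop ends when scp
# fails or when it is interrupted with Ctrl-C.
send_loop() {
    while scp -q "$1" "$2"; do :; done
}

# Example (not run automatically):
#   make_zero_file /tmp/zeros.bin
#   send_loop /tmp/zeros.bin user@target:/tmp/zeros.bin
```

Run the bpftrace one-liner on the receiving machine while the loop is
active.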

As far as I can tell, neither the driver on my test system (mlx5) nor the
driver Martin is using (mlx4) would ever return a negative value from its
napi poll function.

As such, I don't think it is possible for args->work to ever be a large
negative number, but perhaps I am misunderstanding something?
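
To pick the implausible entries out of a long bpftrace map dump, a small
filter along these lines may help. It is only a sketch: the function name
is made up, and the default upper bound of 64 is an assumption (the usual
NAPI weight); the tracepoint also exposes args->budget, which would be the
more reliable bound.

```shell
# Print map entries whose key (args->work) falls outside 0..budget.
# $1 = budget; defaults to 64, which is an assumption, not a measured value.
flag_suspect_work() {
    awk -v budget="${1:-64}" '
        # Match map lines such as "@[-30590]: 6"
        /^[ \t]*@\[-?[0-9]+\]:/ {
            line = $0
            sub(/^[ \t]*@\[/, "", line)        # drop the leading "@["
            key = line
            sub(/\].*/, "", key)               # keep the text before "]"
            cnt = line
            sub(/.*\]:[ \t]*/, "", cnt)        # keep the text after "]:"
            if (key + 0 < 0 || key + 0 > budget + 0)
                printf "work=%d count=%d\n", key, cnt
        }'
}
```

Usage: save the bpftrace output to a file and run
`flag_suspect_work < output.txt`; only out-of-range keys are printed.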

I would like to note that commit 14e40a9578b7 ("perf/bpf: Remove #ifdef
CONFIG_BPF_SYSCALL from struct perf_event members") does not exhibit this
behavior and the output seems reasonable on my test system. Martin confirms
the same for both commits on his test system, which uses different hardware
than mine.

Is this an expected side effect of this change? I would expect it is not
and that the output is a bug of some sort. My apologies: I am not
particularly familiar with the bpf code and cannot suggest what the root
cause might be.

If it is not a bug, can anyone suggest what this output might mean or
how the script run above should be modified?

Thanks,
Joe