Message-ID: <aDcU51dx0N9d-aHz@x1>
Date: Wed, 28 May 2025 10:51:35 -0300
From: Arnaldo Carvalho de Melo <acme@...nel.org>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
Jesper Dangaard Brouer <hawk@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, Shuah Khan <shuah@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>
Subject: Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool
benchmark
On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
> Mina Almasry <almasrymina@...gle.com> writes:
> > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >> Back when you posted the first RFC, Jesper and I chatted about ways to
> >> avoid the ugly "load module and read the output from dmesg" interface to
> >> the test.
> > I agree the existing interface is ugly.
> >> One idea we came up with was to make the module include only the "inner"
> >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
> >> test runner can be a BPF program that runs the tests, collects the data
> >> and passes it to userspace via maps or a ringbuffer or something. That's
> >> a nicer and more customisable interface than the printk output. And if
> >> they're small enough, maybe we could even include the functions into the
> >> page_pool code itself, instead of in a separate benchmark module?
> >> WDYT of that idea? :)
> > ...but this sounds like an enormous amount of effort for something
> > that is a bit ugly but isn't THAT bad. Especially since I'm not
> > enough of an expert to know how to implement what you're referring
> > to off the top of my head. I'm normally open to spending the time,
> > but this is not that high on my todo list and I have limited
> > bandwidth to resolve this :(
> > I also feel that this is something that could be improved post merge.
agreed
> > I think it's very beneficial to have this merged in some form that can
> > be improved later. Byungchul is making a lot of changes to these mm
> > things and it would be nice to have an easy way to run the benchmark
> > in tree and maybe even get automated results from NIPA. If we could
> > agree on an MVP that is appropriate to merge without too much scope
> > creep, that would be ideal from my side at least.
> Right, fair. I guess we can merge it as-is, and then investigate whether
> we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
tl;dr: I'd advise merging it as-is, then kfunc'ifying parts of it and
using them from a 'perf bench' suite.
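To make the kfunc'ification part concrete, here is a rough sketch of what the module side could look like. All names here (bpf_page_pool_bench_run, bench_pool, the id-set names) are made up for illustration, not an actual patch; the registration follows the usual BTF kfunc id-set pattern on recent kernels:

```
/* Hypothetical module code: expose the benchmark's inner loop as a
 * kfunc that a BPF program (e.g. one loaded by 'perf bench') can call. */
__bpf_kfunc u64 bpf_page_pool_bench_run(u32 loops)
{
	struct page_pool *pool = bench_pool;	/* set up at module init */
	u64 start = ktime_get_ns();
	u32 i;

	for (i = 0; i < loops; i++) {
		struct page *page = page_pool_alloc_pages(pool, GFP_ATOMIC);

		if (!page)
			break;
		page_pool_put_full_page(pool, page, false);
	}
	return ktime_get_ns() - start;		/* nanoseconds spent */
}

BTF_KFUNCS_START(page_pool_bench_ids)
BTF_ID_FLAGS(func, bpf_page_pool_bench_run)
BTF_KFUNCS_END(page_pool_bench_ids)

static const struct btf_kfunc_id_set page_pool_bench_set = {
	.owner = THIS_MODULE,
	.set   = &page_pool_bench_ids,
};

/* in the module init function:
 *	register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
 *				  &page_pool_bench_set);
 */
```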
Yeah, the model would be what I did for uprobes, though even there a
selftests-based uprobes benchmark exists ;-)
The 'perf bench' part that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
While that one just generates BPF load to measure the impact on
uprobes, your case would involve using a ring buffer to communicate
from the skel (the BPF/kernel side) to the userspace part, similar to
what is done in various other BPF-based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
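For the page_pool benchmark case, the skel side of that ring-buffer handoff could look roughly like this. This is only a sketch, not the off_cpu.bpf.c code: bpf_page_pool_bench_run is a hypothetical kfunc name and the event layout is made up:

```
/* bench_page_pool.bpf.c - sketch of the BPF/kernel side */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct bench_event {
	__u32 loops;
	__u64 ns;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

/* hypothetical kfunc exported by the benchmark module */
extern __u64 bpf_page_pool_bench_run(__u32 loops) __ksym;

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 1;

	e->loops = 10000;
	e->ns = bpf_page_pool_bench_run(e->loops);
	/* userspace picks this up via ring_buffer__poll() */
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```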
The simplest example is the canonical, standalone runqslower tool, also
hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
This is a callback invoked for every event the BPF skel produces,
driven from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
I did a dissection of this process a long time ago that I think is
still relevant:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
(yeah, it's http, but then, it's _old_vger ;-)
Doing it in perf is interesting because perf is widely packaged, so
whatever you add gets visibility with people using 'perf bench' and
becomes available in most places; it would add to this collection:
root@...ber:~# perf bench

Usage:
	perf bench [<common options>] <collection> <benchmark> [<options>]

        # List of all available benchmark collections:

         sched: Scheduler and IPC benchmarks
       syscall: System call benchmarks
           mem: Memory access benchmarks
          numa: NUMA scheduling and MM benchmarks
         futex: Futex stressing benchmarks
         epoll: Epoll stressing benchmarks
     internals: Perf-internals benchmarks
    breakpoint: Breakpoint benchmarks
        uprobe: uprobe benchmarks
           all: All benchmarks

root@...ber:~#
The 'perf bench' suite that uses a BPF skel:
root@...ber:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,050,383 usecs
1,050.383 usecs/op
root@...ber:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,053,082 usecs
1,053.082 usecs/op
Summary of events:
uprobe-trace_pr (1247691), 3316 events, 96.9%
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   ---------------  -------  ------  --------  --------  --------  --------  ------
   clock_nanosleep     1000       0  1101.236     1.007     1.101    50.939   4.53%
   close                 98       0    32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1       0    18.691    18.691    18.691    18.691   0.00%
   mmap                 209       0     0.567     0.001     0.003     0.007   2.59%
   bpf                   38       2     0.380     0.000     0.010     0.092  28.38%
   openat                65       0     0.171     0.001     0.003     0.012   7.14%
   mprotect              56       0     0.141     0.001     0.003     0.008   6.86%
   read                  68       0     0.082     0.001     0.001     0.010  11.60%
   fstat                 65       0     0.056     0.001     0.001     0.003   5.40%
   brk                   10       0     0.050     0.001     0.005     0.012  24.29%
   pread64                8       0     0.042     0.001     0.005     0.021  49.29%
   <SNIP other syscalls>
root@...ber:~#
- Arnaldo