Message-ID: <CAC_iWjLmO4XZ_+PBaCNxpVCTmGKNBsLGyeeKS2ptRrepn1u0SQ@mail.gmail.com>
Date: Wed, 4 Jun 2025 10:04:58 +0300
From: Ilias Apalodimas <ilias.apalodimas@...aro.org>
To: Arnaldo Carvalho de Melo <acme@...nel.org>, Toke Høiland-Jørgensen <toke@...hat.com>,
Mina Almasry <almasrymina@...gle.com>, Jesper Dangaard Brouer <hawk@...nel.org>
Cc: netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-kselftest@...r.kernel.org, "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, Shuah Khan <shuah@...nel.org>
Subject: Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool benchmark
Hi all,
This is very useful.
On Wed, 28 May 2025 at 16:51, Arnaldo Carvalho de Melo <acme@...nel.org> wrote:
>
> On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
> > Mina Almasry <almasrymina@...gle.com> writes:
> > > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> > >> Back when you posted the first RFC, Jesper and I chatted about ways to
> > >> avoid the ugly "load module and read the output from dmesg" interface to
> > >> the test.
>
> > > I agree the existing interface is ugly.
>
> > >> One idea we came up with was to make the module include only the "inner"
> > >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
> > >> test runner can be a BPF program that runs the tests, collects the data
> > >> and passes it to userspace via maps or a ringbuffer or something. That's
> > >> a nicer and more customisable interface than the printk output. And if
> > >> they're small enough, maybe we could even include the functions into the
> > >> page_pool code itself, instead of in a separate benchmark module?
>
> > >> WDYT of that idea? :)
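For reference, the kfunc idea sketched above could look roughly like the module fragment below. This is purely illustrative, not the actual in-tree code: the function and symbol names (bpf_page_pool_bench_run, page_pool_bench_inner_loop, the id set) are all assumptions.

```c
/* Hypothetical sketch: the module keeps only the inner benchmark loop
 * and exposes it to BPF as a kfunc. All names here are made up for
 * illustration. */
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/module.h>

__bpf_kfunc u64 bpf_page_pool_bench_run(u32 loops)
{
	/* run the inner page_pool alloc/free loop, return elapsed time */
	return page_pool_bench_inner_loop(loops);
}

BTF_KFUNCS_START(page_pool_bench_ids)
BTF_ID_FLAGS(func, bpf_page_pool_bench_run)
BTF_KFUNCS_END(page_pool_bench_ids)

static const struct btf_kfunc_id_set page_pool_bench_set = {
	.owner = THIS_MODULE,
	.set   = &page_pool_bench_ids,
};

static int __init page_pool_bench_init(void)
{
	/* make the kfunc callable from BPF syscall programs */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
					 &page_pool_bench_set);
}
module_init(page_pool_bench_init);
MODULE_LICENSE("GPL");
```

A BPF_PROG_TYPE_SYSCALL program could then call bpf_page_pool_bench_run() and ship the result to userspace via a map or ring buffer.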
>
> > > ...but this sounds like an enormous amount of effort for something
> > > that is a bit ugly but isn't THAT bad. Especially for me: I'm not
> > > enough of an expert to know how to implement what you're referring
> > > to off the top of my head. I'm normally open to spending the time,
> > > but this is not that high on my to-do list and I have limited
> > > bandwidth to resolve it :(
>
> > > I also feel that this is something that could be improved post merge.
>
> agreed
>
> > > I think it's very beneficial to have this merged in some form that can
> > > be improved later. Byungchul is making a lot of changes to these mm
> > > things and it would be nice to have an easy way to run the benchmark
> > > in tree and maybe even get automated results from nipa. If we could
> > > agree on an MVP that is appropriate to merge without too much scope
> > > creep, that would be ideal from my side at least.
>
> > Right, fair. I guess we can merge it as-is, and then investigate whether
> > we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
>
> tldr; I'd advise merging it as-is, then kfunc'ifying parts of it and
> using it from a 'perf bench' suite.
>
> Yeah, the model would be what I did for uprobes, but even then there is
> a selftests based uprobes benchmark ;-)
>
> The 'perf bench' part, that calls into the skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
>
> The skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
>
> While this one is just to generate BPF load to measure the impact on
> uprobes, for your case it would involve using a ring buffer to
> communicate from the skel (BPF/kernel side) to the userspace part,
> similar to what is done in various other BPF based perf tooling
> available in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
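On the BPF side, the ring-buffer path described above could look roughly like this sketch. Everything here is illustrative: bpf_page_pool_bench_run stands in for a hypothetical kfunc exposing the inner benchmark loop, and the event layout and map name are assumptions.

```c
/* Hypothetical BPF skel side: run the benchmark via a kfunc and push
 * the result to userspace through a ring buffer. All names are
 * illustrative, not from the actual patch. */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

/* assumed kfunc exported by the benchmark module */
extern __u64 bpf_page_pool_bench_run(__u32 loops) __ksym;

struct bench_event {
	__u64 loops;
	__u64 cycles;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	e->loops  = 10000;
	e->cycles = bpf_page_pool_bench_run(e->loops);
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```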
>
> Like at this line (BPF skel part):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
>
> The simplest part is in the canonical, standalone runqslower tool, also
> hosted in the kernel sources:
>
> BPF skel sending stuff to userspace:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
>
> The userspace part that reads it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
>
> This is a callback that gets called for every event that the BPF skel
> produces, called from this loop:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
>
> That handle_event callback was associated via:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
>
> There is a dissection I did about this process a long time ago, but
> still relevant, I think:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
>
> The part explaining the interaction userspace/kernel starts here:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
>
> (yeah, it's http, but then, it's _old_vger ;-)
>
> Doing it in perf is interesting because perf gets widely packaged, so
> whatever you add gets visibility from people using 'perf bench' and is
> also available in most places; it would add to this collection:
>
> root@...ber:~# perf bench
> Usage:
> perf bench [<common options>] <collection> <benchmark> [<options>]
>
> # List of all available benchmark collections:
>
> sched: Scheduler and IPC benchmarks
> syscall: System call benchmarks
> mem: Memory access benchmarks
> numa: NUMA scheduling and MM benchmarks
> futex: Futex stressing benchmarks
> epoll: Epoll stressing benchmarks
> internals: Perf-internals benchmarks
> breakpoint: Breakpoint benchmarks
> uprobe: uprobe benchmarks
> all: All benchmarks
>
> root@...ber:~#
>
> the 'perf bench' that uses BPF skel:
>
> root@...ber:~# perf bench uprobe baseline
> # Running 'uprobe/baseline' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,050,383 usecs
>
> 1,050.383 usecs/op
> root@...ber:~# perf trace --summary perf bench uprobe trace_printk
> # Running 'uprobe/trace_printk' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,053,082 usecs
>
> 1,053.082 usecs/op
>
> Summary of events:
>
> uprobe-trace_pr (1247691), 3316 events, 96.9%
>
> syscall          calls errors    total     min     avg     max stddev
>                                  (msec)  (msec)  (msec)  (msec)    (%)
> --------------- ------ ------ -------- ------- ------- ------- ------
> clock_nanosleep   1000      0 1101.236   1.007   1.101  50.939  4.53%
> close               98      0   32.979   0.001   0.337  32.821 99.52%
> perf_event_open      1      0   18.691  18.691  18.691  18.691  0.00%
> mmap               209      0    0.567   0.001   0.003   0.007  2.59%
> bpf                 38      2    0.380   0.000   0.010   0.092 28.38%
> openat              65      0    0.171   0.001   0.003   0.012  7.14%
> mprotect            56      0    0.141   0.001   0.003   0.008  6.86%
> read                68      0    0.082   0.001   0.001   0.010 11.60%
> fstat               65      0    0.056   0.001   0.001   0.003  5.40%
> brk                 10      0    0.050   0.001   0.005   0.012 24.29%
> pread64              8      0    0.042   0.001   0.005   0.021 49.29%
> <SNIP other syscalls>
>
> root@...ber:~#
Thanks for all the pointers here.
Overall I agree we should merge this. Yes, it's not ideal, but we've
been pointing people at this benchmark for several years, asking them
to run it before patches are accepted, and having it out of tree
doesn't help much. It's a test; it's a bit ugly now, but it serves our
purpose and the maintenance burden is minimal.
Acked-by: Ilias Apalodimas <ilias.apalodimas@...aro.org>
>
> - Arnaldo