Message-ID: <87cybsr72w.fsf@toke.dk>
Date: Wed, 28 May 2025 19:14:15 +0200
From: Toke Høiland-Jørgensen <toke@...hat.com>
To: Arnaldo Carvalho de Melo <acme@...nel.org>
Cc: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org, Jesper
Dangaard Brouer <hawk@...nel.org>, "David S. Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski
<kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman
<horms@...nel.org>, Shuah Khan <shuah@...nel.org>, Ilias Apalodimas
<ilias.apalodimas@...aro.org>
Subject: Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool
benchmark

Arnaldo Carvalho de Melo <acme@...nel.org> writes:
> On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
>> Mina Almasry <almasrymina@...gle.com> writes:
>> > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
>> >> Back when you posted the first RFC, Jesper and I chatted about ways to
>> >> avoid the ugly "load module and read the output from dmesg" interface to
>> >> the test.
>
>> > I agree the existing interface is ugly.
>
>> >> One idea we came up with was to make the module include only the "inner"
>> >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
>> >> test runner can be a BPF program that runs the tests, collects the data
>> >> and passes it to userspace via maps or a ringbuffer or something. That's
>> >> a nicer and more customisable interface than the printk output. And if
>> >> they're small enough, maybe we could even include the functions into the
>> >> page_pool code itself, instead of in a separate benchmark module?
>
>> >> WDYT of that idea? :)
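
[As an illustration of the kfunc idea sketched above: exposing the benchmark's
inner loop to BPF could look roughly like this on the kernel side. This is a
hedged, untested sketch; the function and set names are hypothetical, not from
the actual patch.]

```c
/* Sketch: exposing the benchmark's inner loop as a kfunc so a BPF
 * test runner can call it. All names here are illustrative. */
#include <linux/bpf.h>
#include <linux/btf.h>
#include <linux/btf_ids.h>
#include <linux/module.h>

/* Hypothetical inner benchmark function: runs the timed page_pool
 * alloc/free loop and returns the measured cost. */
__bpf_kfunc u64 bench_page_pool_fast_path(u32 loops)
{
	/* ... run the timed loop, return e.g. total cycles ... */
	return 0;
}

BTF_KFUNCS_START(bench_kfunc_ids)
BTF_ID_FLAGS(func, bench_page_pool_fast_path)
BTF_KFUNCS_END(bench_kfunc_ids)

static const struct btf_kfunc_id_set bench_kfunc_set = {
	.owner = THIS_MODULE,
	.set   = &bench_kfunc_ids,
};

static int __init bench_init(void)
{
	/* Register the kfunc for BPF "syscall" programs, which a
	 * userspace test runner can trigger on demand. */
	return register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
					 &bench_kfunc_set);
}
module_init(bench_init);
MODULE_LICENSE("GPL");
```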
>
>> > ...but this sounds like an enormous amount of effort for something
>> > that is a bit ugly but isn't THAT bad. Especially for me: I'm not
>> > enough of an expert to know how to implement what you're referring
>> > to off the top of my head. I'm normally open to spending the time,
>> > but this is not that high on my todo list and I have limited
>> > bandwidth to resolve it :(
>
>> > I also feel that this is something that could be improved post merge.
>
> agreed
>
>> > I think it's very beneficial to have this merged in some form that can
>> > be improved later. Byungchul is making a lot of changes to these mm
>> > things, and it would be nice to have an easy way to run the benchmark
>> > in tree and maybe even get automated results from NIPA. If we could
>> > agree on an MVP that is appropriate to merge without too much scope
>> > creep, that would be ideal from my side at least.
>
>> Right, fair. I guess we can merge it as-is, and then investigate whether
>> we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
>
> tl;dr: I'd advise merging it as-is, then kfunc'ifying parts of it and
> using it from a 'perf bench' suite.
>
> Yeah, the model would be what I did for uprobes, but even then there is
> a selftests-based uprobes benchmark ;-)
>
> The 'perf bench' part, that calls into the skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
>
> The skel:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
>
> While this one is just to generate BPF load to measure the impact on
> uprobes, for your case it would involve using a ring buffer to
> communicate from the skel (BPF/kernel side) to the userspace part,
> similar to what is done in various other BPF based perf tooling
> available in:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
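
[As an illustration of the ring-buffer plumbing described above: the BPF skel
side could push benchmark results through a BPF ring buffer along these lines.
A hedged sketch only; the map, struct, and kfunc names are made up, and
`vmlinux.h` is assumed to be generated as usual for skel builds.]

```c
/* Sketch: BPF program (skel side) pushing one benchmark result record
 * through a BPF ring buffer to userspace. Names are illustrative. */
#include "vmlinux.h"          /* generated BTF header, as in other skels */
#include <bpf/bpf_helpers.h>

struct bench_event {
	__u64 loops;
	__u64 cycles;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 0;

	e->loops = 10000000;
	/* Here the benchmark kfunc from the module would be called, e.g.
	 * e->cycles = bench_page_pool_fast_path(e->loops); */
	e->cycles = 0;

	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```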
>
> Like at this line (BPF skel part):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
>
> The simplest part is in the canonical, standalone runqslower tool, also
> hosted in the kernel sources:
>
> BPF skel sending stuff to userspace:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
>
> The userspace part that reads it:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
>
> This is a callback that gets invoked for every event the BPF skel
> produces, from this loop:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
>
> That handle_event callback was associated via:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
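
[runqslower itself uses a perf buffer, as the links above show; with the BPF
ring buffer suggested earlier in the thread, the equivalent userspace side
would be roughly as below. Again a hedged, untested sketch: the map fd is
assumed to come from a loaded skeleton, and struct/function names are made up.]

```c
/* Sketch: userspace loop consuming benchmark events from a BPF ring
 * buffer via libbpf. Assumes the ring buffer map fd comes from a
 * loaded skeleton; names are illustrative. */
#include <stdio.h>
#include <bpf/libbpf.h>

struct bench_event {
	unsigned long long loops;
	unsigned long long cycles;
};

/* Callback invoked for every record the BPF side submits. */
static int handle_event(void *ctx, void *data, size_t len)
{
	const struct bench_event *e = data;

	printf("loops=%llu cycles=%llu (%.2f cycles/op)\n",
	       e->loops, e->cycles,
	       e->loops ? (double)e->cycles / e->loops : 0.0);
	return 0;
}

int consume_events(int map_fd)
{
	struct ring_buffer *rb;
	int err;

	rb = ring_buffer__new(map_fd, handle_event, NULL, NULL);
	if (!rb)
		return -1;

	/* Poll until the benchmark run has reported its results;
	 * a real tool would break out on a sentinel event or count. */
	while ((err = ring_buffer__poll(rb, 100 /* timeout, ms */)) >= 0)
		;

	ring_buffer__free(rb);
	return err;
}
```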
>
> There is a dissection I did about this process a long time ago, but
> still relevant, I think:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
>
> The part explaining the interaction userspace/kernel starts here:
>
> http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
>
> (yeah, it's http, but then, it's _old_vger ;-)
>
> Doing it in perf is interesting because it gets widely packaged, so
> whatever you add to it gets visibility among people using 'perf bench'
> and also becomes available in most places; it would add to this
> collection:
>
> root@...ber:~# perf bench
> Usage:
> perf bench [<common options>] <collection> <benchmark> [<options>]
>
> # List of all available benchmark collections:
>
> sched: Scheduler and IPC benchmarks
> syscall: System call benchmarks
> mem: Memory access benchmarks
> numa: NUMA scheduling and MM benchmarks
> futex: Futex stressing benchmarks
> epoll: Epoll stressing benchmarks
> internals: Perf-internals benchmarks
> breakpoint: Breakpoint benchmarks
> uprobe: uprobe benchmarks
> all: All benchmarks
>
> root@...ber:~#
>
> the 'perf bench' that uses BPF skel:
>
> root@...ber:~# perf bench uprobe baseline
> # Running 'uprobe/baseline' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,050,383 usecs
>
> 1,050.383 usecs/op
> root@...ber:~# perf trace --summary perf bench uprobe trace_printk
> # Running 'uprobe/trace_printk' benchmark:
> # Executed 1,000 usleep(1000) calls
> Total time: 1,053,082 usecs
>
> 1,053.082 usecs/op
>
> Summary of events:
>
> uprobe-trace_pr (1247691), 3316 events, 96.9%
>
>    syscall            calls  errors    total      min      avg      max  stddev
>                                       (msec)   (msec)   (msec)   (msec)     (%)
>    ---------------  ------- -------  ------- -------- -------- -------- -------
>    clock_nanosleep     1000       0 1101.236    1.007    1.101   50.939   4.53%
>    close                 98       0   32.979    0.001    0.337   32.821  99.52%
>    perf_event_open        1       0   18.691   18.691   18.691   18.691   0.00%
>    mmap                 209       0    0.567    0.001    0.003    0.007   2.59%
>    bpf                   38       2    0.380    0.000    0.010    0.092  28.38%
>    openat                65       0    0.171    0.001    0.003    0.012   7.14%
>    mprotect              56       0    0.141    0.001    0.003    0.008   6.86%
>    read                  68       0    0.082    0.001    0.001    0.010  11.60%
>    fstat                 65       0    0.056    0.001    0.001    0.003   5.40%
>    brk                   10       0    0.050    0.001    0.005    0.012  24.29%
>    pread64                8       0    0.042    0.001    0.005    0.021  49.29%
> <SNIP other syscalls>
>
> root@...ber:~#

Cool, thanks for the pointers! I guess we'd need to restructure the
functions to be benchmarked a bit, but that should be doable.
-Toke