Message-ID: <aDcU51dx0N9d-aHz@x1>
Date: Wed, 28 May 2025 10:51:35 -0300
From: Arnaldo Carvalho de Melo <acme@...nel.org>
To: Toke Høiland-Jørgensen <toke@...hat.com>
Cc: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org,
Jesper Dangaard Brouer <hawk@...nel.org>,
"David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Simon Horman <horms@...nel.org>, Shuah Khan <shuah@...nel.org>,
Ilias Apalodimas <ilias.apalodimas@...aro.org>
Subject: Re: [PATCH RFC net-next v2] page_pool: import Jesper's page_pool
benchmark
On Wed, May 28, 2025 at 11:28:54AM +0200, Toke Høiland-Jørgensen wrote:
> Mina Almasry <almasrymina@...gle.com> writes:
> > On Mon, May 26, 2025 at 5:51 AM Toke Høiland-Jørgensen <toke@...hat.com> wrote:
> >> Back when you posted the first RFC, Jesper and I chatted about ways to
> >> avoid the ugly "load module and read the output from dmesg" interface to
> >> the test.
> > I agree the existing interface is ugly.
> >> One idea we came up with was to make the module include only the "inner"
> >> functions for the benchmark, and expose those to BPF as kfuncs. Then the
> >> test runner can be a BPF program that runs the tests, collects the data
> >> and passes it to userspace via maps or a ringbuffer or something. That's
> >> a nicer and more customisable interface than the printk output. And if
> >> they're small enough, maybe we could even include the functions into the
> >> page_pool code itself, instead of in a separate benchmark module?
> >> WDYT of that idea? :)
> > ...but this sounds like an enormous amount of effort for something
> > that is a bit ugly but isn't THAT bad. Especially since I'm not
> > enough of an expert to know how to implement what you're referring
> > to off the top of my head. I'm normally open to spending the time,
> > but this is not that high on my todo list and I have limited
> > bandwidth to resolve this :(
> > I also feel that this is something that could be improved post merge.
agreed
> > I think it's very beneficial to have this merged in some form that can
> > be improved later. Byungchul is making a lot of changes to these mm
> > things and it would be nice to have an easy way to run the benchmark
> > in tree and maybe even get automated results from NIPA. If we could
> > agree on an MVP that is appropriate to merge without too much scope
> > creep, that would be ideal from my side at least.
> Right, fair. I guess we can merge it as-is, and then investigate whether
> we can move it to BPF-based (or maybe 'perf bench' - Cc acme) later :)
tl;dr: I'd advise merging it as-is, then kfunc'ifying parts of it and
using them from a 'perf bench' suite.
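To make the kfunc'ification part concrete, here is a rough sketch of what the module side could look like. All names here (bpf_page_pool_bench_run, bench_pool, the id-set names) are made up for illustration, not an actual patch; the registration follows the usual BTF kfunc id-set pattern on recent kernels:

```
/* Hypothetical module code: expose the benchmark's inner loop as a
 * kfunc that a BPF program (e.g. one loaded by 'perf bench') can call. */
__bpf_kfunc u64 bpf_page_pool_bench_run(u32 loops)
{
	struct page_pool *pool = bench_pool;	/* set up at module init */
	u64 start = ktime_get_ns();
	u32 i;

	for (i = 0; i < loops; i++) {
		struct page *page = page_pool_alloc_pages(pool, GFP_ATOMIC);

		if (!page)
			break;
		page_pool_put_full_page(pool, page, false);
	}
	return ktime_get_ns() - start;		/* nanoseconds spent */
}

BTF_KFUNCS_START(page_pool_bench_ids)
BTF_ID_FLAGS(func, bpf_page_pool_bench_run)
BTF_KFUNCS_END(page_pool_bench_ids)

static const struct btf_kfunc_id_set page_pool_bench_set = {
	.owner = THIS_MODULE,
	.set   = &page_pool_bench_ids,
};

/* in the module init function:
 *	register_btf_kfunc_id_set(BPF_PROG_TYPE_SYSCALL,
 *				  &page_pool_bench_set);
 */
```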
Yeah, the model would be what I did for uprobes, though even there a
selftests-based uprobes benchmark exists ;-)
The 'perf bench' part that calls into the skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/bench/uprobe.c
The skel:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel/bench_uprobe.bpf.c
While that one just generates BPF load to measure the impact on
uprobes, your case would involve using a ring buffer to communicate
from the skel (the BPF/kernel side) to the userspace part, similar to
what is done in various other BPF-based perf tooling available in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf/util/bpf_skel
Like at this line (BPF skel part):
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/bpf_skel/off_cpu.bpf.c?h=perf-tools-next#n253
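For the page_pool benchmark case, the skel side of that ring-buffer handoff could look roughly like this. This is only a sketch, not the off_cpu.bpf.c code: bpf_page_pool_bench_run is a hypothetical kfunc name and the event layout is made up:

```
/* bench_page_pool.bpf.c - sketch of the BPF/kernel side */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct bench_event {
	__u32 loops;
	__u64 ns;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} events SEC(".maps");

/* hypothetical kfunc exported by the benchmark module */
extern __u64 bpf_page_pool_bench_run(__u32 loops) __ksym;

SEC("syscall")
int run_bench(void *ctx)
{
	struct bench_event *e;

	e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
	if (!e)
		return 1;

	e->loops = 10000;
	e->ns = bpf_page_pool_bench_run(e->loops);
	/* userspace picks this up via ring_buffer__poll() */
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```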
The simplest example is the canonical, standalone runqslower tool, also
hosted in the kernel sources:
BPF skel sending stuff to userspace:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.bpf.c#n99
The userspace part that reads it:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n90
This is a callback invoked for every event the BPF skel produces,
driven from this loop:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n162
That handle_event callback was associated via:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/runqslower/runqslower.c#n153
I did a dissection of this process a long time ago that I think is
still relevant:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/33
The part explaining the interaction userspace/kernel starts here:
http://oldvger.kernel.org/~acme/bpf/devconf.cz-2020-BPF-The-Status-of-BTF-producers-consumers/#/40
(yeah, it's http, but then, it's _old_vger ;-)
Doing it in perf is interesting because perf is widely packaged, so
whatever you add gets visibility with people using 'perf bench' and
becomes available in most places; it would add to this collection:
root@...ber:~# perf bench

Usage:
	perf bench [<common options>] <collection> <benchmark> [<options>]

        # List of all available benchmark collections:

         sched: Scheduler and IPC benchmarks
       syscall: System call benchmarks
           mem: Memory access benchmarks
          numa: NUMA scheduling and MM benchmarks
         futex: Futex stressing benchmarks
         epoll: Epoll stressing benchmarks
     internals: Perf-internals benchmarks
    breakpoint: Breakpoint benchmarks
        uprobe: uprobe benchmarks
           all: All benchmarks

root@...ber:~#
The 'perf bench' suite that uses a BPF skel:
root@...ber:~# perf bench uprobe baseline
# Running 'uprobe/baseline' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,050,383 usecs
1,050.383 usecs/op
root@...ber:~# perf trace --summary perf bench uprobe trace_printk
# Running 'uprobe/trace_printk' benchmark:
# Executed 1,000 usleep(1000) calls
Total time: 1,053,082 usecs
1,053.082 usecs/op
Summary of events:
uprobe-trace_pr (1247691), 3316 events, 96.9%
   syscall            calls  errors  total       min       avg       max    stddev
                                     (msec)    (msec)    (msec)    (msec)      (%)
   ---------------  -------  ------  --------  --------  --------  --------  ------
   clock_nanosleep     1000       0  1101.236     1.007     1.101    50.939   4.53%
   close                 98       0    32.979     0.001     0.337    32.821  99.52%
   perf_event_open        1       0    18.691    18.691    18.691    18.691   0.00%
   mmap                 209       0     0.567     0.001     0.003     0.007   2.59%
   bpf                   38       2     0.380     0.000     0.010     0.092  28.38%
   openat                65       0     0.171     0.001     0.003     0.012   7.14%
   mprotect              56       0     0.141     0.001     0.003     0.008   6.86%
   read                  68       0     0.082     0.001     0.001     0.010  11.60%
   fstat                 65       0     0.056     0.001     0.001     0.003   5.40%
   brk                   10       0     0.050     0.001     0.005     0.012  24.29%
   pread64                8       0     0.042     0.001     0.005     0.021  49.29%
   <SNIP other syscalls>
root@...ber:~#
- Arnaldo