Message-ID: <c126182c-8f26-41e2-a20d-ceefc2ced886@kernel.org>
Date: Mon, 16 Jun 2025 11:29:33 +0200
From: Jesper Dangaard Brouer <hawk@...nel.org>
To: Mina Almasry <almasrymina@...gle.com>, netdev@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-kselftest@...r.kernel.org
Cc: "David S. Miller" <davem@...emloft.net>,
 Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
 Shuah Khan <shuah@...nel.org>, Ilias Apalodimas
 <ilias.apalodimas@...aro.org>, Toke Høiland-Jørgensen
 <toke@...e.dk>, Ignat Korchagin <ignat@...udflare.com>
Subject: Re: [PATCH net-next v4] page_pool: import Jesper's page_pool
 benchmark



On 15/06/2025 22.59, Mina Almasry wrote:
> From: Jesper Dangaard Brouer <hawk@...nel.org>
> 
> We frequently consult Jesper's out-of-tree page_pool benchmark to
> evaluate page_pool changes.
> 
> Import the benchmark into the upstream linux kernel tree so that (a)
> we're all running the same version, (b) pave the way for shared
> improvements, and (c) maybe one day integrate it with nipa, if possible.
> 
> Import bench_page_pool_simple from commit 35b1716d0c30 ("Add
> page_bench06_walk_all"), from this repository:
> https://github.com/netoptimizer/prototype-kernel.git
> 
> Changes done during upstreaming:
> - Fix checkpatch issues.
> - Remove the tasklet logic, which is not needed.
> - Move under tools/testing.
> - Create ksft for the benchmark.
> - Changed slightly how the benchmark gets built. Out of tree, time_bench
>    is built as an independent .ko; here it is included in
>    bench_page_pool.ko.
> 
> Steps to run:
> 
> ```
> mkdir -p /tmp/run-pp-bench
> make -C ./tools/testing/selftests/net/bench
> make -C ./tools/testing/selftests/net/bench install INSTALL_PATH=/tmp/run-pp-bench
> rsync --delete -avz --progress /tmp/run-pp-bench mina@...RVER:~/
> ssh mina@...RVER << EOF
>    cd ~/run-pp-bench && sudo ./test_bench_page_pool.sh
> EOF
> ```
> 
> Output:
> 
> ```
> (benchmark dmesg logs)
>

Something is off with these benchmark numbers compared to the out-of-tree (OOT) version.

Adding my numbers below; they were run on my testlab with:
  - CPU E5-1650 v4 @ 3.60GHz
  - kernel: net.git v6.15-12438-gd9816ec74e6d

> Fast path results:
> no-softirq-page_pool01 Per elem: 11 cycles(tsc) 4.368 ns
> 

The fast-path on your CPU (11 cycles(tsc), 4.368 ns) is faster than on my
CPU (22 cycles(tsc), 6.128 ns). What CPU is this?
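
For an apples-to-apples comparison, something as simple as the following
captures the CPU model and kernel version (assuming lscpu is available on
the test box):

  # lscpu | grep 'Model name'
  # uname -r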

Type:no-softirq-page_pool01 Per elem: 22 cycles(tsc) 6.128 ns (step:0)
  - (measurement period time:0.061282924 sec time_interval:61282924)
  - (invoke count:10000000 tsc_interval:220619745)

> ptr_ring results:
> no-softirq-page_pool02 Per elem: 527 cycles(tsc) 195.187 ns

I'm surprised that the ptr_ring benchmark is so slow compared to my
result below: 60 cycles(tsc) 16.853 ns.

Type:no-softirq-page_pool02 Per elem: 60 cycles(tsc) 16.853 ns (step:0)
  - (measurement period time:0.168535760 sec time_interval:168535760)
  - (invoke count:10000000 tsc_interval:606734160)

Maybe your kernel is compiled with some debug CONFIG option that makes it
slower?
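
A quick way to check for the usual suspects (a hedged example; the exact
option names worth grepping for depend on your tree, but these are the
debug options I would look at first, assuming the running kernel's config
is available under /boot):

  # grep -E 'CONFIG_(DEBUG_SPINLOCK|DEBUG_LOCK_ALLOC|PROVE_LOCKING|DEBUG_PAGEALLOC|PAGE_POISONING|KASAN)=y' /boot/config-$(uname -r)

Any hits here could explain the extra cycles in the locking and
page-allocator paths.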

You can troubleshoot like this:
  - Select only the `no-softirq-page_pool02` test via run_flags=$((2#100)).

  # perf record -g modprobe bench_page_pool_simple run_flags=$((2#100)) loops=$((100*10**6))
  # perf report --no-children
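
If you are testing the in-tree build instead (where, per the Makefile
below, everything is linked into bench_page_pool.ko), a rough equivalent
is the sketch below; I am assuming it is run from
tools/testing/selftests/net/bench/.

  # insmod page_pool/bench_page_pool.ko run_flags=$((2#100)) loops=$((100*10**6))
  # dmesg | grep 'no-softirq-page_pool02'
  # rmmod bench_page_pool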

> slow path results:
> no-softirq-page_pool03 Per elem: 549 cycles(tsc) 203.466 ns


Type:no-softirq-page_pool03 Per elem: 265 cycles(tsc) 73.674 ns (step:0)
  - (measurement period time:0.736740796 sec time_interval:736740796)
  - (invoke count:10000000 tsc_interval:2652295113)


--Jesper


> ```
> 
> Cc: Jesper Dangaard Brouer <hawk@...nel.org>
> Cc: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> Cc: Jakub Kicinski <kuba@...nel.org>
> Cc: Toke Høiland-Jørgensen <toke@...e.dk>
> 
> Signed-off-by: Mina Almasry <almasrymina@...gle.com>
> Acked-by: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> Signed-off-by: Jesper Dangaard Brouer <hawk@...nel.org>
> 
> ---
> 
> v4: https://lore.kernel.org/netdev/20250614100853.3f2372f2@kernel.org/
> 
> - Fix more checkpatch and coccicheck issues (Jakub)
> 
> v3:
> - Non RFC
> - Collect Signed-off-by from Jesper and Acked-by Ilias.
> - Move test_bench_page_pool.sh to address nipa complaint.
> - Remove `static inline` in .c files to address nipa complaint.
> 
> v2:
> - Move under tools/selftests (Jakub)
> - Create ksft for it.
> - Remove the tasklet logic no longer needed (Jesper + Toke)
> 
> RFC discussion points:
> - Desirable to import it?
> - Can the benchmark be imported as-is for an initial version? Or needs
>    lots of modifications?
>    - Code location. I retained the location in Jesper's tree, but a path
>      like net/core/bench/ may make more sense.
> 
> ---
>   tools/testing/selftests/net/bench/Makefile    |   7 +
>   .../selftests/net/bench/page_pool/Makefile    |  17 +
>   .../bench/page_pool/bench_page_pool_simple.c  | 276 ++++++++++++
>   .../net/bench/page_pool/time_bench.c          | 394 ++++++++++++++++++
>   .../net/bench/page_pool/time_bench.h          | 238 +++++++++++
>   .../net/bench/test_bench_page_pool.sh         |  32 ++
>   6 files changed, 964 insertions(+)
>   create mode 100644 tools/testing/selftests/net/bench/Makefile
>   create mode 100644 tools/testing/selftests/net/bench/page_pool/Makefile
>   create mode 100644 tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
>   create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.c
>   create mode 100644 tools/testing/selftests/net/bench/page_pool/time_bench.h
>   create mode 100755 tools/testing/selftests/net/bench/test_bench_page_pool.sh
> 
> diff --git a/tools/testing/selftests/net/bench/Makefile b/tools/testing/selftests/net/bench/Makefile
> new file mode 100644
> index 000000000000..2546c45e42f7
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +TEST_GEN_MODS_DIR := page_pool
> +
> +TEST_PROGS += test_bench_page_pool.sh
> +
> +include ../../lib.mk
> diff --git a/tools/testing/selftests/net/bench/page_pool/Makefile b/tools/testing/selftests/net/bench/page_pool/Makefile
> new file mode 100644
> index 000000000000..0549a16ba275
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/page_pool/Makefile
> @@ -0,0 +1,17 @@
> +BENCH_PAGE_POOL_SIMPLE_TEST_DIR := $(realpath $(dir $(abspath $(lastword $(MAKEFILE_LIST)))))
> +KDIR ?= /lib/modules/$(shell uname -r)/build
> +
> +ifeq ($(V),1)
> +Q =
> +else
> +Q = @
> +endif
> +
> +obj-m	+= bench_page_pool.o
> +bench_page_pool-y += bench_page_pool_simple.o time_bench.o
> +
> +all:
> +	+$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) modules
> +
> +clean:
> +	+$(Q)make -C $(KDIR) M=$(BENCH_PAGE_POOL_SIMPLE_TEST_DIR) clean
> diff --git a/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
> new file mode 100644
> index 000000000000..f183d5e30dc6
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/page_pool/bench_page_pool_simple.c
> @@ -0,0 +1,276 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Benchmark module for page_pool.
> + *
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +
> +#include <linux/version.h>
> +#include <net/page_pool/helpers.h>
> +
> +#include <linux/interrupt.h>
> +#include <linux/limits.h>
> +
> +#include "time_bench.h"
> +
> +static int verbose = 1;
> +#define MY_POOL_SIZE 1024
> +
> +static void _page_pool_put_page(struct page_pool *pool, struct page *page,
> +				bool allow_direct)
> +{
> +	page_pool_put_page(pool, page, -1, allow_direct);
> +}
> +
> +/* Makes tests selectable. Useful for perf-record to analyze a single test.
> + * Hint: Bash shells support writing binary numbers like: $((2#101010))
> + *
> + * # modprobe bench_page_pool_simple run_flags=$((2#100))
> + */
> +static unsigned long run_flags = 0xFFFFFFFF;
> +module_param(run_flags, ulong, 0);
> +MODULE_PARM_DESC(run_flags, "Limit which bench test that runs");
> +
> +/* Count the bit number from the enum */
> +enum benchmark_bit {
> +	bit_run_bench_baseline,
> +	bit_run_bench_no_softirq01,
> +	bit_run_bench_no_softirq02,
> +	bit_run_bench_no_softirq03,
> +};
> +
> +#define bit(b)		(1 << (b))
> +#define enabled(b)	((run_flags & (bit(b))))
> +
> +/* notice time_bench is limited to U32_MAX nr loops */
> +static unsigned long loops = 10000000;
> +module_param(loops, ulong, 0);
> +MODULE_PARM_DESC(loops, "Specify loops bench will run");
> +
> +/* Timing at the nanosec level, we need to know the overhead
> + * introduced by the for loop itself
> + */
> +static int time_bench_for_loop(struct time_bench_record *rec, void *data)
> +{
> +	uint64_t loops_cnt = 0;
> +	int i;
> +
> +	time_bench_start(rec);
> +	/** Loop to measure **/
> +	for (i = 0; i < rec->loops; i++) {
> +		loops_cnt++;
> +		barrier(); /* prevent the compiler from optimizing out this loop */
> +	}
> +	time_bench_stop(rec, loops_cnt);
> +	return loops_cnt;
> +}
> +
> +static int time_bench_atomic_inc(struct time_bench_record *rec, void *data)
> +{
> +	uint64_t loops_cnt = 0;
> +	atomic_t cnt;
> +	int i;
> +
> +	atomic_set(&cnt, 0);
> +
> +	time_bench_start(rec);
> +	/** Loop to measure **/
> +	for (i = 0; i < rec->loops; i++) {
> +		atomic_inc(&cnt);
> +		barrier(); /* prevent the compiler from optimizing out this loop */
> +	}
> +	loops_cnt = atomic_read(&cnt);
> +	time_bench_stop(rec, loops_cnt);
> +	return loops_cnt;
> +}
> +
> +/* The ptr_ring in page_pool uses a spinlock. We need to know the minimum
> + * overhead of taking+releasing a spinlock, to know the cycles that can be saved
> + * by e.g. amortizing this via bulking.
> + */
> +static int time_bench_lock(struct time_bench_record *rec, void *data)
> +{
> +	uint64_t loops_cnt = 0;
> +	spinlock_t lock;
> +	int i;
> +
> +	spin_lock_init(&lock);
> +
> +	time_bench_start(rec);
> +	/** Loop to measure **/
> +	for (i = 0; i < rec->loops; i++) {
> +		spin_lock(&lock);
> +		loops_cnt++;
> +		barrier(); /* prevent the compiler from optimizing out this loop */
> +		spin_unlock(&lock);
> +	}
> +	time_bench_stop(rec, loops_cnt);
> +	return loops_cnt;
> +}
> +
> +/* Helper for filling some pages into the ptr_ring */
> +static void pp_fill_ptr_ring(struct page_pool *pp, int elems)
> +{
> +	/* GFP_ATOMIC needed when running under softirq */
> +	gfp_t gfp_mask = GFP_ATOMIC;
> +	struct page **array;
> +	int i;
> +
> +	array = kcalloc(elems, sizeof(struct page *), gfp_mask);
> +
> +	for (i = 0; i < elems; i++)
> +		array[i] = page_pool_alloc_pages(pp, gfp_mask);
> +	for (i = 0; i < elems; i++)
> +		_page_pool_put_page(pp, array[i], false);
> +
> +	kfree(array);
> +}
> +
> +enum test_type { type_fast_path, type_ptr_ring, type_page_allocator };
> +
> +/* Depends on the compiler optimizing this function */
> +static int time_bench_page_pool(struct time_bench_record *rec, void *data,
> +				enum test_type type, const char *func)
> +{
> +	uint64_t loops_cnt = 0;
> +	gfp_t gfp_mask = GFP_ATOMIC; /* GFP_ATOMIC is not really needed */
> +	int i, err;
> +
> +	struct page_pool *pp;
> +	struct page *page;
> +
> +	struct page_pool_params pp_params = {
> +		.order = 0,
> +		.flags = 0,
> +		.pool_size = MY_POOL_SIZE,
> +		.nid = NUMA_NO_NODE,
> +		.dev = NULL, /* Only use for DMA mapping */
> +		.dma_dir = DMA_BIDIRECTIONAL,
> +	};
> +
> +	pp = page_pool_create(&pp_params);
> +	if (IS_ERR(pp)) {
> +		err = PTR_ERR(pp);
> +		pr_warn("%s: Error(%d) creating page_pool\n", func, err);
> +		goto out;
> +	}
> +	pp_fill_ptr_ring(pp, 64);
> +
> +	if (in_serving_softirq())
> +		pr_warn("%s(): in_serving_softirq fast-path\n", func);
> +	else
> +		pr_warn("%s(): Cannot use page_pool fast-path\n", func);
> +
> +	time_bench_start(rec);
> +	/** Loop to measure **/
> +	for (i = 0; i < rec->loops; i++) {
> +		/* Common fast-path alloc that depends on in_serving_softirq() */
> +		page = page_pool_alloc_pages(pp, gfp_mask);
> +		if (!page)
> +			break;
> +		loops_cnt++;
> +		barrier(); /* prevent the compiler from optimizing out this loop */
> +
> +		/* The benchmark's purpose is to test different return paths.
> +		 * The compiler should inline and optimize out the other function calls.
> +		 */
> +		if (type == type_fast_path) {
> +			/* Fast-path recycling e.g. XDP_DROP use-case */
> +			page_pool_recycle_direct(pp, page);
> +
> +		} else if (type == type_ptr_ring) {
> +			/* Normal return path */
> +			_page_pool_put_page(pp, page, false);
> +
> +		} else if (type == type_page_allocator) {
> +			/* Test when pages are not recycled, but instead
> +			 * are returned to the system's page allocator
> +			 */
> +			get_page(page); /* cause no-recycling */
> +			_page_pool_put_page(pp, page, false);
> +			put_page(page);
> +		} else {
> +			BUILD_BUG();
> +		}
> +	}
> +	time_bench_stop(rec, loops_cnt);
> +out:
> +	page_pool_destroy(pp);
> +	return loops_cnt;
> +}
> +
> +static int time_bench_page_pool01_fast_path(struct time_bench_record *rec,
> +					    void *data)
> +{
> +	return time_bench_page_pool(rec, data, type_fast_path, __func__);
> +}
> +
> +static int time_bench_page_pool02_ptr_ring(struct time_bench_record *rec,
> +					   void *data)
> +{
> +	return time_bench_page_pool(rec, data, type_ptr_ring, __func__);
> +}
> +
> +static int time_bench_page_pool03_slow(struct time_bench_record *rec,
> +				       void *data)
> +{
> +	return time_bench_page_pool(rec, data, type_page_allocator, __func__);
> +}
> +
> +static int run_benchmark_tests(void)
> +{
> +	uint32_t nr_loops = loops;
> +
> +	/* Baseline tests */
> +	if (enabled(bit_run_bench_baseline)) {
> +		time_bench_loop(nr_loops * 10, 0, "for_loop", NULL,
> +				time_bench_for_loop);
> +		time_bench_loop(nr_loops * 10, 0, "atomic_inc", NULL,
> +				time_bench_atomic_inc);
> +		time_bench_loop(nr_loops, 0, "lock", NULL, time_bench_lock);
> +	}
> +
> +	/* This test cannot activate the correct code path, due to no-softirq ctx */
> +	if (enabled(bit_run_bench_no_softirq01))
> +		time_bench_loop(nr_loops, 0, "no-softirq-page_pool01", NULL,
> +				time_bench_page_pool01_fast_path);
> +	if (enabled(bit_run_bench_no_softirq02))
> +		time_bench_loop(nr_loops, 0, "no-softirq-page_pool02", NULL,
> +				time_bench_page_pool02_ptr_ring);
> +	if (enabled(bit_run_bench_no_softirq03))
> +		time_bench_loop(nr_loops, 0, "no-softirq-page_pool03", NULL,
> +				time_bench_page_pool03_slow);
> +
> +	return 0;
> +}
> +
> +static int __init bench_page_pool_simple_module_init(void)
> +{
> +	if (verbose)
> +		pr_info("Loaded\n");
> +
> +	if (loops > U32_MAX) {
> +		pr_err("Module param loops(%lu) exceeded U32_MAX(%u)\n", loops,
> +		       U32_MAX);
> +		return -ECHRNG;
> +	}
> +
> +	run_benchmark_tests();
> +
> +	return 0;
> +}
> +module_init(bench_page_pool_simple_module_init);
> +
> +static void __exit bench_page_pool_simple_module_exit(void)
> +{
> +	if (verbose)
> +		pr_info("Unloaded\n");
> +}
> +module_exit(bench_page_pool_simple_module_exit);
> +
> +MODULE_DESCRIPTION("Benchmark of page_pool simple cases");
> +MODULE_AUTHOR("Jesper Dangaard Brouer <netoptimizer@...uer.com>");
> +MODULE_LICENSE("GPL");
> diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.c b/tools/testing/selftests/net/bench/page_pool/time_bench.c
> new file mode 100644
> index 000000000000..073bb36ec5f2
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.c
> @@ -0,0 +1,394 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Benchmarking code execution time inside the kernel
> + *
> + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
> + */
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/module.h>
> +#include <linux/time.h>
> +
> +#include <linux/perf_event.h> /* perf_event_create_kernel_counter() */
> +
> +/* For concurrency testing */
> +#include <linux/completion.h>
> +#include <linux/sched.h>
> +#include <linux/workqueue.h>
> +#include <linux/kthread.h>
> +
> +#include "time_bench.h"
> +
> +static int verbose = 1;
> +
> +/** TSC (Time-Stamp Counter) based **
> + * See: linux/time_bench.h
> + *  tsc_start_clock() and tsc_stop_clock()
> + */
> +
> +/** Wall-clock based **
> + */
> +
> +/** PMU (Performance Monitor Unit) based **
> + */
> +#define PERF_FORMAT                                                            \
> +	(PERF_FORMAT_GROUP | PERF_FORMAT_ID | PERF_FORMAT_TOTAL_TIME_ENABLED | \
> +	 PERF_FORMAT_TOTAL_TIME_RUNNING)
> +
> +struct raw_perf_event {
> +	uint64_t config; /* event */
> +	uint64_t config1; /* umask */
> +	struct perf_event *save;
> +	char *desc;
> +};
> +
> +/* If HT is enabled, a maximum of 4 events (5 if one is instructions
> + * retired) can be specified; if HT is disabled, a maximum of 8 (9 if
> + * one is instructions retired) can be specified.
> + *
> + * From Table 19-1. Architectural Performance Events
> + * Architectures Software Developer’s Manual Volume 3: System Programming
> + * Guide
> + */
> +struct raw_perf_event perf_events[] = {
> +	{ 0x3c, 0x00, NULL, "Unhalted CPU Cycles" },
> +	{ 0xc0, 0x00, NULL, "Instruction Retired" }
> +};
> +
> +#define NUM_EVTS (ARRAY_SIZE(perf_events))
> +
> +/* WARNING: PMU config is currently broken!
> + */
> +bool time_bench_PMU_config(bool enable)
> +{
> +	int i;
> +	struct perf_event_attr perf_conf;
> +	struct perf_event *perf_event;
> +	int cpu;
> +
> +	preempt_disable();
> +	cpu = smp_processor_id();
> +	pr_info("DEBUG: cpu:%d\n", cpu);
> +	preempt_enable();
> +
> +	memset(&perf_conf, 0, sizeof(struct perf_event_attr));
> +	perf_conf.type           = PERF_TYPE_RAW;
> +	perf_conf.size           = sizeof(struct perf_event_attr);
> +	perf_conf.read_format    = PERF_FORMAT;
> +	perf_conf.pinned         = 1;
> +	perf_conf.exclude_user   = 1; /* No userspace events */
> +	perf_conf.exclude_kernel = 0; /* Only kernel events */
> +
> +	for (i = 0; i < NUM_EVTS; i++) {
> +		perf_conf.disabled = enable;
> +		//perf_conf.disabled = (i == 0) ? 1 : 0;
> +		perf_conf.config   = perf_events[i].config;
> +		perf_conf.config1  = perf_events[i].config1;
> +		if (verbose)
> +			pr_info("%s() enable PMU counter: %s\n",
> +				__func__, perf_events[i].desc);
> +		perf_event = perf_event_create_kernel_counter(&perf_conf, cpu,
> +							      NULL /* task */,
> +							      NULL /* overflow_handler*/,
> +							      NULL /* context */);
> +		if (perf_event) {
> +			perf_events[i].save = perf_event;
> +			pr_info("%s():DEBUG perf_event success\n", __func__);
> +
> +			perf_event_enable(perf_event);
> +		} else {
> +			pr_info("%s():DEBUG perf_event is NULL\n", __func__);
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/** Generic functions **
> + */
> +
> +/* Calculate stats, store results in record */
> +bool time_bench_calc_stats(struct time_bench_record *rec)
> +{
> +#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
> +	uint64_t ns_per_call_tmp_rem = 0;
> +	uint32_t ns_per_call_remainder = 0;
> +	uint64_t pmc_ipc_tmp_rem = 0;
> +	uint32_t pmc_ipc_remainder = 0;
> +	uint32_t pmc_ipc_div = 0;
> +	uint32_t invoked_cnt_precision = 0;
> +	uint32_t invoked_cnt = 0; /* 32-bit due to div_u64_rem() */
> +
> +	if (rec->flags & TIME_BENCH_LOOP) {
> +		if (rec->invoked_cnt < 1000) {
> +			pr_err("ERR: need more(>1000) loops(%llu) for timing\n",
> +			       rec->invoked_cnt);
> +			return false;
> +		}
> +		if (rec->invoked_cnt > ((1ULL << 32) - 1)) {
> +			/* div_u64_rem() can only support div with 32bit*/
> +			pr_err("ERR: Invoke cnt(%llu) too big overflow 32bit\n",
> +			       rec->invoked_cnt);
> +			return false;
> +		}
> +		invoked_cnt = (uint32_t)rec->invoked_cnt;
> +	}
> +
> +	/* TSC (Time-Stamp Counter) records */
> +	if (rec->flags & TIME_BENCH_TSC) {
> +		rec->tsc_interval = rec->tsc_stop - rec->tsc_start;
> +		if (rec->tsc_interval == 0) {
> +			pr_err("ABORT: timing took ZERO TSC time\n");
> +			return false;
> +		}
> +		/* Calculate stats */
> +		if (rec->flags & TIME_BENCH_LOOP)
> +			rec->tsc_cycles = rec->tsc_interval / invoked_cnt;
> +		else
> +			rec->tsc_cycles = rec->tsc_interval;
> +	}
> +
> +	/* Wall-clock time calc */
> +	if (rec->flags & TIME_BENCH_WALLCLOCK) {
> +		rec->time_start = rec->ts_start.tv_nsec +
> +				  (NANOSEC_PER_SEC * rec->ts_start.tv_sec);
> +		rec->time_stop = rec->ts_stop.tv_nsec +
> +				 (NANOSEC_PER_SEC * rec->ts_stop.tv_sec);
> +		rec->time_interval = rec->time_stop - rec->time_start;
> +		if (rec->time_interval == 0) {
> +			pr_err("ABORT: timing took ZERO wallclock time\n");
> +			return false;
> +		}
> +		/* Calculate stats */
> +		/*** Division in the kernel is tricky ***/
> +		/* Orig: time_sec = (time_interval / NANOSEC_PER_SEC); */
> +		/* remainder only correct because NANOSEC_PER_SEC is 10^9 */
> +		rec->time_sec = div_u64_rem(rec->time_interval, NANOSEC_PER_SEC,
> +					    &rec->time_sec_remainder);
> +		//TODO: use existing struct timespec records instead of div?
> +
> +		if (rec->flags & TIME_BENCH_LOOP) {
> +			/*** Division in the kernel is tricky ***/
> +			/* Orig: ns = ((double)time_interval / invoked_cnt); */
> +			/* First get quotient */
> +			rec->ns_per_call_quotient =
> +				div_u64_rem(rec->time_interval, invoked_cnt,
> +					    &ns_per_call_remainder);
> +			/* Now get decimals .xxx precision (incorrect roundup)*/
> +			ns_per_call_tmp_rem = ns_per_call_remainder;
> +			invoked_cnt_precision = invoked_cnt / 1000;
> +			if (invoked_cnt_precision > 0) {
> +				rec->ns_per_call_decimal =
> +					div_u64_rem(ns_per_call_tmp_rem,
> +						    invoked_cnt_precision,
> +						    &ns_per_call_remainder);
> +			}
> +		}
> +	}
> +
> +	/* Performance Monitor Unit (PMU) counters */
> +	if (rec->flags & TIME_BENCH_PMU) {
> +		//FIXME: Overflow handling???
> +		rec->pmc_inst = rec->pmc_inst_stop - rec->pmc_inst_start;
> +		rec->pmc_clk = rec->pmc_clk_stop - rec->pmc_clk_start;
> +
> +		/* Calc Instruction Per Cycle (IPC) */
> +		/* First get quotient */
> +		rec->pmc_ipc_quotient = div_u64_rem(rec->pmc_inst, rec->pmc_clk,
> +						    &pmc_ipc_remainder);
> +		/* Now get decimals .xxx precision (incorrect roundup)*/
> +		pmc_ipc_tmp_rem = pmc_ipc_remainder;
> +		pmc_ipc_div = rec->pmc_clk / 1000;
> +		if (pmc_ipc_div > 0) {
> +			rec->pmc_ipc_decimal = div_u64_rem(pmc_ipc_tmp_rem,
> +							   pmc_ipc_div,
> +							   &pmc_ipc_remainder);
> +		}
> +	}
> +
> +	return true;
> +}
> +
> +/* Generic function for invoking a loop function and calculating
> + * execution time stats.  The function being called/timed is assumed
> + * to perform a tight loop, and update the timing record struct.
> + */
> +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data,
> +		     int (*func)(struct time_bench_record *record, void *data))
> +{
> +	struct time_bench_record rec;
> +
> +	/* Setup record */
> +	memset(&rec, 0, sizeof(rec)); /* zero func might not update all */
> +	rec.version_abi = 1;
> +	rec.loops       = loops;
> +	rec.step        = step;
> +	rec.flags       = (TIME_BENCH_LOOP | TIME_BENCH_TSC | TIME_BENCH_WALLCLOCK);
> +
> +	/*** Loop function being timed ***/
> +	if (!func(&rec, data)) {
> +		pr_err("ABORT: function being timed failed\n");
> +		return false;
> +	}
> +
> +	if (rec.invoked_cnt < loops)
> +		pr_warn("WARNING: Invoke count(%llu) smaller than loops(%d)\n",
> +			rec.invoked_cnt, loops);
> +
> +	/* Calculate stats */
> +	time_bench_calc_stats(&rec);
> +
> +	pr_info("Type:%s Per elem: %llu cycles(tsc) %llu.%03llu ns (step:%d) - (measurement period time:%llu.%09u sec time_interval:%llu) - (invoke count:%llu tsc_interval:%llu)\n",
> +		txt, rec.tsc_cycles, rec.ns_per_call_quotient,
> +		rec.ns_per_call_decimal, rec.step, rec.time_sec,
> +		rec.time_sec_remainder, rec.time_interval, rec.invoked_cnt,
> +		rec.tsc_interval);
> +	if (rec.flags & TIME_BENCH_PMU)
> +		pr_info("Type:%s PMU inst/clock%llu/%llu = %llu.%03llu IPC (inst per cycle)\n",
> +			txt, rec.pmc_inst, rec.pmc_clk, rec.pmc_ipc_quotient,
> +			rec.pmc_ipc_decimal);
> +	return true;
> +}
> +
> +/* Function getting invoked by kthread */
> +static int invoke_test_on_cpu_func(void *private)
> +{
> +	struct time_bench_cpu *cpu = private;
> +	struct time_bench_sync *sync = cpu->sync;
> +	cpumask_t newmask = CPU_MASK_NONE;
> +	void *data = cpu->data;
> +
> +	/* Restrict CPU */
> +	cpumask_set_cpu(cpu->rec.cpu, &newmask);
> +	set_cpus_allowed_ptr(current, &newmask);
> +
> +	/* Synchronize start of concurrency test */
> +	atomic_inc(&sync->nr_tests_running);
> +	wait_for_completion(&sync->start_event);
> +
> +	/* Start benchmark function */
> +	if (!cpu->bench_func(&cpu->rec, data)) {
> +		pr_err("ERROR: function being timed failed on CPU:%d(%d)\n",
> +		       cpu->rec.cpu, smp_processor_id());
> +	} else {
> +		if (verbose)
> +			pr_info("SUCCESS: ran on CPU:%d(%d)\n", cpu->rec.cpu,
> +				smp_processor_id());
> +	}
> +	cpu->did_bench_run = true;
> +
> +	/* End test */
> +	atomic_dec(&sync->nr_tests_running);
> +	/*  Wait for kthread_stop() telling us to stop */
> +	while (!kthread_should_stop()) {
> +		set_current_state(TASK_INTERRUPTIBLE);
> +		schedule();
> +	}
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +
> +void time_bench_print_stats_cpumask(const char *desc,
> +				    struct time_bench_cpu *cpu_tasks,
> +				    const struct cpumask *mask)
> +{
> +	uint64_t average = 0;
> +	int cpu;
> +	int step = 0;
> +	struct sum {
> +		uint64_t tsc_cycles;
> +		int records;
> +	} sum = { 0 };
> +
> +	/* Get stats */
> +	for_each_cpu(cpu, mask) {
> +		struct time_bench_cpu *c = &cpu_tasks[cpu];
> +		struct time_bench_record *rec = &c->rec;
> +
> +		/* Calculate stats */
> +		time_bench_calc_stats(rec);
> +
> +		pr_info("Type:%s CPU(%d) %llu cycles(tsc) %llu.%03llu ns (step:%d) - (measurement period time:%llu.%09u sec time_interval:%llu) - (invoke count:%llu tsc_interval:%llu)\n",
> +			desc, cpu, rec->tsc_cycles, rec->ns_per_call_quotient,
> +			rec->ns_per_call_decimal, rec->step, rec->time_sec,
> +			rec->time_sec_remainder, rec->time_interval,
> +			rec->invoked_cnt, rec->tsc_interval);
> +
> +		/* Collect average */
> +		sum.records++;
> +		sum.tsc_cycles += rec->tsc_cycles;
> +		step = rec->step;
> +	}
> +
> +	if (sum.records) /* avoid div-by-zero */
> +		average = sum.tsc_cycles / sum.records;
> +	pr_info("Sum Type:%s Average: %llu cycles(tsc) CPUs:%d step:%d\n", desc,
> +		average, sum.records, step);
> +}
> +
> +void time_bench_run_concurrent(uint32_t loops, int step, void *data,
> +			       const struct cpumask *mask, /* Support masking out some CPUs */
> +			       struct time_bench_sync *sync,
> +			       struct time_bench_cpu *cpu_tasks,
> +			       int (*func)(struct time_bench_record *record, void *data))
> +{
> +	int cpu, running = 0;
> +
> +	if (verbose) // DEBUG
> +		pr_warn("%s() Started on CPU:%d\n", __func__,
> +			smp_processor_id());
> +
> +	/* Reset sync conditions */
> +	atomic_set(&sync->nr_tests_running, 0);
> +	init_completion(&sync->start_event);
> +
> +	/* Spawn off jobs on all CPUs */
> +	for_each_cpu(cpu, mask) {
> +		struct time_bench_cpu *c = &cpu_tasks[cpu];
> +
> +		running++;
> +		c->sync = sync; /* Send sync variable along */
> +		c->data = data; /* Send opaque along */
> +
> +		/* Init benchmark record */
> +		memset(&c->rec, 0, sizeof(struct time_bench_record));
> +		c->rec.version_abi = 1;
> +		c->rec.loops       = loops;
> +		c->rec.step        = step;
> +		c->rec.flags       = (TIME_BENCH_LOOP | TIME_BENCH_TSC |
> +				      TIME_BENCH_WALLCLOCK);
> +		c->rec.cpu = cpu;
> +		c->bench_func = func;
> +		c->task = kthread_run(invoke_test_on_cpu_func, c,
> +				      "time_bench%d", cpu);
> +		if (IS_ERR(c->task)) {
> +			pr_err("%s(): Failed to start test func\n", __func__);
> +			return; /* Argh, what about cleanup?! */
> +		}
> +	}
> +
> +	/* Wait until all processes are running */
> +	while (atomic_read(&sync->nr_tests_running) < running) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		schedule_timeout(10);
> +	}
> +	/* Kick off all CPU concurrently on completion event */
> +	complete_all(&sync->start_event);
> +
> +	/* Wait for CPUs to finish */
> +	while (atomic_read(&sync->nr_tests_running)) {
> +		set_current_state(TASK_UNINTERRUPTIBLE);
> +		schedule_timeout(10);
> +	}
> +
> +	/* Stop the kthreads */
> +	for_each_cpu(cpu, mask) {
> +		struct time_bench_cpu *c = &cpu_tasks[cpu];
> +
> +		kthread_stop(c->task);
> +	}
> +
> +	if (verbose) // DEBUG - happens often, finish on another CPU
> +		pr_warn("%s() Finished on CPU:%d\n", __func__,
> +			smp_processor_id());
> +}
> diff --git a/tools/testing/selftests/net/bench/page_pool/time_bench.h b/tools/testing/selftests/net/bench/page_pool/time_bench.h
> new file mode 100644
> index 000000000000..e113fcf341dc
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/page_pool/time_bench.h
> @@ -0,0 +1,238 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Benchmarking code execution time inside the kernel
> + *
> + * Copyright (C) 2014, Red Hat, Inc., Jesper Dangaard Brouer
> + *  for licensing details see kernel-base/COPYING
> + */
> +#ifndef _LINUX_TIME_BENCH_H
> +#define _LINUX_TIME_BENCH_H
> +
> +/* Main structure used for recording a benchmark run */
> +struct time_bench_record {
> +	uint32_t version_abi;
> +	uint32_t loops;		/* Requested loop invocations */
> +	uint32_t step;		/* option for e.g. bulk invocations */
> +
> +	uint32_t flags;		/* Measurements types enabled */
> +#define TIME_BENCH_LOOP		BIT(0)
> +#define TIME_BENCH_TSC		BIT(1)
> +#define TIME_BENCH_WALLCLOCK	BIT(2)
> +#define TIME_BENCH_PMU		BIT(3)
> +
> +	uint32_t cpu; /* Used when embedded in time_bench_cpu */
> +
> +	/* Records */
> +	uint64_t invoked_cnt;	/* Returned actual invocations */
> +	uint64_t tsc_start;
> +	uint64_t tsc_stop;
> +	struct timespec64 ts_start;
> +	struct timespec64 ts_stop;
> +	/* PMU counters for instruction and cycles
> +	 * instructions counter including pipelined instructions
> +	 */
> +	uint64_t pmc_inst_start;
> +	uint64_t pmc_inst_stop;
> +	/* CPU unhalted clock counter */
> +	uint64_t pmc_clk_start;
> +	uint64_t pmc_clk_stop;
> +
> +	/* Result records */
> +	uint64_t tsc_interval;
> +	uint64_t time_start, time_stop, time_interval; /* in nanosec */
> +	uint64_t pmc_inst, pmc_clk;
> +
> +	/* Derived result records */
> +	uint64_t tsc_cycles; // +decimal?
> +	uint64_t ns_per_call_quotient, ns_per_call_decimal;
> +	uint64_t time_sec;
> +	uint32_t time_sec_remainder;
> +	uint64_t pmc_ipc_quotient, pmc_ipc_decimal; /* inst per cycle */
> +};
> +
> +/* For synchronizing parallel CPUs to run concurrently */
> +struct time_bench_sync {
> +	atomic_t nr_tests_running;
> +	struct completion start_event;
> +};
> +
> +/* Keep track of CPUs executing our bench function.
> + *
> + * Embed a time_bench_record for storing info per cpu
> + */
> +struct time_bench_cpu {
> +	struct time_bench_record rec;
> +	struct time_bench_sync *sync; /* back ptr */
> +	struct task_struct *task;
> +	/* "data" opaque could have been placed in time_bench_sync,
> +	 * but to avoid any false sharing, place it per CPU
> +	 */
> +	void *data;
> +	/* Support masking out some CPUs; mark if it ran */
> +	bool did_bench_run;
> +	/* int cpu; // note CPU stored in time_bench_record */
> +	int (*bench_func)(struct time_bench_record *record, void *data);
> +};
> +
> +/*
> + * The TSC assembler code below is not compatible with other archs, and
> + * can also fail on guests if cpu-flags are not correct.
> + *
> + * The way TSC reading is used, many iterations, does not require as
> + * high accuracy as described below (in Intel Doc #324264).
> + *
> + * Considering changing to use get_cycles() (#include <asm/timex.h>).
> + */
> +
> +/** TSC (Time-Stamp Counter) based **
> + * Recommend reading, to understand details of reading TSC accurately:
> + *  Intel Doc #324264, "How to Benchmark Code Execution Times on Intel"
> + *
> + * Consider getting exclusive ownership of CPU by using:
> + *   unsigned long flags;
> + *   preempt_disable();
> + *   raw_local_irq_save(flags);
> + *   _your_code_
> + *   raw_local_irq_restore(flags);
> + *   preempt_enable();
> + *
> + * Clobbered registers: "%rax", "%rbx", "%rcx", "%rdx"
> + *  RDTSC only change "%rax" and "%rdx" but
> + *  CPUID clears the high 32-bits of all (rax/rbx/rcx/rdx)
> + */
> +static __always_inline uint64_t tsc_start_clock(void)
> +{
> +	/* See: Intel Doc #324264 */
> +	unsigned int hi, lo;
> +
> +	asm volatile("CPUID\n\t"
> +		     "RDTSC\n\t"
> +		     "mov %%edx, %0\n\t"
> +		     "mov %%eax, %1\n\t"
> +		     : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx");
> +	//FIXME: on 32bit use clobbered %eax + %edx
> +	return ((uint64_t)lo) | (((uint64_t)hi) << 32);
> +}
> +
> +static __always_inline uint64_t tsc_stop_clock(void)
> +{
> +	/* See: Intel Doc #324264 */
> +	unsigned int hi, lo;
> +
> +	asm volatile("RDTSCP\n\t"
> +		     "mov %%edx, %0\n\t"
> +		     "mov %%eax, %1\n\t"
> +		     "CPUID\n\t"
> +		     : "=r"(hi), "=r"(lo)::"%rax", "%rbx", "%rcx", "%rdx");
> +	return ((uint64_t)lo) | (((uint64_t)hi) << 32);
> +}
> +
> +/** Wall-clock based **
> + *
> + * use: getnstimeofday()
> + *  getnstimeofday(&rec->ts_start);
> + *  getnstimeofday(&rec->ts_stop);
> + *
> + * API changed see: Documentation/core-api/timekeeping.rst
> + *  https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#c.getnstimeofday
> + *
> + * We should instead use: ktime_get_real_ts64() is a direct
> + *  replacement, but consider using monotonic time (ktime_get_ts64())
> + *  and/or a ktime_t based interface (ktime_get()/ktime_get_real()).
> + */
> +
> +/** PMU (Performance Monitor Unit) based **
> + *
> + * Needed for calculating: Instructions Per Cycle (IPC)
> + * - The IPC number tells how efficiently the CPU pipelining was used
> + */
> +//lookup: perf_event_create_kernel_counter()
> +
> +bool time_bench_PMU_config(bool enable);
> +
> +/* Raw reading via rdpmc() using fixed counters
> + *
> + * From: https://github.com/andikleen/simple-pmu
> + */
> +enum {
> +	FIXED_SELECT = (1U << 30), /* == 0x40000000 */
> +	FIXED_INST_RETIRED_ANY = 0,
> +	FIXED_CPU_CLK_UNHALTED_CORE = 1,
> +	FIXED_CPU_CLK_UNHALTED_REF = 2,
> +};
> +
> +static __always_inline unsigned int long long p_rdpmc(unsigned int in)
> +{
> +	unsigned int d, a;
> +
> +	asm volatile("rdpmc" : "=d"(d), "=a"(a) : "c"(in) : "memory");
> +	return ((unsigned long long)d << 32) | a;
> +}
> +
> +/* These PMU counters need to be enabled, but I don't have the
> + * configuration code implemented.  My current hack is running:
> + *  sudo perf stat -e cycles:k -e instructions:k insmod lib/ring_queue_test.ko
> + */
> +/* Reading all pipelined instruction */
> +static __always_inline unsigned long long pmc_inst(void)
> +{
> +	return p_rdpmc(FIXED_SELECT | FIXED_INST_RETIRED_ANY);
> +}
> +
> +/* Reading CPU clock cycles */
> +static __always_inline unsigned long long pmc_clk(void)
> +{
> +	return p_rdpmc(FIXED_SELECT | FIXED_CPU_CLK_UNHALTED_CORE);
> +}
> +
> +/* Raw reading via MSR rdmsr() is likely wrong
> + * FIXME: How can I know which raw MSR registers are conf for what?
> + */
> +#define MSR_IA32_PCM0 0x400000C1 /* PERFCTR0 */
> +#define MSR_IA32_PCM1 0x400000C2 /* PERFCTR1 */
> +#define MSR_IA32_PCM2 0x400000C3
> +static inline uint64_t msr_inst(unsigned long long *msr_result)
> +{
> +	return rdmsrq_safe(MSR_IA32_PCM0, msr_result);
> +}
> +
> +/** Generic functions **
> + */
> +bool time_bench_loop(uint32_t loops, int step, char *txt, void *data,
> +		     int (*func)(struct time_bench_record *rec, void *data));
> +bool time_bench_calc_stats(struct time_bench_record *rec);
> +
> +void time_bench_run_concurrent(uint32_t loops, int step, void *data,
> +			       const struct cpumask *mask, /* Support masking out some CPUs */
> +			       struct time_bench_sync *sync, struct time_bench_cpu *cpu_tasks,
> +			       int (*func)(struct time_bench_record *record, void *data));
> +void time_bench_print_stats_cpumask(const char *desc,
> +				    struct time_bench_cpu *cpu_tasks,
> +				    const struct cpumask *mask);
> +
> +//FIXME: use rec->flags to select measurement, should be MACRO
> +static __always_inline void time_bench_start(struct time_bench_record *rec)
> +{
> +	//getnstimeofday(&rec->ts_start);
> +	ktime_get_real_ts64(&rec->ts_start);
> +	if (rec->flags & TIME_BENCH_PMU) {
> +		rec->pmc_inst_start = pmc_inst();
> +		rec->pmc_clk_start = pmc_clk();
> +	}
> +	rec->tsc_start = tsc_start_clock();
> +}
> +
> +static __always_inline void time_bench_stop(struct time_bench_record *rec,
> +					    uint64_t invoked_cnt)
> +{
> +	rec->tsc_stop = tsc_stop_clock();
> +	if (rec->flags & TIME_BENCH_PMU) {
> +		rec->pmc_inst_stop = pmc_inst();
> +		rec->pmc_clk_stop = pmc_clk();
> +	}
> +	//getnstimeofday(&rec->ts_stop);
> +	ktime_get_real_ts64(&rec->ts_stop);
> +	rec->invoked_cnt = invoked_cnt;
> +}
> +
> +#endif /* _LINUX_TIME_BENCH_H */
> diff --git a/tools/testing/selftests/net/bench/test_bench_page_pool.sh b/tools/testing/selftests/net/bench/test_bench_page_pool.sh
> new file mode 100755
> index 000000000000..7b8b18cfedce
> --- /dev/null
> +++ b/tools/testing/selftests/net/bench/test_bench_page_pool.sh
> @@ -0,0 +1,32 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +
> +set -e
> +
> +DRIVER="./page_pool/bench_page_pool.ko"
> +result=""
> +
> +function run_test()
> +{
> +	rmmod "bench_page_pool.ko" || true
> +	insmod $DRIVER > /dev/null 2>&1
> +	result=$(dmesg | tail -10)
> +	echo "$result"
> +
> +	echo
> +	echo "Fast path results:"
> +	echo "${result}" | grep -o -E "no-softirq-page_pool01 Per elem: ([0-9]+) cycles\(tsc\) ([0-9]+\.[0-9]+) ns"
> +
> +	echo
> +	echo "ptr_ring results:"
> +	echo "${result}" | grep -o -E "no-softirq-page_pool02 Per elem: ([0-9]+) cycles\(tsc\) ([0-9]+\.[0-9]+) ns"
> +
> +	echo
> +	echo "slow path results:"
> +	echo "${result}" | grep -o -E "no-softirq-page_pool03 Per elem: ([0-9]+) cycles\(tsc\) ([0-9]+\.[0-9]+) ns"
> +}
> +
> +run_test
> +
> +exit 0
> 
> base-commit: 8909f5f4ecd551c2299b28e05254b77424c8c7dc
