Message-ID: <CALALjgxwkRBFd86=kcOjeF4uvhCgB52gG7ka_e16=Qf-3LQiFQ@mail.gmail.com>
Date: Thu, 27 Jan 2022 13:08:54 -0800
From: Joe Damato <jdamato@...tly.com>
To: Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc: netdev@...r.kernel.org, brouer@...hat.com, kuba@...nel.org,
davem@...emloft.net, ilias.apalodimas@...aro.org, hawk@...nel.org
Subject: Re: [PATCH net-next 0/6] net: page_pool: Add page_pool stat counters
On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
<jbrouer@...hat.com> wrote:
>
>
>
> On 26/01/2022 23.48, Joe Damato wrote:
> > Greetings:
> >
> > This series adds some stat counters for the page_pool allocation path which
> > help to track:
> >
> > - fast path allocations
> > - slow path order-0 allocations
> > - slow path high order allocations
> > - refills which failed due to an empty ptr ring, forcing a slow
> > path allocation
> > - allocations fulfilled via successful refill
> > - pages which cannot be added to the cache because of numa mismatch
> > (i.e. waived)
> >
> > Some static inline wrappers are provided for accessing these stats. The
> > intention is that drivers which use the page_pool API can, if they choose,
> > use this stats API.
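
(To make the above concrete: the wrappers I'm proposing are roughly of the
following shape. This is only a sketch -- the exact struct layout, field
names, and helper names may differ slightly in the posted patches.)

struct page_pool_alloc_stats {
	u64 fast;		/* fast path allocations */
	u64 slow;		/* slow path order-0 allocations */
	u64 slow_high_order;	/* slow path high order allocations */
	u64 empty;		/* ptr ring was empty; forced a slow path allocation */
	u64 refill;		/* allocations fulfilled via a successful refill */
	u64 waived;		/* pages not cached because of a NUMA mismatch */
};

/* Example accessor; callers must ensure the pool is not being destroyed. */
static inline u64 page_pool_stats_get_fast(const struct page_pool *pool)
{
	return pool->alloc_stats.fast;
}
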
>
> You are adding (always on) counters to a critical fast path that
> drivers use for the XDP_DROP use-case.

If you would prefer that users explicitly enable these stats, I am happy
to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or similar)
in a v2.
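
Roughly, I would expect the guard to look something like the sketch below
(the option name and helper name are placeholders), so that with the option
disabled the increments compile away and add no fast-path cost:

#ifdef CONFIG_PAGE_POOL_DEBUG
/* Bump one of the per-pool allocation counters. */
#define page_pool_stat_inc(pool, __stat)	\
	(((pool)->alloc_stats.__stat)++)
#else
/* Stats disabled: increments become no-ops. */
#define page_pool_stat_inc(pool, __stat)	do { } while (0)
#endif
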
> I want to see performance measurements as documentation, showing this is
> not causing a slow-down.
>
> I have some performance tests here[1]:
> [1]
> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>
> Look at:
> - bench_page_pool_simple.c and
> - bench_page_pool_cross_cpu.c
>
> How to use + build this[2]:
> [2]
> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html

Thanks for the pointers to the benchmarks.
In general, I noted that the benchmark results varied fairly
substantially between repeated runs on the same system.
Results below suggest that:
- bench_page_pool_simple is faster on the test kernel, and
- bench_page_pool_cross_cpu is faster on the control kernel.
Subsequent runs of bench_page_pool_cross_cpu on the control, however,
reveal *much* slower results than shown below.
Test system:
- 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
- 2 NUMA nodes, with 18 cores per node and 2 threads per core
Control kernel: built from net-next at commit e2cf07654efb ("ptp:
replace snprintf with sysfs_emit").
Test kernel: This series applied on top of control kernel mentioned above.
Raw output from dmesg for the control [1] and test [2] kernels is
summarized below:

bench_page_pool_simple
- run with default options (i.e. "sudo modprobe bench_page_pool_simple").
Control:
Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)
Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)
Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)
Test:
Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)
Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)
Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)
Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)

bench_page_pool_cross_cpu
- run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").
Control:
Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2
Test:
Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2
[1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
[2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL
> > It is assumed that the API consumer will ensure the page_pool is not destroyed
> > during calls to the stats API.
> >
> > If this series is accepted, I'll submit a follow up patch which will export
> > these stats per RX-ring via ethtool in a driver which uses the page_pool
> > API.
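
(For what it's worth, the driver side of that follow-up would be along these
lines -- again only a sketch, with made-up driver and helper names -- reading
the counters from each RX ring's page_pool inside .get_ethtool_stats():)

/* Copy per-ring page_pool counters into the ethtool stats buffer. */
static u64 *foo_get_page_pool_stats(struct foo_rx_ring *ring, u64 *data)
{
	struct page_pool *pool = ring->page_pool;

	/* The ring (and thus the pool) is guaranteed to outlive this call. */
	*data++ = page_pool_stats_get_fast(pool);
	*data++ = page_pool_stats_get_slow(pool);
	*data++ = page_pool_stats_get_slow_high_order(pool);
	*data++ = page_pool_stats_get_empty(pool);
	*data++ = page_pool_stats_get_refill(pool);
	*data++ = page_pool_stats_get_waived(pool);

	return data;
}
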
> >
> > Joe Damato (6):
> > net: page_pool: Add alloc stats and fast path stat
> > net: page_pool: Add a stat for the slow alloc path
> > net: page_pool: Add a high order alloc stat
> > net: page_pool: Add stat tracking empty ring
> > net: page_pool: Add stat tracking cache refills.
> > net: page_pool: Add a stat tracking waived pages.
> >
> > include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
> > net/core/page_pool.c | 15 +++++++--
> > 2 files changed, 94 insertions(+), 3 deletions(-)
> >
>