Date:   Wed, 2 Feb 2022 15:04:33 +0100
From:   Jesper Dangaard Brouer <jbrouer@...hat.com>
To:     Joe Damato <jdamato@...tly.com>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     brouer@...hat.com, netdev@...r.kernel.org, kuba@...nel.org,
        davem@...emloft.net, ilias.apalodimas@...aro.org, hawk@...nel.org
Subject: Re: [PATCH net-next 0/6] net: page_pool: Add page_pool stat counters

On 27/01/2022 22.08, Joe Damato wrote:
> On Thu, Jan 27, 2022 at 12:51 AM Jesper Dangaard Brouer
> <jbrouer@...hat.com> wrote:
>>
>> On 26/01/2022 23.48, Joe Damato wrote:
>>> Greetings:
>>>
>>> This series adds some stat counters for the page_pool allocation path which
>>> help to track:
>>>
>>>        - fast path allocations
>>>        - slow path order-0 allocations
>>>        - slow path high order allocations
>>>        - refills which failed due to an empty ptr ring, forcing a slow
>>>          path allocation
>>>        - allocations fulfilled via successful refill
>>>        - pages which cannot be added to the cache because of numa mismatch
>>>          (i.e. waived)
>>>
>>> Some static inline wrappers are provided for accessing these stats. The
>>> intention is that drivers which use the page_pool API can, if they choose,
>>> use this stats API.
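
For readers skimming the cover letter, a minimal sketch of what such
per-pool counters and a static inline accessor could look like; the
struct, field and function names below are illustrative, not taken
from the patches:

/* sketch only -- would conceptually live in include/net/page_pool.h */
struct page_pool_alloc_stats {
	u64 fast;		/* fast-path allocations from the cache */
	u64 slow;		/* slow-path order-0 allocations */
	u64 slow_high_order;	/* slow-path high-order allocations */
	u64 empty;		/* refills that found the ptr ring empty */
	u64 refill;		/* allocations fulfilled via refill */
	u64 waive;		/* pages waived due to NUMA mismatch */
};

static inline u64 page_pool_stats_get_fast(const struct page_pool *pool)
{
	/* assumes a hypothetical alloc_stats member in struct page_pool */
	return pool->alloc_stats.fast;
}
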
>>
>> You are adding (always-on) counters to a critical fast path that
>> drivers use for the XDP_DROP use-case.
> 
> If you prefer requiring users explicitly enable these stats, I am
> happy to add a kernel config option (e.g. CONFIG_PAGE_POOL_DEBUG or
> similar) in a v2.
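
A config-gated variant could compile the counters out entirely when the
option is off; a rough sketch, assuming the proposed (hypothetical)
config symbol and an illustrative helper macro:

#ifdef CONFIG_PAGE_POOL_DEBUG
#define page_pool_stat_inc(pool, __stat)				\
	do { (pool)->alloc_stats.__stat++; } while (0)
#else
#define page_pool_stat_inc(pool, __stat) do { } while (0)
#endif

With the option disabled, the XDP_DROP fast path would then compile to
the same code as today.
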
> 
>> I want to see performance measurements as documentation, showing this is
>> not causing a slow-down.
>>
>> I have some performance tests here[1]:
>>    [1]
>> https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/lib
>>
>> Look at:
>>    - bench_page_pool_simple.c and
>>    - bench_page_pool_cross_cpu.c
>>
>> How to use + build this[2]:
>>    [2]
>> https://prototype-kernel.readthedocs.io/en/latest/prototype-kernel/build-process.html
> 
> Thanks for the pointers to the benchmarks.
> 
> In general, I noted that the benchmark results varied fairly
> substantially between repeated runs on the same system.
> 
> Results below suggest that:
>     - bench_page_pool_simple is faster on the test kernel, and
>     - bench_page_pool_cross_cpu is faster on the control
> 
> Subsequent runs of bench_page_pool_cross_cpu on the control, however,
> reveal *much* slower results than shown below.
> 
> Test system:
>    - 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
>    - 2 numa zones, with 18 cores per zone and 2 threads per core
> 
> Control kernel: built from net-next at commit e2cf07654efb ("ptp:
> replace snprintf with sysfs_emit").
> Test kernel: This series applied on top of control kernel mentioned above.
> 
> Raw output from dmesg for control [1] and test [2] kernel summarized below:
> 
> bench_page_pool_simple
>    - run with default options (i.e. "sudo modprobe bench_page_pool_simple").
> 
> Control:
> 
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 13 cycles(tsc) 6.021 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.514 ns (step:0)
> 
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.549 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.658 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 118 cycles(tsc) 51.638 ns (step:0)
> 
> Type:tasklet_page_pool01_fast_path Per elem: 17 cycles(tsc) 7.472 ns (step:0)
> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.585 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 109 cycles(tsc) 47.807 ns (step:0)
> 
> Test:
> 
> Type:for_loop Per elem: 0 cycles(tsc) 0.334 ns (step:0)
> Type:atomic_inc Per elem: 14 cycles(tsc) 6.195 ns (step:0)
> Type:lock Per elem: 31 cycles(tsc) 13.827 ns (step:0)
> 
> Type:no-softirq-page_pool01 Per elem: 44 cycles(tsc) 19.561 ns (step:0)
> Type:no-softirq-page_pool02 Per elem: 45 cycles(tsc) 19.700 ns (step:0)
> Type:no-softirq-page_pool03 Per elem: 108 cycles(tsc) 47.186 ns (step:0)
> 
> Type:tasklet_page_pool01_fast_path Per elem: 12 cycles(tsc) 5.447 ns (step:0)

Watch out for the measurement period: time:0.054478253 (taken from [2]).

If the measurement period becomes too small, the results cannot be
used.  Perhaps I've set the default 'loops' variable too low for these
modern systems.  Hint: it is adjustable via the module parameter
'loops' (e.g. "modprobe bench_page_pool_simple loops=<larger value>").



> Type:tasklet_page_pool02_ptr_ring Per elem: 42 cycles(tsc) 18.501 ns (step:0)
> Type:tasklet_page_pool03_slow Per elem: 106 cycles(tsc) 46.313 ns (step:0)
> 
> bench_page_pool_cross_cpu
>    - run with default options (i.e. "sudo modprobe bench_page_pool_cross_cpu").
> 
> Control:
> Type:page_pool_cross_cpu CPU(0) 1795 cycles(tsc) 782.567 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 1921 cycles(tsc) 837.435 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 960 cycles(tsc) 418.758 ns (step:2)
> Sum Type:page_pool_cross_cpu Average: 1558 cycles(tsc) CPUs:3 step:2
> 
> Test:
> Type:page_pool_cross_cpu CPU(0) 2411 cycles(tsc) 1051.037 ns (step:2)
> Type:page_pool_cross_cpu CPU(1) 2467 cycles(tsc) 1075.204 ns (step:2)
> Type:page_pool_cross_cpu CPU(2) 1233 cycles(tsc) 537.629 ns (step:2)
> Type:page_pool_cross_cpu Average: 2037 cycles(tsc) CPUs:3 step:2

I think the effect you are seeing here is because you placed your stats 
struct on a cache-line that is also used by remote CPUs 
freeing/recycling pages back to the page_pool.
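
The usual way to avoid that kind of false sharing is to give the
producer-only counters their own cache line, e.g. with
____cacheline_aligned_in_smp; a simplified illustration (not the real
struct page_pool layout, and reusing the illustrative stats struct from
above):

struct pool_example {
	/* allocation-side counters: written only by the CPU that owns
	 * the pool's allocation cache
	 */
	struct page_pool_alloc_stats alloc_stats;

	/* recycle side: written by remote CPUs returning pages, kept on
	 * its own cache line so it does not dirty the stats line above
	 */
	struct ptr_ring ring ____cacheline_aligned_in_smp;
};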


> 
> [1]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_control
> [2]: https://gist.githubusercontent.com/jdamato-fsly/385806f06cb95c61ff8cecf7a3645e75/raw/886e3208f5b9c47abdd59bdaa7ecf27994f477b1/page_pool_bench_TESTKERNEL
> 
> 
>>> It is assumed that the API consumer will ensure the page_pool is not
>>> destroyed during calls to the stats API.
>>>
>>> If this series is accepted, I'll submit a follow up patch which will export
>>> these stats per RX-ring via ethtool in a driver which uses the page_pool
>>> API.
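
For that follow-up, the export could plug into a driver's existing
ethtool_ops->get_ethtool_stats callback; a rough sketch with
hypothetical ring/accessor names (the real driver structures and stats
API will differ):

/* copy a few per-ring pool counters into the ethtool data array */
static void drv_copy_pool_stats(struct page_pool *pool, u64 *data)
{
	*data++ = page_pool_stats_get_fast(pool);
	*data++ = page_pool_stats_get_slow(pool);
	*data++ = page_pool_stats_get_waive(pool);
}
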
>>>
>>> Joe Damato (6):
>>>     net: page_pool: Add alloc stats and fast path stat
>>>     net: page_pool: Add a stat for the slow alloc path
>>>     net: page_pool: Add a high order alloc stat
>>>     net: page_pool: Add stat tracking empty ring
>>>     net: page_pool: Add stat tracking cache refills.
>>>     net: page_pool: Add a stat tracking waived pages.
>>>
>>>    include/net/page_pool.h | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
>>>    net/core/page_pool.c    | 15 +++++++--
>>>    2 files changed, 94 insertions(+), 3 deletions(-)
>>>
>>
> 
