Message-ID: <CALALjgzUQUuEVkNXous0kOcHHqiSrTem+n9MjQh6q-8+Azi-sg@mail.gmail.com>
Date:   Wed, 16 Feb 2022 09:04:56 -0800
From:   Joe Damato <jdamato@...tly.com>
To:     Jesper Dangaard Brouer <jbrouer@...hat.com>
Cc:     netdev@...r.kernel.org, kuba@...nel.org,
        ilias.apalodimas@...aro.org, davem@...emloft.net, hawk@...nel.org,
        saeed@...nel.org, ttoukan.linux@...il.com, brouer@...hat.com
Subject: Re: [net-next v5 1/2] page_pool: Add page_pool stat counters

On Tue, Feb 15, 2022 at 7:41 AM Jesper Dangaard Brouer
<jbrouer@...hat.com> wrote:
>
>
> On 14/02/2022 21.02, Joe Damato wrote:
> > Add per-cpu per-pool statistics counters for the allocation path of a page
> > pool.
> >
> > This code is disabled by default and a kernel config option is provided for
> > users who wish to enable them.
> >
> > The statistics added are:
> >       - fast: successful fast path allocations
> >       - slow: slow path order-0 allocations
> >       - slow_high_order: slow path high order allocations
> >       - empty: ptr ring is empty, so a slow path allocation was forced.
> >       - refill: an allocation which triggered a refill of the cache
> >       - waive: pages obtained from the ptr ring that cannot be added to
> >         the cache due to a NUMA mismatch.
> >
> > Signed-off-by: Joe Damato <jdamato@...tly.com>
> > ---
> >   include/net/page_pool.h | 18 ++++++++++++++++++
> >   net/Kconfig             | 13 +++++++++++++
> >   net/core/page_pool.c    | 37 +++++++++++++++++++++++++++++++++----
> >   3 files changed, 64 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> > index 97c3c19..d827ab1 100644
> > --- a/include/net/page_pool.h
> > +++ b/include/net/page_pool.h
> > @@ -135,7 +135,25 @@ struct page_pool {
> >       refcount_t user_cnt;
> >
> >       u64 destroy_cnt;
> > +#ifdef CONFIG_PAGE_POOL_STATS
> > +     struct page_pool_stats __percpu *stats;
> > +#endif
> > +};
>
> You still have to consider cache-line locality, as I have pointed out
> before.

You are right, I forgot to include that in this revision. Sorry about
that and thanks for the review and response.

> This placement is wrong!
>
> Output from pahole:
>
>   /* --- cacheline 23 boundary (1472 bytes) --- */
>   atomic_t                   pages_state_release_cnt; /*  1472     4 */
>   refcount_t                 user_cnt;             /*  1476     4 */
>   u64                        destroy_cnt;          /*  1480     8 */
>
> Your *stats pointer ends up on a cache-line that "remote" CPUs will write
> into (pages_state_release_cnt).
> This is why we see a slowdown in the 'bench_page_pool_cross_cpu' test.

If I give *stats its own cache-line by adding
____cacheline_aligned_in_smp (but leaving the placement at the end of
the page_pool struct), pahole reports:

/* --- cacheline 24 boundary (1536 bytes) --- */
atomic_t                   pages_state_release_cnt; /*  1536     4 */
refcount_t                 user_cnt;             /*  1540     4 */
u64                        destroy_cnt;          /*  1544     8 */

/* XXX 48 bytes hole, try to pack */

/* --- cacheline 25 boundary (1600 bytes) --- */
struct page_pool_stats *   stats;                /*  1600     8 */
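
For anyone who wants to reproduce that layout without building a kernel,
here is a minimal user-space sketch. It is not the patch itself: the struct
and helper names are hypothetical, plain integer types stand in for atomic_t
and refcount_t, a 64-byte cache line is assumed, and the GCC/Clang aligned
attribute stands in for ____cacheline_aligned_in_smp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, matching the pahole boundaries above. */
#define SKETCH_CACHELINE 64

/*
 * Hypothetical stand-in for the tail of struct page_pool. Plain integer
 * types replace atomic_t and refcount_t so this builds outside the kernel.
 */
struct pool_tail_sketch {
	int32_t  pages_state_release_cnt; /* written by remote CPUs */
	uint32_t user_cnt;
	uint64_t destroy_cnt;
	/* User-space stand-in for ____cacheline_aligned_in_smp. */
	void *stats __attribute__((aligned(SKETCH_CACHELINE)));
};

/* Index of the cache line that contains a given byte offset. */
static inline size_t sketch_line_of(size_t offset)
{
	return offset / SKETCH_CACHELINE;
}

/* Bytes of padding between the end of destroy_cnt and the start of stats. */
static inline size_t sketch_hole_before_stats(void)
{
	return offsetof(struct pool_tail_sketch, stats) -
	       (offsetof(struct pool_tail_sketch, destroy_cnt) + sizeof(uint64_t));
}

/* Compile-time check that stats begins on a cache-line boundary. */
static_assert(offsetof(struct pool_tail_sketch, stats) % SKETCH_CACHELINE == 0,
	      "stats must start on a cache-line boundary");
```

In this sketch the three counters occupy 16 bytes of one line and stats
starts the next line, which is where the 48 byte hole in the pahole output
comes from.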

Re-running bench_page_pool_cross_cpu loops=20000000 returning_cpus=4
still shows a fairly large run-to-run variation in cycles (measurement
period of 34.128419339 sec): a delta of roughly 287 cycles across the
runs I just performed back to back.

The best measurements after making the cache-line change described
above are faster than compiling the kernel with stats disabled. The
worst measurements, however, are very close to the data I submitted in
the cover letter for this revision.

As far as I can tell xdp_mem_id is not written to particularly often -
only when RX rings are configured in the driver - so I also tried
moving *stats above xdp_mem_id so that they share a cache-line and
reduce the size of the hole between xdp_mem_id and pp_alloc_cache.

pahole reports:

/* --- cacheline 3 boundary (192 bytes) --- */
struct page_pool_stats *   stats;                /*   192     8 */
u32                        xdp_mem_id;           /*   200     4 */
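
A similar user-space sketch of this second placement, again with
hypothetical names and under the same assumptions (64-byte lines, the
aligned attribute standing in for ____cacheline_aligned_in_smp). The 192
byte array and the alloc_cache array are placeholders, not the real
members of struct page_pool.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines, matching the pahole boundaries above. */
#define SKETCH_CACHELINE 64

/* Hypothetical stand-in for the head of struct page_pool. */
struct pool_head_sketch {
	unsigned char earlier_members[192]; /* placeholder, not the real fields */
	void *stats;         /* read-mostly pointer to the per-cpu stats */
	uint32_t xdp_mem_id; /* written only when RX rings are configured */
	/* Placeholder for pp_alloc_cache, which starts on its own line. */
	unsigned char alloc_cache[SKETCH_CACHELINE]
		__attribute__((aligned(SKETCH_CACHELINE)));
};

/* Nonzero if two byte offsets fall within the same cache line. */
static inline int head_same_line(size_t a, size_t b)
{
	return a / SKETCH_CACHELINE == b / SKETCH_CACHELINE;
}

/* Bytes of padding between the end of xdp_mem_id and the alloc cache. */
static inline size_t head_hole_before_alloc(void)
{
	return offsetof(struct pool_head_sketch, alloc_cache) -
	       (offsetof(struct pool_head_sketch, xdp_mem_id) + sizeof(uint32_t));
}
```

With 8-byte pointers the hole in this sketch is 52 bytes, versus 60 bytes
when xdp_mem_id sits alone at offset 192.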

Results of bench_page_pool_cross_cpu loops=20000000 returning_cpus=4
are the same as above -- the best measurements are faster than stats
disabled, the worst are very close to what I mentioned in the cover
letter for this v5.

I am happy to submit a v6 with *stats placed in either location: on
its own cache-line, at the expense of a larger page_pool struct, or
near xdp_mem_id, where it consumes some of the hole between xdp_mem_id
and pp_alloc_cache.

Do you have a preference on giving the stats pointer its own
cache-line vs sharing a line with xdp_mem_id?

In either case, the benchmarks don't seem to show a consistent,
significant improvement on my test hardware. Is there another
benchmark or a different set of arguments you think I should use when
running this test?

My apologies if I am missing something obvious here.

Thanks,
Joe