linux-kernel - Re: [PATCH net-next 0/6] page

Message-ID: <YFrvTtS8E/C5vYgo@enceladus>
Date:   Wed, 24 Mar 2021 09:50:38 +0200
From:   Ilias Apalodimas <ilias.apalodimas@...aro.org>
To:     Alexander Lobakin <alobakin@...me>
Cc:     Matteo Croce <mcroce@...ux.microsoft.com>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>,
        netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        Jonathan Lemon <jonathan.lemon@...il.com>,
        "David S. Miller" <davem@...emloft.net>,
        Jesper Dangaard Brouer <hawk@...nel.org>,
        Lorenzo Bianconi <lorenzo@...nel.org>,
        Saeed Mahameed <saeedm@...dia.com>,
        David Ahern <dsahern@...il.com>,
        Saeed Mahameed <saeed@...nel.org>, Andrew Lunn <andrew@...n.ch>
Subject: Re: [PATCH net-next 0/6] page_pool: recycle buffers

Hi Alexander,

On Tue, Mar 23, 2021 at 08:03:46PM +0000, Alexander Lobakin wrote:
> From: Ilias Apalodimas <ilias.apalodimas@...aro.org>
> Date: Tue, 23 Mar 2021 19:01:52 +0200
> 
> > On Tue, Mar 23, 2021 at 04:55:31PM +0000, Alexander Lobakin wrote:
> > > > > > > >
> >
> > [...]
> >
> > > > > > >
> > > > > > > Thanks for the testing!
> > > > > > > Any chance you can get a perf measurement on this?
> > > > > >
> > > > > > I guess you mean perf-report (--stdio) output, right?
> > > > > >
> > > > >
> > > > > Yea,
> > > > > As hinted below, I am just trying to figure out if on Alexander's platform the
> > > > > cost of syncing is bigger than free-allocate. I remember one armv7 where that
> > > > > was the case.
> > > > >
> > > > > > > Is DMA syncing taking a substantial amount of your cpu usage?
> > > > > >
> > > > > > (+1 this is an important question)
> > >
> > > Sure, I'll drop perf tools to my test env and share the results,
> > > maybe tomorrow or in a few days.
> 
> Oh we-e-e-ell...
> Looks like I've been fooled by I-cache misses or smth like that.
> That happens sometimes, not only on my machines, and not only on
> MIPS if I'm not mistaken.
> Sorry for confusing you guys.
> 
> I got drastically different numbers after I enabled CONFIG_KALLSYMS +
> CONFIG_PERF_EVENTS for perf tools.
> The only difference in code is that I rebased onto Mel's
> mm-bulk-rebase-v6r4.
> 
> (lunar is my WIP NIC driver)
> 
> 1. 5.12-rc3 baseline:
> 
> TCP: 566 Mbps
> UDP: 615 Mbps
> 
> perf top:
>      4.44%  [lunar]              [k] lunar_rx_poll_page_pool
>      3.56%  [kernel]             [k] r4k_wait_irqoff
>      2.89%  [kernel]             [k] free_unref_page
>      2.57%  [kernel]             [k] dma_map_page_attrs
>      2.32%  [kernel]             [k] get_page_from_freelist
>      2.28%  [lunar]              [k] lunar_start_xmit
>      1.82%  [kernel]             [k] __copy_user
>      1.75%  [kernel]             [k] dev_gro_receive
>      1.52%  [kernel]             [k] cpuidle_enter_state_coupled
>      1.46%  [kernel]             [k] tcp_gro_receive
>      1.35%  [kernel]             [k] __rmemcpy
>      1.33%  [nf_conntrack]       [k] nf_conntrack_tcp_packet
>      1.30%  [kernel]             [k] __dev_queue_xmit
>      1.22%  [kernel]             [k] pfifo_fast_dequeue
>      1.17%  [kernel]             [k] skb_release_data
>      1.17%  [kernel]             [k] skb_segment
> 
> free_unref_page() and get_page_from_freelist() consume a lot.
> 
> 2. 5.12-rc3 + Page Pool recycling by Matteo:
> TCP: 589 Mbps
> UDP: 633 Mbps
> 
> perf top:
>      4.27%  [lunar]              [k] lunar_rx_poll_page_pool
>      2.68%  [lunar]              [k] lunar_start_xmit
>      2.41%  [kernel]             [k] dma_map_page_attrs
>      1.92%  [kernel]             [k] r4k_wait_irqoff
>      1.89%  [kernel]             [k] __copy_user
>      1.62%  [kernel]             [k] dev_gro_receive
>      1.51%  [kernel]             [k] cpuidle_enter_state_coupled
>      1.44%  [kernel]             [k] tcp_gro_receive
>      1.40%  [kernel]             [k] __rmemcpy
>      1.38%  [nf_conntrack]       [k] nf_conntrack_tcp_packet
>      1.37%  [kernel]             [k] free_unref_page
>      1.35%  [kernel]             [k] __dev_queue_xmit
>      1.30%  [kernel]             [k] skb_segment
>      1.28%  [kernel]             [k] get_page_from_freelist
>      1.27%  [kernel]             [k] r4k_dma_cache_inv
> 
> +20 Mbps increase on both TCP and UDP. free_unref_page() and
> get_page_from_freelist() dropped down the list significantly.
> 
> 3. 5.12-rc3 + Page Pool recycling + PP bulk allocator (Mel & Jesper):
> TCP: 596 Mbps
> UDP: 641 Mbps
> 
> perf top:
>      4.38%  [lunar]              [k] lunar_rx_poll_page_pool
>      3.34%  [kernel]             [k] r4k_wait_irqoff
>      3.14%  [kernel]             [k] dma_map_page_attrs
>      2.49%  [lunar]              [k] lunar_start_xmit
>      1.85%  [kernel]             [k] dev_gro_receive
>      1.76%  [kernel]             [k] free_unref_page
>      1.76%  [kernel]             [k] __copy_user
>      1.65%  [kernel]             [k] inet_gro_receive
>      1.57%  [kernel]             [k] tcp_gro_receive
>      1.48%  [kernel]             [k] cpuidle_enter_state_coupled
>      1.43%  [nf_conntrack]       [k] nf_conntrack_tcp_packet
>      1.42%  [kernel]             [k] __rmemcpy
>      1.25%  [kernel]             [k] skb_segment
>      1.21%  [kernel]             [k] r4k_dma_cache_inv
> 
> +10 Mbps on top of recycling.
> get_page_from_freelist() is gone.
> NAPI polling, CPU idle cycle (r4k_wait_irqoff) and DMA mapping
> routine became the top consumers.
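
For reference, the gains between the three quoted runs can be worked out directly from the throughput figures above. The snippet below is only a small arithmetic sketch: the run labels and the `delta` helper are made up for illustration, and the numbers are copied from the quoted results.

```python
# Throughput figures (Mbps) copied from the three quoted runs.
# The dictionary keys are illustrative labels, not names from the thread.
RUNS = {
    "baseline": {"TCP": 566, "UDP": 615},
    "recycling": {"TCP": 589, "UDP": 633},
    "recycling+bulk": {"TCP": 596, "UDP": 641},
}


def delta(before, after, proto):
    """Return (absolute gain in Mbps, relative gain in percent)."""
    old = RUNS[before][proto]
    new = RUNS[after][proto]
    return new - old, 100.0 * (new - old) / old


if __name__ == "__main__":
    steps = (
        ("baseline", "recycling"),
        ("recycling", "recycling+bulk"),
        ("baseline", "recycling+bulk"),
    )
    for proto in ("TCP", "UDP"):
        for before, after in steps:
            gain, pct = delta(before, after, proto)
            print(f"{proto} {before} -> {after}: +{gain} Mbps ({pct:.1f}%)")
```

The quoted "+20 Mbps" and "+10 Mbps" are rounded: from these figures recycling adds 23 Mbps (TCP) and 18 Mbps (UDP), and the bulk allocator adds a further 7 Mbps (TCP) and 8 Mbps (UDP).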

Again, thanks for the extensive testing.
I assume you don't use page pool to map the buffers, right?
Because if the mapping is preserved, the only thing you have to do is sync it
after the packet reception.
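
To illustrate the point about preserved mappings, here is a toy model that counts DMA operations for recycled buffers. It is a simplified sketch, not the kernel page_pool API: `ToyPool`, `run` and the counters are hypothetical names, and it ignores details such as syncing for the device on refill.

```python
class ToyPool:
    """Toy buffer pool that counts DMA map/unmap/sync operations.

    Purely illustrative; none of these names come from the kernel.
    """

    def __init__(self, keep_mapping):
        self.keep_mapping = keep_mapping
        self.free = []  # recycled buffers
        self.maps = 0
        self.unmaps = 0
        self.syncs = 0

    def alloc(self):
        # Reuse a recycled buffer if there is one, else create a fresh one.
        buf = self.free.pop() if self.free else {"mapped": False}
        if not buf["mapped"]:
            self.maps += 1  # a DMA mapping has to be (re)created
            buf["mapped"] = True
        return buf

    def receive(self, buf):
        # After the device has written the packet, sync it for the CPU.
        self.syncs += 1

    def recycle(self, buf):
        if not self.keep_mapping:
            self.unmaps += 1  # driver unmaps on every packet
            buf["mapped"] = False
        self.free.append(buf)


def run(packets, keep_mapping):
    """Push `packets` buffers through alloc -> receive -> recycle."""
    pool = ToyPool(keep_mapping)
    for _ in range(packets):
        buf = pool.alloc()
        pool.receive(buf)
        pool.recycle(buf)
    return pool.maps, pool.unmaps, pool.syncs
```

In this model, `run(1000, keep_mapping=True)` maps a single buffer once and then only syncs per packet, while `run(1000, keep_mapping=False)` pays for a map and an unmap on every packet on top of the sync.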

> 
> 4-5. __always_inline for rmqueue_bulk() and __rmqueue_pcplist(),
> removing 'noinline' from net/core/page_pool.c etc.
> 
> ...makes absolutely no sense anymore.
> I see Mel took Jesper's patch to make __rmqueue_pcplist() inline into
> mm-bulk-rebase-v6r5, not sure if it's really needed now.
> 
> So I'm really glad we sorted out the things and I can see the real
> performance improvements from both recycling and bulk allocations.
> 

Those will probably be even bigger with an IOMMU/SMMU present.

[...]

Cheers
/Ilias