netdev - Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170214194615.3feddd07@redhat.com>
Date:   Tue, 14 Feb 2017 19:46:15 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Alexander Duyck <alexander.duyck@...il.com>
Cc:     Tariq Toukan <ttoukan.linux@...il.com>,
        Eric Dumazet <edumazet@...gle.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Martin KaFai Lau <kafai@...com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Brenden Blanco <bblanco@...mgrid.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        linux-mm <linux-mm@...ck.org>, brouer@...hat.com
Subject: Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

On Tue, 14 Feb 2017 09:29:54 -0800
Alexander Duyck <alexander.duyck@...il.com> wrote:

> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@...il.com> wrote:
> >
> >
> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >>
> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> <brouer@...hat.com> wrote:
> >>  
> >>> It is important to understand that there are two cases for the cost of
> >>> an atomic op, which depend on the cache-coherency state of the
> >>> cacheline.
> >>>
> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >>>
> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
> >>>  
> >> Okay, it seems you guys really want a patch that I said was not giving
> >> good results
> >>
> >> Let me publish the numbers I get , adding or not the last (and not
> >> official) patch.
> >>
> >> If I _force_ the user space process to run on the other node,
> >> then the results are not the ones Alex or you are expecting.
> >>
> >> I have with this patch about 2.7 Mpps of this silly single TCP flow,
> >> and 3.5 Mpps without it.
> >>
> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
> >> 0.00      0.00      4.50
> >>
> >> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
> >>
> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >>      31.07%  [kernel]      [k] skb_release_data
> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >>       0.90%  [kernel]      [k] skb_release_head_state
> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >>       0.33%  [kernel]      [k] ___cache_free
> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >>
> >> Profile of cpu handling mlx4 softirqs (NUMA node 0)
> >>
> >>
> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      12.92%  [kernel]          [k] napi_gro_frags
> >>       7.28%  [kernel]          [k] inet_gro_receive
> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >>       5.10%  [kernel]          [k] dev_gro_receive
> >>       4.87%  [kernel]          [k] skb_gro_receive
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       2.04%  [kernel]          [k] __build_skb
> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >>
> >> Without the latest  patch (the exact patch series v3 I submitted),
> >> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
> >> reads.
> >>
> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
> >> 0.00      0.00      4.50
> >>
> >> Profiles of the two cpus :
> >>
> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >>       6.42%  [kernel]      [k] skb_release_data
> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >>       1.48%  [kernel]      [k] skb_release_head_state
> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >>       0.43%  [kernel]      [k] page_frag_free
> >>       0.38%  [kernel]      [k] ___cache_free
> >>       0.37%  [kernel]      [k] tcp_established_options
> >>       0.37%  [kernel]      [k] __ip_local_out
> >>
> >>
> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >>      26.47%  [kernel]          [k] napi_gro_frags
> >>       7.02%  [kernel]          [k] inet_gro_receive
> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >>       5.17%  [kernel]          [k] dev_gro_receive
> >>       4.80%  [kernel]          [k] skb_gro_receive
> >>       2.61%  [kernel]          [k] __build_skb
> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >>
> >>
> >> So probably this will need further analysis, outside of the scope of
> >> this patch series.
> >>
> >> Could we now please Ack this v3 and merge it ?
> >>
> >> Thanks.  
> >
> > Thanks Eric.
> >
> > As the previous series caused hangs, we must run functional regression tests
> > over this series as well.
> > Run has already started, and results will be available tomorrow morning.
> >
> > In general, I really like this series. The re-factorization looks more
> > elegant and more correct, functionally.
> >
> > However, performance wise: we fear that the numbers will be drastically
> > lower with this transition to order-0 pages,
> > because of the (becoming critical) page allocator and dma operations
> > bottlenecks, especially on systems with costly
> > dma operations, such as ARM, iommu=on, etc...  
> 
> So to give you an idea I originally came up with the approach used in
> the Intel drivers when I was dealing with a PowerPC system where the
> IOMMU was a requirement.  With this setup correctly the map/unmap
> calls should be almost non-existent.  Basically the only time we
> should have to allocate or free a page is if something is sitting on
> the pages for an excessively long time or if the interrupt is bouncing
> between memory nodes which forces us to free the memory since it is
> sitting on the wrong node.
>
> > We already have this exact issue in mlx5, where we moved to order-0
> > allocations with a fixed size cache, but that was not enough.
> > Customers of mlx5 have already complained about the performance degradation,
> > and currently this is hurting our business.
> > We get a clear nack from our performance regression team regarding doing the
> > same in mlx4.  
> 
> What form of recycling were you doing?  If you were doing the offset
> based setup then that obviously only gets you so much.  The advantage
> to using the page count is that you get almost a mobius strip effect
> for your buffers and descriptor rings where you just keep flipping the
> page offset back and forth via an XOR and the only cost is
> dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
> do the full allocation, mapping, and unmapping.
> 
> > So, the question is, can we live with this degradation until those
> > bottleneck challenges are addressed?
> > Following our perf experts feedback, I cannot just simply Ack. We need to
> > have a clear plan to close the perf gap or reduce the impact.  
> 
> I think we need to define what is the degradation you are expecting to
> see.  Using the page count based approach should actually be better in
> some cases than the order 3 page and just walking frags approach since
> the theoretical reuse is infinite instead of fixed.
> 
> > Internally, I already implemented "dynamic page-cache" and "page-reuse"
> > mechanisms in the driver,
> > and together they totally bridge the performance gap.
> > That's why I would like to hear from Jesper what is the status of his
> > page_pool API, it is promising and could totally solve these issues.
> >
> > Regards,
> > Tariq  
> 
> The page pool may provide gains but we have to actually see it before
> we can guarantee it.  If worse comes to worse I might just resort to
> standardizing the logic used for the Intel driver page count based
> approach.  Then if nothing else we would have a standard path for all
> the drivers to use if we start going down this route.

With this Intel driver page count based recycle approach, the recycle
size is tied to the size of the RX ring.  As Eric and Tariq discovered.
And for other performance reasons (memory footprint of walking RX ring
data-structures), don't want to increase the RX ring sizes.  Thus, it
create two opposite performance needs.  That is why I think a more
explicit approach with a pool is more attractive.

How is this approach doing to work for XDP?
(XDP doesn't "share" the page, and in-general we don't want the extra
atomic.)

We absolutely need recycling with XDP, when transmitting out another
device, and the other devices DMA-TX completion need some way of
returning this page.
What is basically needed is a standardized callback to allow the remote
driver to return the page to the originating driver.  As we don't have
a NDP for XDP-forward/transmit yet, we could pass this callback as a
parameter along with the packet-page to send?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer