netdev - Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Tue, 14 Feb 2017 09:29:54 -0800
From:   Alexander Duyck <alexander.duyck@...il.com>
To:     Tariq Toukan <ttoukan.linux@...il.com>
Cc:     Eric Dumazet <edumazet@...gle.com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Martin KaFai Lau <kafai@...com>,
        Saeed Mahameed <saeedm@...lanox.com>,
        Willem de Bruijn <willemb@...gle.com>,
        Brenden Blanco <bblanco@...mgrid.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        linux-mm <linux-mm@...ck.org>
Subject: Re: [PATCH v3 net-next 08/14] mlx4: use order-0 pages for RX

On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.linux@...il.com> wrote:
>
>
> On 14/02/2017 3:45 PM, Eric Dumazet wrote:
>>
>> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
>> <brouer@...hat.com> wrote:
>>
>>> It is important to understand that there are two cases for the cost of
>>> an atomic op, which depend on the cache-coherency state of the
>>> cacheline.
>>>
>>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
>>>
>>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
>>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
>>>
>> Okay, it seems you guys really want a patch that I said was not giving
>> good results
>>
>> Let me publish the numbers I get , adding or not the last (and not
>> official) patch.
>>
>> If I _force_ the user space process to run on the other node,
>> then the results are not the ones Alex or you are expecting.
>>
>> I have with this patch about 2.7 Mpps of this silly single TCP flow,
>> and 3.5 Mpps without it.
>>
>> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
>> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95
>> 0.00      0.00      4.50
>>
>> Profile of the cpu on NUMA node 1 ( netserver consuming data ) :
>>
>>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
>>      31.07%  [kernel]      [k] skb_release_data
>>       4.24%  [kernel]      [k] skb_copy_datagram_iter
>>       1.35%  [kernel]      [k] copy_page_to_iter
>>       0.98%  [kernel]      [k] _raw_spin_lock
>>       0.90%  [kernel]      [k] skb_release_head_state
>>       0.60%  [kernel]      [k] tcp_transmit_skb
>>       0.51%  [kernel]      [k] mlx4_en_xmit
>>       0.33%  [kernel]      [k] ___cache_free
>>       0.28%  [kernel]      [k] tcp_rcv_established
>>
>> Profile of cpu handling mlx4 softirqs (NUMA node 0)
>>
>>
>>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
>>      12.92%  [kernel]          [k] napi_gro_frags
>>       7.28%  [kernel]          [k] inet_gro_receive
>>       7.17%  [kernel]          [k] tcp_gro_receive
>>       5.10%  [kernel]          [k] dev_gro_receive
>>       4.87%  [kernel]          [k] skb_gro_receive
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       2.04%  [kernel]          [k] __build_skb
>>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
>>       1.01%  [kernel]          [k] tcp4_gro_receive
>>       0.65%  [kernel]          [k] kmem_cache_alloc
>>       0.45%  [kernel]          [k] _raw_spin_lock
>>
>> Without the latest  patch (the exact patch series v3 I submitted),
>> thus with this atomic_inc() in mlx4_en_process_rx_cq  instead of only
>> reads.
>>
>> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
>> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51
>> 0.00      0.00      4.50
>>
>> Profiles of the two cpus :
>>
>>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
>>       6.42%  [kernel]      [k] skb_release_data
>>       5.65%  [kernel]      [k] skb_copy_datagram_iter
>>       1.83%  [kernel]      [k] copy_page_to_iter
>>       1.59%  [kernel]      [k] _raw_spin_lock
>>       1.48%  [kernel]      [k] skb_release_head_state
>>       0.72%  [kernel]      [k] tcp_transmit_skb
>>       0.68%  [kernel]      [k] mlx4_en_xmit
>>       0.43%  [kernel]      [k] page_frag_free
>>       0.38%  [kernel]      [k] ___cache_free
>>       0.37%  [kernel]      [k] tcp_established_options
>>       0.37%  [kernel]      [k] __ip_local_out
>>
>>
>>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
>>      26.47%  [kernel]          [k] napi_gro_frags
>>       7.02%  [kernel]          [k] inet_gro_receive
>>       5.89%  [kernel]          [k] tcp_gro_receive
>>       5.17%  [kernel]          [k] dev_gro_receive
>>       4.80%  [kernel]          [k] skb_gro_receive
>>       2.61%  [kernel]          [k] __build_skb
>>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
>>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
>>       0.95%  [kernel]          [k] tcp4_gro_receive
>>       0.51%  [kernel]          [k] kmem_cache_alloc
>>       0.42%  [kernel]          [k] __inet_lookup_established
>>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
>>
>>
>> So probably this will need further analysis, outside of the scope of
>> this patch series.
>>
>> Could we now please Ack this v3 and merge it ?
>>
>> Thanks.
>
> Thanks Eric.
>
> As the previous series caused hangs, we must run functional regression tests
> over this series as well.
> Run has already started, and results will be available tomorrow morning.
>
> In general, I really like this series. The re-factorization looks more
> elegant and more correct, functionally.
>
> However, performance wise: we fear that the numbers will be drastically
> lower with this transition to order-0 pages,
> because of the (becoming critical) page allocator and dma operations
> bottlenecks, especially on systems with costly
> dma operations, such as ARM, iommu=on, etc...

So to give you an idea I originally came up with the approach used in
the Intel drivers when I was dealing with a PowerPC system where the
IOMMU was a requirement.  With this setup correctly the map/unmap
calls should be almost non-existent.  Basically the only time we
should have to allocate or free a page is if something is sitting on
the pages for an excessively long time or if the interrupt is bouncing
between memory nodes which forces us to free the memory since it is
sitting on the wrong node.

> We already have this exact issue in mlx5, where we moved to order-0
> allocations with a fixed size cache, but that was not enough.
> Customers of mlx5 have already complained about the performance degradation,
> and currently this is hurting our business.
> We get a clear nack from our performance regression team regarding doing the
> same in mlx4.

What form of recycling were you doing?  If you were doing the offset
based setup then that obviously only gets you so much.  The advantage
to using the page count is that you get almost a mobius strip effect
for your buffers and descriptor rings where you just keep flipping the
page offset back and forth via an XOR and the only cost is
dma_sync_for_cpu, get_page, dma_sync_for_device instead of having to
do the full allocation, mapping, and unmapping.

> So, the question is, can we live with this degradation until those
> bottleneck challenges are addressed?
> Following our perf experts feedback, I cannot just simply Ack. We need to
> have a clear plan to close the perf gap or reduce the impact.

I think we need to define what is the degradation you are expecting to
see.  Using the page count based approach should actually be better in
some cases than the order 3 page and just walking frags approach since
the theoretical reuse is infinite instead of fixed.

> Internally, I already implemented "dynamic page-cache" and "page-reuse"
> mechanisms in the driver,
> and together they totally bridge the performance gap.
> That's why I would like to hear from Jesper what is the status of his
> page_pool API, it is promising and could totally solve these issues.
>
> Regards,
> Tariq

The page pool may provide gains but we have to actually see it before
we can guarantee it.  If worse comes to worse I might just resort to
standardizing the logic used for the Intel driver page count based
approach.  Then if nothing else we would have a standard path for all
the drivers to use if we start going down this route.

- Alex