

Date:   Thu, 1 Feb 2018 17:42:40 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Björn Töpel <bjorn.topel@...il.com>
Cc:     magnus.karlsson@...el.com, alexander.h.duyck@...el.com,
        alexander.duyck@...il.com, john.fastabend@...il.com, ast@...com,
        willemdebruijn.kernel@...il.com, daniel@...earbox.net,
        netdev@...r.kernel.org,
        Björn Töpel <bjorn.topel@...el.com>,
        michael.lundkvist@...csson.com, jesse.brandeburg@...el.com,
        anjali.singhai@...el.com, jeffrey.b.shaw@...el.com,
        ferruh.yigit@...el.com, qi.z.zhang@...el.com, brouer@...hat.com,
        Saeed Mahameed <saeedm@...lanox.com>
Subject: Re: [RFC PATCH 00/24] Introducing AF_XDP support

On Wed, 31 Jan 2018 14:53:32 +0100 Björn Töpel <bjorn.topel@...il.com> wrote:

> * In this RFC, do not use an XDP_REDIRECT action other than
>   bpf_xdpsk_redirect for XDP_DRV_ZC. This is because a zero-copy
>   allocated buffer will then be sent to a cpu id / queue_pair through
>   ndo_xdp_xmit that does not know this has been ZC allocated. It will
>   then do a page_free on it and you will get a crash. How to extend
>   ndo_xdp_xmit with some free/completion function that could be called
>   instead of page_free?  Hopefully, the same solution can be used here
>   as in the first problem item in this section.

I'm prototyping an extension of ndo_xdp_xmit with a free/completion
function call that looks at the xdp_rxq_info to determine which
allocator type the RX-NIC used (info per RXq), and invokes the
appropriate callback.
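
A minimal userspace sketch of that dispatch idea, not the prototype
itself: every sketch_* name, the three allocator types, and the
counters standing in for real frees are assumptions made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator types, recorded once per RX queue. */
enum sketch_mem_type {
    SKETCH_MEM_PAGE_ALLOC = 0, /* plain page allocator */
    SKETCH_MEM_PAGE_POOL,      /* page_pool recycling */
    SKETCH_MEM_ZERO_COPY,      /* e.g. an AF_XDP zero-copy buffer */
    SKETCH_MEM_MAX
};

/* Hypothetical per-RX-queue record, standing in for the extended
 * xdp_rxq_info. */
struct sketch_rxq_mem {
    enum sketch_mem_type type;
    void *pool; /* allocator-private handle, may be NULL */
};

typedef void (*sketch_return_fn)(void *pool, void *buf);

/* Counters instead of real frees, so the dispatch is observable. */
int sketch_returned[SKETCH_MEM_MAX];

static void ret_page_alloc(void *pool, void *buf)
{
    (void)pool; (void)buf;
    sketch_returned[SKETCH_MEM_PAGE_ALLOC]++;
}

static void ret_page_pool(void *pool, void *buf)
{
    (void)pool; (void)buf;
    sketch_returned[SKETCH_MEM_PAGE_POOL]++;
}

static void ret_zero_copy(void *pool, void *buf)
{
    (void)pool; (void)buf;
    sketch_returned[SKETCH_MEM_ZERO_COPY]++;
}

/* One callback per allocator type; supporting a new allocator means
 * adding one more entry here. */
static const sketch_return_fn sketch_return_ops[SKETCH_MEM_MAX] = {
    [SKETCH_MEM_PAGE_ALLOC] = ret_page_alloc,
    [SKETCH_MEM_PAGE_POOL]  = ret_page_pool,
    [SKETCH_MEM_ZERO_COPY]  = ret_zero_copy,
};

/* Called from TX completion: the TX side does not need to know how
 * the RX side allocated the buffer, it only follows the record. */
void sketch_return_frame(const struct sketch_rxq_mem *mem, void *buf)
{
    assert(mem != NULL && mem->type < SKETCH_MEM_MAX);
    sketch_return_ops[mem->type](mem->pool, buf);
}
```

The point of the table is that the TX driver never calls page_free
directly, so a ZC-allocated buffer is handed back to whatever
allocated it.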

I dusted off my old page_pool implementation (modifying it to run
outside page-allocator).  Implemented XDP_REDIRECT for mlx5, and
extended xdp_rxq_info, and stored needed info in ixgbe for DMA TX
completion.  Disabled the mlx5 page cache, and instead use the
page_pool.

It worked surprisingly well... The test is: pktgen on an mlx5 100Gbit/s
NIC, and XDP_REDIRECT with the xdp_redirect_map sample, out a 10G ixgbe
NIC.

Performance is surprisingly good... This tests DMA-TX completion on
ixgbe, which calls "xdp_return_frame", which is mapped to
page_pool_put_page(pool, page).  Here DMA-TX-completion runs on CPU#3
and mlx5 RX runs on CPU#0.  (Internally page_pool uses ptr_ring, which
is what gives the good cross-CPU performance.)
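
To illustrate why a ring of pointers suits this cross-CPU return
path, here is a small userspace sketch loosely modeled on the ptr_ring
idea: a NULL slot means "empty", so producer and consumer each keep a
private index and only share the slots.  The sketch_* names, the ring
size, and the use of C11 atomics (instead of the kernel's locks and
barriers) are assumptions, and it supports one producer and one
consumer only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define SKETCH_RING_SIZE 8

struct sketch_ring {
    _Atomic(void *) slot[SKETCH_RING_SIZE]; /* NULL means empty */
    size_t prod; /* touched only by the producer (TX completion) */
    size_t cons; /* touched only by the consumer (RX refill) */
};

/* Producer: hand a page back. Returns 0 on success, -1 if the ring
 * is full (the caller would then free the page some other way). */
int sketch_ring_produce(struct sketch_ring *r, void *page)
{
    /* NULL marks an empty slot, so it can never be stored. */
    assert(page != NULL);

    if (atomic_load_explicit(&r->slot[r->prod], memory_order_acquire))
        return -1; /* slot still occupied: ring is full */

    atomic_store_explicit(&r->slot[r->prod], page, memory_order_release);
    r->prod = (r->prod + 1) % SKETCH_RING_SIZE;
    return 0;
}

/* Consumer: take a recycled page, or NULL if none is available. */
void *sketch_ring_consume(struct sketch_ring *r)
{
    void *page = atomic_load_explicit(&r->slot[r->cons],
                                      memory_order_acquire);
    if (!page)
        return NULL; /* ring is empty */

    /* Clearing the slot is what tells the producer it is free. */
    atomic_store_explicit(&r->slot[r->cons], NULL, memory_order_release);
    r->cons = (r->cons + 1) % SKETCH_RING_SIZE;
    return page;
}
```

In this picture the TX-completion CPU would call sketch_ring_produce()
and the RX CPU would call sketch_ring_consume() when refilling, falling
back to the page allocator when it returns NULL.  Neither side writes
the other's index, which avoids bouncing an index cache line between
the two CPUs.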

Show adapter(s) (ixgbe2 mlx5p2) statistics (ONLY that changed!)
Ethtool(ixgbe2  ) stat:    810562253 (    810,562,253) <= tx_bytes /sec
Ethtool(ixgbe2  ) stat:    864600261 (    864,600,261) <= tx_bytes_nic /sec
Ethtool(ixgbe2  ) stat:     13509371 (     13,509,371) <= tx_packets /sec
Ethtool(ixgbe2  ) stat:     13509380 (     13,509,380) <= tx_pkts_nic /sec
Ethtool(mlx5p2  ) stat:     36827369 (     36,827,369) <= rx_64_bytes_phy /sec
Ethtool(mlx5p2  ) stat:   2356953271 (  2,356,953,271) <= rx_bytes_phy /sec
Ethtool(mlx5p2  ) stat:     23313782 (     23,313,782) <= rx_discards_phy /sec
Ethtool(mlx5p2  ) stat:         3019 (          3,019) <= rx_out_of_buffer /sec
Ethtool(mlx5p2  ) stat:     36827395 (     36,827,395) <= rx_packets_phy /sec
Ethtool(mlx5p2  ) stat:   2356924099 (  2,356,924,099) <= rx_prio0_bytes /sec
Ethtool(mlx5p2  ) stat:     13513560 (     13,513,560) <= rx_prio0_packets /sec
Ethtool(mlx5p2  ) stat:    810820253 (    810,820,253) <= rx_vport_unicast_bytes /sec
Ethtool(mlx5p2  ) stat:     13513672 (     13,513,672) <= rx_vport_unicast_packets /sec
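
As a sanity check on these counters, two small helpers (hypothetical
names, not part of any tool used above) that compute the average
bytes per packet and the theoretical Ethernet packet rate for a given
link speed and frame size:

```c
#include <stdint.h>

/* Average bytes per packet from two per-second counters. */
double sketch_bytes_per_packet(uint64_t bytes_per_sec,
                               uint64_t pkts_per_sec)
{
    return pkts_per_sec
        ? (double)bytes_per_sec / (double)pkts_per_sec
        : 0.0;
}

/* Maximum packets/sec on Ethernet. Besides the frame itself, each
 * packet occupies 20 more bytes on the wire: 7 preamble + 1 start
 * frame delimiter + 12 inter-frame gap. */
double sketch_line_rate_pps(double link_bits_per_sec,
                            unsigned frame_bytes)
{
    return link_bits_per_sec / ((frame_bytes + 20u) * 8.0);
}
```

Fed with the numbers above, tx_bytes / tx_packets comes out at about
60 bytes and tx_bytes_nic / tx_pkts_nic at about 64 bytes, which is
consistent with 64-byte frames where the software counter leaves out
the 4-byte FCS (that reading is an assumption).  For 64-byte frames a
10G link tops out at about 14.88Mpps, so the 13.5Mpps on ixgbe is
roughly 91% of line rate.  On the mlx5 side, rx_discards_phy plus
rx_prio0_packets adds up to almost exactly rx_packets_phy.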

If I only disabled the mlx5 page cache (no page_pool), then single-flow
performance was 6Mpps, and if I started two flows the collective
performance dropped to 4Mpps, because we hit the page allocator lock
(further negative scaling occurs).

If I keep the mlx5 cache, I see between 7 and 11Mpps... which varies
depending on the ixgbe TX-ring size and DMA-completion interrupt levels.

For AF_XDP, we just register another free/completion callback function.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer