Message-ID: <19ca7cd9a878b2ecc593cd2838b8ae0412463593.camel@mellanox.com>
Date: Fri, 31 May 2019 18:06:01 +0000
From: Saeed Mahameed <saeedm@...lanox.com>
To: "barbette@....se" <barbette@....se>,
"brouer@...hat.com" <brouer@...hat.com>
CC: "toke@...hat.com" <toke@...hat.com>,
"xdp-newbies@...r.kernel.org" <xdp-newbies@...r.kernel.org>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Leon Romanovsky <leonro@...lanox.com>,
Tariq Toukan <tariqt@...lanox.com>
Subject: Re: Bad XDP performance with mlx5
On Fri, 2019-05-31 at 18:18 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@....se>
> wrote:
>
> > CCing mlx5 maintainers and committers of bce2b2b. TL;DR: there is a
> > huge CPU increase on CX5 when introducing an XDP program.
> >
> > See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> > around 0:40. We're talking something like 15% while it's near 0 for
> > other drivers. The machine is a recent Skylake. For us it makes XDP
> > unusable. Is that a known problem?
>
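For reference, the thread does not show the actual XDP program used in the
test; a baseline measurement of this kind is typically done with a minimal
pass-through program along these lines (a sketch, compiled with
"clang -O2 -target bpf" and attached with
"ip link set dev <dev> xdp obj xdp_pass.o sec xdp"):

/* Hypothetical minimal program -- not the one used in the video above.
 * It does no packet work at all, so any CPU cost it adds comes from the
 * driver's XDP setup (memory model, DMA mapping), not from BPF itself. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_pass_prog(struct xdp_md *ctx)
{
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";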
The question is: at the same packet rate/bandwidth, do you see higher
CPU utilization on mlx5 compared to other drivers? You have to compare
apples to apples.
> I have a similar test setup, and I can reproduce. I have found the
> root cause, see below. But on my system it was even worse: with an
> XDP_PASS program loaded and iperf (6 parallel TCP flows) I would see
> 100% CPU usage and a total of 83.3 Gbits/sec. In the non-XDP case, I
> saw 58% CPU (43% idle) and a total of 89.7 Gbits/sec.
>
>
> > I wonder if it doesn't simply come from mlx5/en_main.c:
> >     rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
> >
>
> Nope, that is not the problem.
>
> > Which would be in line with my observation that memory access seems
> > heavier. I guess this is for the XDP_TX case.
> >
> > If this is indeed the problem, any chance we can:
> > a) detect automatically that a program will not return XDP_TX (I'm
> > not quite sure about what the BPF limitations allow to guess in
> > advance) or
> > b) add a flag such as XDP_FLAGS_NO_TX to avoid such a hit in
> > performance when not needed?
>
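On option (b): XDP_FLAGS_NO_TX does not exist today; the sketch below only
shows where such a flag would plug in next to the existing XDP_FLAGS_* attach
flags from <linux/if_link.h>, using libbpf's bpf_set_link_xdp_fd(). The flag
name and its bit value are assumptions for illustration only:

#include <net/if.h>           /* if_nametoindex() */
#include <linux/if_link.h>    /* existing XDP_FLAGS_* definitions */
#include <bpf/libbpf.h>       /* bpf_set_link_xdp_fd() */

/* Hypothetical flag proposed above -- NOT part of the kernel UAPI. */
#define XDP_FLAGS_NO_TX_HYPOTHETICAL	(1U << 5)

static int attach_pass_only(const char *ifname, int prog_fd)
{
	int ifindex = if_nametoindex(ifname);

	if (!ifindex)
		return -1;
	/* With such a flag the driver could keep its two-frames-per-page
	 * RX scheme, knowing no frame will be recycled for XDP_TX. */
	return bpf_set_link_xdp_fd(ifindex, prog_fd,
				   XDP_FLAGS_DRV_MODE |
				   XDP_FLAGS_NO_TX_HYPOTHETICAL);
}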
> This was kind of hard to root-cause, but I solved it by increasing the
> TCP socket buffer size used by the iperf tool, like this (please
> reproduce):
>
> $ iperf -s --window 4M
> ------------------------------------------------------------
> Server listening on TCP port 5001
> TCP window size: 416 KByte (WARNING: requested 4.00 MByte)
> ------------------------------------------------------------
>
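For reference, iperf's --window maps to the socket buffer size; the rough
hand-rolled equivalent is a setsockopt(SO_RCVBUF) before listen(), as in this
sketch (the port and buffer size simply mirror the iperf run above):

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Roughly what "iperf -s --window 4M" asks the kernel for: a larger
 * receive buffer, and therefore a larger advertised TCP window.  The
 * kernel may clamp the value (see net.core.rmem_max), which is what
 * the "WARNING: requested 4.00 MByte" line above is about. */
static int listener_with_big_window(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	int rcvbuf = 4 * 1024 * 1024;
	struct sockaddr_in addr = {
		.sin_family      = AF_INET,
		.sin_port        = htons(5001),
		.sin_addr.s_addr = htonl(INADDR_ANY),
	};

	if (fd < 0)
		return -1;
	/* Must be set before listen() so window scaling is negotiated
	 * with the larger buffer in mind. */
	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, 1) < 0)
		return -1;
	return fd;
}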
> Given I could reproduce, I took a closer look at perf record/report
> stats, and it was actually quite clear that this was related to
> stalling on getting pages from the page allocator
> (get_page_from_freelist and free_pcppages_bulk in the top #6 function
> calls).
>
> Using my tool: ethtool_stats.pl
>
> https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
> It was clear that the mlx5 driver page-cache was not working:
>  Ethtool(mlx5p1 ) stat:  6653761 (  6,653,761) <= rx_cache_busy /sec
>  Ethtool(mlx5p1 ) stat:  6653732 (  6,653,732) <= rx_cache_full /sec
>  Ethtool(mlx5p1 ) stat:   669481 (    669,481) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1 ) stat:        1 (          1) <= rx_congst_umr /sec
>  Ethtool(mlx5p1 ) stat:  7323230 (  7,323,230) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1 ) stat:     1034 (      1,034) <= rx_discards_phy /sec
>  Ethtool(mlx5p1 ) stat:  7323230 (  7,323,230) <= rx_packets /sec
>  Ethtool(mlx5p1 ) stat:  7324244 (  7,324,244) <= rx_packets_phy /sec
>
> While the non-XDP case looked like this:
>  Ethtool(mlx5p1 ) stat:   298929 (    298,929) <= rx_cache_busy /sec
>  Ethtool(mlx5p1 ) stat:   298971 (    298,971) <= rx_cache_full /sec
>  Ethtool(mlx5p1 ) stat:  3548789 (  3,548,789) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1 ) stat:  7695476 (  7,695,476) <= rx_csum_complete /sec
>  Ethtool(mlx5p1 ) stat:  7695476 (  7,695,476) <= rx_packets /sec
>  Ethtool(mlx5p1 ) stat:  7695169 (  7,695,169) <= rx_packets_phy /sec
> Manual consistency calc: 7695476 - ((3548789*2)+(298971*2)) = -44
>
> With the increased TCP window size, the mlx5 driver cache is working
> better, but not optimally, see below. I'm getting 88.0 Gbits/sec with
> 68% CPU usage.
>  Ethtool(mlx5p1 ) stat:   894438 (    894,438) <= rx_cache_busy /sec
>  Ethtool(mlx5p1 ) stat:   894453 (    894,453) <= rx_cache_full /sec
>  Ethtool(mlx5p1 ) stat:  6638518 (  6,638,518) <= rx_cache_reuse /sec
>  Ethtool(mlx5p1 ) stat:        6 (          6) <= rx_congst_umr /sec
>  Ethtool(mlx5p1 ) stat:  7532983 (  7,532,983) <= rx_csum_unnecessary /sec
>  Ethtool(mlx5p1 ) stat:      164 (        164) <= rx_discards_phy /sec
>  Ethtool(mlx5p1 ) stat:  7532983 (  7,532,983) <= rx_packets /sec
>  Ethtool(mlx5p1 ) stat:  7533193 (  7,533,193) <= rx_packets_phy /sec
> Manual consistency calc: 7532983 - (6638518 + 894453) = 12
>
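To spell out the two "manual consistency calc" lines above: every page the
driver consumes shows up as either rx_cache_reuse or rx_cache_busy, and a
page carries two frames without XDP but only one with XDP, so the cache
counters should roughly reconstruct rx_packets. A small sketch using the
numbers from the stats above:

#include <stdio.h>

/* delta = rx_packets - pages_consumed * frames_per_page, where
 * pages_consumed/sec = rx_cache_reuse + rx_cache_busy.  A small delta
 * means the counters are consistent with the assumed memory model. */
static long packet_delta(long rx_packets, long cache_reuse,
			 long cache_busy, int frames_per_page)
{
	return rx_packets - (cache_reuse + cache_busy) * frames_per_page;
}

int main(void)
{
	/* non-XDP: two frames per page */
	printf("non-XDP delta: %ld\n",
	       packet_delta(7695476, 3548789, 298971, 2));	/* -44 */
	/* XDP with the larger TCP window: one frame per page */
	printf("XDP delta:     %ld\n",
	       packet_delta(7532983, 6638518, 894453, 1));	/* 12 */
	return 0;
}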
> To understand why this is happening, you first have to know that the
> difference is between the two RX-memory modes used by mlx5 for non-XDP
> vs XDP. With non-XDP, two frames are stored per memory-page, while for
> XDP only a single frame per page is used. The packets available in the
> RX-rings are actually the same, as the ring sizes are non-XDP=512 vs.
> XDP=1024.
>
Thanks Jesper! This was a well put together explanation.

I want to point out that some other drivers are using the alloc_skb APIs,
which provide a good caching mechanism that is even better than the mlx5
internal one (which uses the alloc_page APIs directly). This can explain the
difference, and your explanation shows the root cause of the higher CPU
utilization with XDP on mlx5, since the mlx5 page cache works at half of its
capacity when XDP is enabled.

Now, do we really need to keep this page-per-packet scheme in mlx5 when XDP
is enabled? I think it is time to drop that.
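To connect the counters above with that description, here is a much
simplified userspace model of such a driver-internal page cache: a fixed
ring of recently used pages that can only be recycled if nobody else still
holds a reference. Names, sizes and the exact recycling policy are
illustrative, not the actual mlx5 implementation:

#include <stdbool.h>
#include <stddef.h>

#define CACHE_SIZE 256			/* illustrative, not the mlx5 value */

struct model_page { int refcount; };

struct page_cache {
	struct model_page *slots[CACHE_SIZE];
	unsigned int head, tail;
	/* counters mirroring the ethtool stats quoted earlier */
	unsigned long reuse, busy, full;
};

/* RX path asks for a page: try to recycle one before hitting the
 * page allocator. */
static struct model_page *cache_get(struct page_cache *c)
{
	struct model_page *p;

	if (c->head == c->tail)
		return NULL;		/* cache empty */
	p = c->slots[c->tail % CACHE_SIZE];
	if (p->refcount > 1) {
		c->busy++;		/* rx_cache_busy: page still held, e.g. by a TCP receive queue */
		return NULL;		/* caller falls back to the page allocator */
	}
	c->tail++;
	c->reuse++;			/* rx_cache_reuse: cheap recycling */
	return p;
}

/* Completion path hands a page back to the cache. */
static bool cache_put(struct page_cache *c, struct model_page *p)
{
	if (c->head - c->tail >= CACHE_SIZE) {
		c->full++;		/* rx_cache_full: no room, page returns to the allocator */
		return false;
	}
	c->slots[c->head % CACHE_SIZE] = p;
	c->head++;
	return true;
}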
> I believe the real issue is that TCP uses the SKB->truesize (based on
> frame size) for its memory pressure and window calculations, which is
> why increasing the window size manually solved the issue.
>
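A back-of-the-envelope illustration of that truesize effect, assuming a
~1500-byte MTU, 4 KiB pages and a made-up skb overhead. All numbers here are
illustrative; only the half-page vs. full-page split comes from the
explanation above:

#include <stdio.h>

/* TCP charges each received segment against the socket receive buffer
 * by skb->truesize (payload plus the whole buffer backing it), not by
 * payload bytes.  With a full page per frame (XDP mode) truesize
 * roughly doubles compared to half a page per frame (non-XDP), so the
 * same receive buffer admits far fewer in-flight segments -- unless
 * the window/buffer is enlarged, as in the iperf --window 4M run. */
int main(void)
{
	const int payload  = 1448;		/* payload per segment (illustrative) */
	const int skb_over = 256;		/* skb metadata overhead (illustrative) */
	const int rcvbuf   = 416 * 1024;	/* what iperf got by default above */

	int truesize_half_page = 2048 + skb_over;	/* non-XDP */
	int truesize_full_page = 4096 + skb_over;	/* XDP */

	printf("segments fitting in rcvbuf, non-XDP: %d\n",
	       rcvbuf / truesize_half_page);
	printf("segments fitting in rcvbuf, XDP:     %d\n",
	       rcvbuf / truesize_full_page);
	printf("effective window, non-XDP: %d bytes\n",
	       rcvbuf / truesize_half_page * payload);
	printf("effective window, XDP:     %d bytes\n",
	       rcvbuf / truesize_full_page * payload);
	return 0;
}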