Message-ID: <20190531181817.34039c9f@carbon>
Date: Fri, 31 May 2019 18:18:17 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Tom Barbette <barbette@....se>
Cc: xdp-newbies@...r.kernel.org,
Toke Høiland-Jørgensen
<toke@...hat.com>, Saeed Mahameed <saeedm@...lanox.com>,
Leon Romanovsky <leonro@...lanox.com>,
Tariq Toukan <tariqt@...lanox.com>, brouer@...hat.com,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Bad XDP performance with mlx5
On Fri, 31 May 2019 08:51:43 +0200 Tom Barbette <barbette@....se> wrote:
> CCing mlx5 maintainers and commiters of bce2b2b. TLDK: there is a huge
> CPU increase on CX5 when introducing a XDP program.
>
> See https://www.youtube.com/watch?v=o5hlJZbN4Tk&feature=youtu.be
> around 0:40. We're talking something like 15% while it's near 0 for
> other drivers. The machine is a recent Skylake. For us it makes XDP
> unusable. Is that a known problem?
I have a similar test setup, and I can reproduce this. I have found the
root cause; see below. On my system it was even worse: with an
XDP_PASS program loaded and iperf (6 parallel TCP flows), I saw
100% CPU usage and a total of 83.3 Gbits/sec. In the non-XDP case, I saw
58% CPU (43% idle) and a total of 89.7 Gbits/sec.
> I wonder if it doesn't simply come from mlx5/en_main.c:
> rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
>
Nope, that is not the problem.
> Which would be inline from my observation that memory access seems
> heavier. I guess this is for the XDP_TX case.
>
> If this is indeed the problem. Any chance we can:
> a) detect automatically that a program will not return XDP_TX (I'm not
> quite sure about what the BPF limitations allow to guess in advance) or
> b) add a flag to such as XDP_FLAGS_NO_TX to avoid such hit in
> performance when not needed?
This was kind of hard to root-cause, but I solved it by increasing the TCP
window (socket buffer) size used by the iperf tool, like this (please reproduce):
$ iperf -s --window 4M
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 416 KByte (WARNING: requested 4.00 MByte)
------------------------------------------------------------
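The WARNING in the iperf output shows that the kernel granted less than the
requested 4 MByte. As a minimal sketch (not iperf's actual code), this is how
an application can request a receive buffer and read back what it got. It
assumes Linux, where the kernel doubles the requested SO_RCVBUF value for
bookkeeping overhead and caps the request at net.core.rmem_max. The reported
416 KByte is consistent with twice 212992 bytes, a common default for
rmem_max, but that is an assumption about this particular system.

```python
import socket

def request_rcvbuf(requested_bytes):
    """Ask for a receive buffer size and return what the kernel reports back."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # The kernel may silently clamp this request (on Linux: net.core.rmem_max).
        s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)

requested = 4 * 1024 * 1024  # same 4 MByte as "iperf -s --window 4M"
granted = request_rcvbuf(requested)
print(f"requested {requested} bytes, kernel reports {granted} bytes")
```

If the reported value is far below the request, the system-wide limit would
have to be raised before a larger window can take effect.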
Given I could reproduce, I took a closer look at perf record/report stats,
and it was actually quite clear that this was related to stalling on getting
pages from the page allocator (function calls top#6 get_page_from_freelist
and free_pcppages_bulk).
Using my tool: ethtool_stats.pl
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
It was clear that the mlx5 driver page-cache was not working:
Ethtool(mlx5p1 ) stat: 6653761 ( 6,653,761) <= rx_cache_busy /sec
Ethtool(mlx5p1 ) stat: 6653732 ( 6,653,732) <= rx_cache_full /sec
Ethtool(mlx5p1 ) stat: 669481 ( 669,481) <= rx_cache_reuse /sec
Ethtool(mlx5p1 ) stat: 1 ( 1) <= rx_congst_umr /sec
Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_csum_unnecessary /sec
Ethtool(mlx5p1 ) stat: 1034 ( 1,034) <= rx_discards_phy /sec
Ethtool(mlx5p1 ) stat: 7323230 ( 7,323,230) <= rx_packets /sec
Ethtool(mlx5p1 ) stat: 7324244 ( 7,324,244) <= rx_packets_phy /sec
While the non-XDP case looked like this:
Ethtool(mlx5p1 ) stat: 298929 ( 298,929) <= rx_cache_busy /sec
Ethtool(mlx5p1 ) stat: 298971 ( 298,971) <= rx_cache_full /sec
Ethtool(mlx5p1 ) stat: 3548789 ( 3,548,789) <= rx_cache_reuse /sec
Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_csum_complete /sec
Ethtool(mlx5p1 ) stat: 7695476 ( 7,695,476) <= rx_packets /sec
Ethtool(mlx5p1 ) stat: 7695169 ( 7,695,169) <= rx_packets_phy /sec
Manual consistency calc: 7695476-((3548789*2)+(298971*2)) = -44
With the increased TCP window size, the mlx5 driver cache is working better,
but not optimally, see below. I'm getting 88.0 Gbits/sec with 68% CPU usage.
Ethtool(mlx5p1 ) stat: 894438 ( 894,438) <= rx_cache_busy /sec
Ethtool(mlx5p1 ) stat: 894453 ( 894,453) <= rx_cache_full /sec
Ethtool(mlx5p1 ) stat: 6638518 ( 6,638,518) <= rx_cache_reuse /sec
Ethtool(mlx5p1 ) stat: 6 ( 6) <= rx_congst_umr /sec
Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_csum_unnecessary /sec
Ethtool(mlx5p1 ) stat: 164 ( 164) <= rx_discards_phy /sec
Ethtool(mlx5p1 ) stat: 7532983 ( 7,532,983) <= rx_packets /sec
Ethtool(mlx5p1 ) stat: 7533193 ( 7,533,193) <= rx_packets_phy /sec
Manual consistency calc: 7532983-(6638518+894453) = 12
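The manual calculations can be repeated for all three measurements with a
small Python sketch. The function names are hypothetical, and treating
reuse/(reuse+full) as a rough cache hit ratio is an interpretation of the
counters, not something the driver reports. frames_per_page is 2 for non-XDP
and 1 for XDP.

```python
def unaccounted_packets(rx_packets, cache_reuse, cache_full, frames_per_page):
    """Packets/sec not explained by the page-cache counters (should be near 0)."""
    return rx_packets - frames_per_page * (cache_reuse + cache_full)

def reuse_ratio(cache_reuse, cache_full):
    """Rough fraction of pages served from the driver page-cache (assumption)."""
    return cache_reuse / (cache_reuse + cache_full)

# (rx_packets, rx_cache_reuse, rx_cache_full, frames_per_page) from the stats above
samples = {
    "XDP, default window": (7323230, 669481, 6653732, 1),
    "non-XDP":             (7695476, 3548789, 298971, 2),
    "XDP, --window 4M":    (7532983, 6638518, 894453, 1),
}

for name, (pkts, reuse, full, fpp) in samples.items():
    diff = unaccounted_packets(pkts, reuse, full, fpp)
    ratio = reuse_ratio(reuse, full)
    print(f"{name:22s} unaccounted={diff:4d}  reuse ratio={ratio:.1%}")
```

Under this reading, the page-cache serves roughly 9% of pages in the XDP case
with the default window, versus roughly 92% for non-XDP and 88% for XDP with
the larger window.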
To understand why this is happening, you first have to know that the
difference lies in the two RX-memory modes mlx5 uses for non-XDP vs XDP.
With non-XDP, two frames are stored per memory page, while for XDP only
a single frame per page is used. The number of packets available in the
RX-rings is actually the same, as the ring sizes are non-XDP=512 vs.
XDP=1024 (512 * 2 frames per page = 1024 * 1 frame per page).
I believe the real issue is that TCP uses the SKB->truesize (based on frame
size) for its memory-pressure and window calculations, which is why
increasing the window size manually solved the issue.
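A simplified Python model of that reasoning follows. It assumes 4 KiB pages
and approximates truesize as page size divided by frames per page. The real
SKB->truesize also includes skb overhead, and the helper names are
hypothetical. The 416 KByte budget is taken from the iperf output only as an
example socket budget.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def ring_frames(ring_entries, frames_per_page):
    """Frames the RX ring can hold."""
    return ring_entries * frames_per_page

def approx_truesize(frames_per_page):
    """Crude truesize estimate: each frame is charged its share of a page."""
    return PAGE_SIZE // frames_per_page

def frames_in_budget(budget_bytes, frames_per_page):
    """How many frames fit before the socket memory budget is exhausted."""
    return budget_bytes // approx_truesize(frames_per_page)

budget = 416 * 1024  # example budget in bytes

print(ring_frames(512, 2), ring_frames(1024, 1))  # 1024 1024
print(frames_in_budget(budget, 2))                # non-XDP: 208
print(frames_in_budget(budget, 1))                # XDP: 104
```

In this model the rings hold the same number of frames, but each XDP frame is
charged about twice as much memory, so the same socket budget covers only half
as many in-flight packets.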
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer