Date:   Thu, 8 Sep 2016 11:52:25 +0200
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Or Gerlitz <gerlitz.or@...il.com>
Cc:     Saeed Mahameed <saeedm@...lanox.com>,
        iovisor-dev <iovisor-dev@...ts.iovisor.org>,
        Linux Netdev List <netdev@...r.kernel.org>,
        Tariq Toukan <tariqt@...lanox.com>,
        Brenden Blanco <bblanco@...mgrid.com>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Tom Herbert <tom@...bertland.com>,
        Martin KaFai Lau <kafai@...com>,
        Daniel Borkmann <daniel@...earbox.net>,
        Eric Dumazet <edumazet@...gle.com>,
        Jamal Hadi Salim <jhs@...atatu.com>,
        Rana Shahout <ranas@...lanox.com>, brouer@...hat.com
Subject: Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs
 support

On Thu, 8 Sep 2016 12:31:47 +0300
Or Gerlitz <gerlitz.or@...il.com> wrote:

> On Thu, Sep 8, 2016 at 10:38 AM, Jesper Dangaard Brouer
> <brouer@...hat.com> wrote:
> > On Wed, 7 Sep 2016 23:55:42 +0300
> > Or Gerlitz <gerlitz.or@...il.com> wrote:
> >  
> >> On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@...lanox.com> wrote:  
> >> > From: Rana Shahout <ranas@...lanox.com>
> >> >
> >> > Add support for the BPF_PROG_TYPE_PHYS_DEV hook in the mlx5e driver.
> >> >
> >> > When XDP is on, we make sure to change the channels' RQ type to
> >> > MLX5_WQ_TYPE_LINKED_LIST rather than the "striding RQ" type, to
> >> > ensure "page per packet".
> >> >
> >> > On XDP set, we fail if HW LRO is enabled and ask the user to turn it
> >> > off.  Since HW LRO is on by default on ConnectX4-LX, this will be
> >> > annoying, but we prefer not to force LRO off from the XDP set function.
> >> >
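A minimal sketch of what such an XDP-set check could look like; the
struct, field, and flag names below (my_priv, lro_en, rq_wq_type,
MY_WQ_TYPE_*) are illustrative assumptions, not the actual mlx5e
identifiers:

/* Illustrative sketch only; names are assumptions, not mlx5e's actual code. */
#include <linux/bpf.h>
#include <linux/errno.h>
#include <linux/netdevice.h>

struct my_priv {
	struct net_device *netdev;
	bool lro_en;                    /* is HW LRO currently enabled? */
	int  rq_wq_type;                /* striding RQ vs. linked-list RQ */
};

#define MY_WQ_TYPE_STRIDING     0
#define MY_WQ_TYPE_LINKED_LIST  1       /* one page per packet */

static int my_xdp_set(struct my_priv *priv, struct bpf_prog *prog)
{
	/* Refuse to attach XDP while HW LRO is on; ask the user to turn
	 * it off (e.g. via ethtool) instead of forcing it off here. */
	if (prog && priv->lro_en) {
		netdev_warn(priv->netdev, "LRO is enabled, disable it to use XDP\n");
		return -EINVAL;
	}

	/* XDP needs "page per packet", so run on the linked-list RQ type
	 * rather than striding RQ while a program is attached. */
	priv->rq_wq_type = prog ? MY_WQ_TYPE_LINKED_LIST : MY_WQ_TYPE_STRIDING;

	/* A full channel close/open is only required when toggling XDP
	 * on/off; a pure program exchange is handled per-RQ (see below). */
	return 0;
}
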
> >> > Full channels reset (close/open) is required only when setting XDP
> >> > on/off.
> >> >
> >> > When XDP set is called just to exchange programs, we update each RQ's
> >> > xdp program on the fly.  For synchronization with the current data
> >> > path RX activity of that RQ, we temporarily disable the RQ, make sure
> >> > the RX path is not running, quickly update it and re-enable the RQ.
> >> > For that we do (sketched below):
> >> >         - rq.state = disabled
> >> >         - napi_synchronize
> >> >         - xchg(rq->xdp_prg)
> >> >         - rq.state = enabled
> >> >         - napi_schedule // Just in case we've missed an IRQ
> >> >
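A minimal C sketch of that swap sequence; the struct layout and the
MY_RQ_STATE_ENABLED flag are illustrative assumptions, while
napi_synchronize(), napi_schedule(), xchg() and bpf_prog_put() are the
real kernel primitives named above:

/* Hypothetical, simplified model of the per-RQ program exchange above. */
#include <linux/atomic.h>
#include <linux/bitops.h>
#include <linux/bpf.h>
#include <linux/netdevice.h>

struct my_rq {
	unsigned long state;            /* RQ state flags */
	struct bpf_prog *xdp_prog;      /* currently attached XDP program */
	struct napi_struct *napi;       /* NAPI context servicing this RQ */
};

#define MY_RQ_STATE_ENABLED 0

static void my_rq_swap_xdp_prog(struct my_rq *rq, struct bpf_prog *new_prog)
{
	struct bpf_prog *old_prog;

	/* 1. Mark the RQ disabled so the RX path stops polling it. */
	clear_bit(MY_RQ_STATE_ENABLED, &rq->state);

	/* 2. Wait for any in-flight NAPI poll on this RQ to finish. */
	napi_synchronize(rq->napi);

	/* 3. Atomically exchange the XDP program pointer. */
	old_prog = xchg(&rq->xdp_prog, new_prog);
	if (old_prog)
		bpf_prog_put(old_prog);

	/* 4. Re-enable the RQ ... */
	set_bit(MY_RQ_STATE_ENABLED, &rq->state);

	/* 5. ... and kick NAPI in case an IRQ was missed while disabled. */
	napi_schedule(rq->napi);
}
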
> >> > Packet rate performance testing was done with pktgen sending 64B
> >> > packets on the TX side, comparing a TC drop action on the RX side
> >> > against XDP fast drop.
> >> >
> >> > CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >> >
> >> > Comparison is done between:
> >> >         1. Baseline, Before this patch with TC drop action
> >> >         2. This patch with TC drop action
> >> >         3. This patch with XDP RX fast drop
> >> >
> >> > Streams    Baseline(TC drop)    TC drop    XDP fast Drop
> >> > --------------------------------------------------------------
> >> > 1           5.51Mpps            5.14Mpps     13.5Mpps  
> >>
> >> This (13.5 Mpps) is less than 50% of the result we presented @ the
> >> XDP summit, which was obtained by Rana. Please see if/how much this
> >> grows if you use more sender threads, but have all of them xmit the
> >> same stream/flows, so we're on one ring. That (XDP with a single RX
> >> ring getting packets from N remote TX rings) would be your canonical
> >> base-line for any further numbers.  
> >
> > Well, my experiments with this hardware (mlx5/CX4 at 50Gbit/s) show
> > that you should be able to reach 23Mpps on a single CPU.  This is
> > an XDP-drop simulation with order-0 pages being recycled through my
> > page_pool code, plus avoiding the cache misses (notice you are using a
> > CPU E5-2680 with DDIO, thus you should only see an L3 cache miss).  
> 
> so this takes us up from 13M to 23M, good.

Notice the 23Mpps was a crude hack test to determine the maximum
achievable performance.  This is our performance target; once we get
_close_ to that, we are happy and stop optimizing.

> Could you explain why the move from order-3 to order-0 is hurting the
> performance so much (a drop from 32M to 23M)?  Is there any way we can
> overcome that?

It is all going to be in the details.

Be careful when looking at these numbers: 23M to 32M sounds like a
huge deal, but the performance difference in nanoseconds is actually
not that large; it is only around 12ns more that we have to save.

(1/(23*10^6)-1/(32*10^6))*10^9 = 12.22
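
The same per-packet budget arithmetic as a tiny stand-alone C snippet,
purely for illustration (user-space, not kernel code):

/* Per-packet time budget gap between 23Mpps and 32Mpps. */
#include <stdio.h>

int main(void)
{
	double ns_at_23mpps = 1e9 / 23e6;       /* ~43.48 ns per packet */
	double ns_at_32mpps = 1e9 / 32e6;       /* ~31.25 ns per packet */

	/* Prints roughly 12.2 ns: the extra time per packet we would have
	 * to shave off to go from 23Mpps to 32Mpps. */
	printf("budget gap: %.2f ns/packet\n", ns_at_23mpps - ns_at_32mpps);
	return 0;
}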

> > The 23Mpps number looks like some HW limitation, as the increase was  
> 
> not HW, I think. As I said, Rana got 32M with striding RQ when she was
> using order-3 (or did we use order-5?)

It was order-5.

We likely need some HW tuning parameter (like with mlx4) if you want to
go past the 23Mpps mark.

 
> > is not proportional to the page-allocator overhead I removed (and the
> > CPU freq starts to decrease).  I also did scaling tests to more CPUs,
> > which showed it scaled up to 40Mpps (you reported 45M).  And at the
> > PHY RX level I see 60Mpps (the 50G max is 74Mpps).  

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer
