netdev - Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC behaviors today)

Date:   Thu, 7 Dec 2017 09:33:43 +0100
From:   Jesper Dangaard Brouer <brouer@...hat.com>
To:     Matthias Tafelmeier <matthias.tafelmeier@....net>
Cc:     Dave Taht <dave@...t.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Joel Wirāmu Pauling <joel@...ertia.net>,
        David Ahern <dsa@...ulusnetworks.com>,
        Tariq Toukan <tariqt@...lanox.com>, brouer@...hat.com,
        Björn Töpel <bjorn.topel@...el.com>
Subject: Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC
 behaviors today)

(Removed bloat-lists to avoid cross ML-posting)

On Mon, 4 Dec 2017 18:19:09 +0100 Matthias Tafelmeier <matthias.tafelmeier@....net> wrote:

> Hello,
> > Scaling up to more CPUs and TCP-streams, Tariq[1] and I have shown the
> > Linux kernel network stack scales to 94Gbit/s (linerate minus overhead).
> > But when the driver's page-recycler fails, we hit bottlenecks in the
> > page-allocator, which cause negative scaling to around 43Gbit/s.
> >
> > [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
> >
> > Linux has for a _long_ time been doing 10Gbit/s TCP-stream easily, on
> > a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating packets,
> > but over the last couple of years the network stack has been optimized
> > (with UDP workloads), and as a result we can do 10G without TSO/GRO on
> > a single-CPU.  This is "only" 812Kpps with MTU size frames.
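
As a sanity check on the 812Kpps figure: assuming standard Ethernet
framing (1500-byte MTU, 14-byte header, 4-byte FCS, 8-byte preamble/SFD
and 12-byte inter-frame gap, none of which is spelled out in the quoted
text), each frame occupies 1538 bytes on the wire.  A small Python
sketch of that arithmetic:

```python
# Per-frame Ethernet overhead in bytes (standard values, assumed here).
ETH_HEADER = 14       # dst MAC + src MAC + EtherType
ETH_FCS = 4           # frame check sequence
PREAMBLE_SFD = 8      # preamble + start-of-frame delimiter
INTER_FRAME_GAP = 12  # minimum idle time between frames, in byte times


def packets_per_second(link_bits_per_sec, mtu_bytes):
    """Max frames/sec on the wire when every frame carries a full MTU."""
    wire_bytes = (mtu_bytes + ETH_HEADER + ETH_FCS
                  + PREAMBLE_SFD + INTER_FRAME_GAP)
    return link_bits_per_sec / (wire_bytes * 8)


pps_10g = packets_per_second(10e9, 1500)
print(f"{pps_10g / 1e3:.1f} Kpps")  # 812.7 Kpps
```
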
> 
> Cannot find the reference anymore, but there was once some workshop held
> by you during some netdev where you were stating that you're practically
> in rigorous exchange with NIC vendors as to having them tremendously
> increase the RX/TX rings(queues) numbers.

You are mis-quoting me. I have not recommended tremendously increasing
the RX/TX rings(queues) numbers.  Actually, we should likely decrease
the number of RX-rings, per recommendation of Eric Dumazet[1], to
increase the chance of packet aggregation/bulking during the NAPI-loop,
and use something like CPUMAP[2] to re-distribute load across CPUs.

[1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
[2] https://git.kernel.org/torvalds/c/452606d6c9cd

You might have heard/seen me talk about increasing the ring queue size,
that is, the frames/pages available per RX-ring queue[3][4].  I generally
don't recommend increasing that too much, as it hurts cache-usage.  The
real reason it sometimes helps to increase the RX-ring size on the
Intel based NICs is that they intermix page-recycling into their
RX-ring; I have now added a counter for when that recycling fails[5].

[3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
[4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
[5] https://git.kernel.org/torvalds/c/86e23494222f3
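
To put a rough number on the cache-usage concern: assuming each RX
descriptor is backed by one 4 KiB page (an assumption; drivers differ,
and some split pages or use larger ones), the buffer memory an RX ring
cycles through grows linearly with the ring size.  A back-of-the-envelope
Python sketch, using illustrative ring sizes that are not taken from any
particular driver:

```python
PAGE_SIZE = 4096  # bytes per RX buffer page (assumed; driver dependent)


def rx_ring_footprint(ring_entries, page_size=PAGE_SIZE):
    """Bytes of packet-buffer memory one RX ring cycles through."""
    return ring_entries * page_size


for entries in (256, 512, 4096):  # illustrative sizes, not recommendations
    mib = rx_ring_footprint(entries) / (1024 * 1024)
    print(f"{entries:5d} entries -> {mib:5.1f} MiB per RX ring")
```

Under these assumptions a 4096-entry ring spans 16 MiB per queue, which
can easily exceed the cache available to the CPU servicing that queue,
so newly received frames are less likely to land in cache-hot memory.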

> Further, that there are hardly
> any limits to the number other than FPGA magic/physical HW - up to
> millions is viable was coined back then.  May I ask where this ended up?
> Wouldn't that be key for massive parallelization as well - with having a
> queue (producer), a CPU (consumer) - vice versa - per flow at the
> extreme? Did this end up in this SMART-NIC thingummy? The latter is
> rather targeted at XDP, no?

I do have future plans for (wanting drivers to support) dynamically
adding more RX-TX-queue-pairs.  The general idea is to have NIC HW
filter packets per application into a specific NIC queue number, which
can be mapped directly into an application (and I want a queue-pair to
allow the app to TX also).

I actually imagine that we can do the application steering via
XDP_REDIRECT. And by having the application register user-pages, like
AF_PACKET-V4, we can achieve zero-copy into userspace from XDP.  A
subtle trick here is that zero-copy only occurs if the RX-queue number
matches (XDP operating at driver ring level could know), meaning that
NIC HW filter setup could happen async (but premapping userspace pages
still has to happen upfront, before starting the app/socket).
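
The queue-match rule can be sketched as a purely conceptual Python
model.  This is not kernel or XDP API: the names AppSocket,
bound_rx_queue and delivery_mode are hypothetical, and the "copy"
fallback for non-matching queues is an assumption about what happens
before the HW filter is in place.

```python
from dataclasses import dataclass


@dataclass
class AppSocket:
    """Hypothetical application socket with pre-registered user pages."""
    bound_rx_queue: int    # RX queue whose ring is backed by the app's pages
    pages_premapped: bool  # must be True before traffic starts


def delivery_mode(frame_rx_queue: int, sock: AppSocket) -> str:
    """Decide how a redirected frame reaches the application (conceptual)."""
    if not sock.pages_premapped:
        raise RuntimeError("user pages must be mapped before starting the socket")
    if frame_rx_queue == sock.bound_rx_queue:
        return "zero-copy"  # frame already sits in the app's own pages
    return "copy"           # assumed fallback: copy into the app's pages


sock = AppSocket(bound_rx_queue=3, pages_premapped=True)

# Before the NIC HW filter is installed, the flow may land on any queue.
print(delivery_mode(frame_rx_queue=0, sock=sock))  # copy
# Once the filter steers the flow to queue 3, the same check gives zero-copy.
print(delivery_mode(frame_rx_queue=3, sock=sock))  # zero-copy
```

The point of the model is that the per-frame check only needs the RX
queue number, so the filter can be installed at any later time without
coordinating with the data path.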

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
