Message-ID: <CAA93jw4yOz2KoJGz4t9KqFrr=Zx+=N_r-c_W9iQCpGCBCgDVgg@mail.gmail.com>
Date:   Mon, 4 Dec 2017 09:00:41 -0800
From:   Dave Taht <dave.taht@...il.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>
Cc:     Dave Taht <dave@...t.net>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        bloat <bloat@...ts.bufferbloat.net>,
        Christina Jacob <christina.jacob.koikara@...il.com>,
        Joel Wirāmu Pauling <joel@...ertia.net>,
        "cerowrt-devel@...ts.bufferbloat.net" 
        <cerowrt-devel@...ts.bufferbloat.net>,
        David Ahern <dsa@...ulusnetworks.com>,
        Tariq Toukan <tariqt@...lanox.com>
Subject: Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC
 behaviors today)

Jesper:

I tend to deal with netdev by itself and never cross-post there, as the
bufferbloat.net servers (primarily to combat spam) mandate STARTTLS and
vger doesn't support it at all, which ends up raising davem's blood
pressure, something I'd rather not do.

But moving on...

On Mon, Dec 4, 2017 at 2:56 AM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
>
> On Sun, 03 Dec 2017 20:19:33 -0800 Dave Taht <dave@...t.net> wrote:
>
>> Changing the topic, adding bloat.
>
> Adding netdev, and also adjusting the topic to a rant about how the
> Linux kernel network stack is actually damn fast, and if you need
> something faster then XDP can solve your needs...
>
>> Joel Wirāmu Pauling <joel@...ertia.net> writes:
>>
>> > Just from a Telco/industry perspective.
>> >
>> > Everything in the DC has moved to SFP28 interfaces at 25Gbit as the
>> > server port of interconnect.  Everything ToR-wise is now QSFP28 -
>> > 100Gbit.  Mellanox X5 cards are the current hotness, and their
>> > offload enhancements (ASAP2 - which is sorta like DPDK on steroids)
>> > allow OVS flow rules to be programmed into the card.  We have a lot
>> > of customers chomping at the bit for that feature (disclaimer: I
>> > work for Nuage Networks, and we are working on an enhanced OVS to do
>> > just that) for NFV workloads.
>>
>> What Jesper's been working on for ages has been trying to get Linux's
>> PPS up for small packets, which last I heard was hovering at about
>> 4Gbit.
>
> I hope you made a typo here, Dave; the normal Linux kernel is
> definitely way beyond 4Gbit/s. You must have misunderstood something,
> maybe you meant 40Gbit/s? (which is also too low)

The context here was PPS for *non-GRO'd* TCP ACK packets, in the
further context of the increasingly epic "benefits of ack filtering"
thread on the bloat list, where for a 50x1 end-user asymmetry we were
seeing 90% fewer ACKs with the new sch_cake ack-filter code, and double
the throughput...

The kind of return traffic you see from data sent outside the DC, with
tons of flows.

What's that number?

>
> Scaling up to more CPUs and TCP streams, Tariq[1] and I have shown
> the Linux kernel network stack scales to 94Gbit/s (line rate minus
> overhead).  But when the driver's page-recycler fails, we hit
> bottlenecks in the page allocator that cause negative scaling, down to
> around 43Gbit/s.

So I divide 94 by ~22 and get ~4Gbit for ACKs. Or I look at PPS * 66
bytes. Or?
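
To spell out that back-of-the-envelope arithmetic (assuming 66-byte
pure ACKs and 1514-byte full frames; the sizes are my assumptions, not
Jesper's measurements):

/* If the stack moves ~94Gbit/s of 1514-byte frames, what does the same
 * packet rate look like when every packet is a 66-byte ACK? */
#include <stdio.h>

int main(void)
{
	double full_frame = 1514.0;  /* bytes, MTU-size frame  */
	double ack_frame  = 66.0;    /* bytes, minimal TCP ACK */
	double gbit       = 94.0;    /* quoted multi-CPU rate  */

	double pps      = gbit * 1e9 / 8.0 / full_frame; /* ~7.8 Mpps           */
	double ack_gbit = pps * ack_frame * 8.0 / 1e9;   /* same pps, ACK-sized */

	printf("%.1f Mpps of full frames -> %.1f Gbit/s if they were ACKs\n",
	       pps / 1e6, ack_gbit);   /* ~7.8 Mpps, ~4.1 Gbit/s */
	return 0;
}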

> [1] http://lkml.kernel.org/r/cef85936-10b2-5d76-9f97-cb03b418fd94@mellanox.com
>
> Linux has for a _long_ time been doing 10Gbit/s TCP streams easily,
> on a SINGLE CPU.  This is mostly thanks to TSO/GRO aggregating
> packets, but over the last couple of years the network stack has been
> optimized (using UDP workloads), and as a result we can do 10G without
> TSO/GRO on a single CPU.  This is "only" 812Kpps with MTU-size frames.

acks.

> It is important to NOTICE that I'm mostly talking about SINGLE-CPU
> performance.  But the Linux kernel scales very well to more CPUs, and
> you can scale this up, although we are starting to hit scalability
> issues in MM-land[1].
>
> I've also demonstrated that the netdev community has optimized the
> kernel's per-CPU processing power to around 2Mpps.  What does this
> really mean?  Well, with MTU-size packets, 812Kpps was 10Gbit/s, thus
> 25Gbit/s should be around 2Mpps.  That implies Linux can do 25Gbit/s
> on a single CPU without GRO (MTU-size frames).  Do you need more, I
> ask?

The benchmark I had in mind was, say, 100k flows going out over the internet,
and the characteristics of the ack flows on the return path.

>
>
>> The route table lookup is also really expensive on the main CPU.

To clarify the context here, I was asking specifically whether the
Mellanox X5 card does routing table offload or only switching.

> Well, it used to be very expensive.  Vincent Bernat wrote some
> excellent blog posts[2][3] on the recent improvements across kernel
> versions, and gave due credit to the people involved.
>
> [2] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv4-route-lookup-linux
> [3] https://vincent.bernat.im/en/blog/2017-performance-progression-ipv6-route-lookup-linux
>
> He measured a cost of around 25 to 35 nanoseconds per route lookup.
> My own recent measurements put fib_table_lookup() at 36.9 ns.

On Intel hw.
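
For scale, here's the per-packet time budget at 10GbE line rate with
minimum-size frames (my own back-of-the-envelope figures, not from
either of those measurements):

/* 64B frames plus 20B of preamble/SFD/IFG = 84 bytes on the wire. */
#include <stdio.h>

int main(void)
{
	double wire_bytes = 64 + 20;
	double pps        = 10e9 / 8.0 / wire_bytes; /* ~14.88 Mpps at 10Gbit/s */
	double budget_ns  = 1e9 / pps;               /* ~67 ns per packet       */

	printf("line rate: %.2f Mpps, per-packet budget: %.1f ns\n",
	       pps / 1e6, budget_ns);
	printf("a 36.9 ns fib_table_lookup() is %.0f%% of that budget\n",
	       36.9 / budget_ns * 100);              /* ~55% */
	return 0;
}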

>
>> Does this stuff offload the route table lookup also?
>
> If you have not heard, the netdev community has worked on something
> called XDP (eXpress Data Path).  This is a new layer in the network
> stack that basically operates at the same "layer"/level as DPDK.
> Thus, surprise, we get the same performance numbers as DPDK.  E.g. I
> can do 13.4 Mpps forwarding with ixgbe on a single CPU (more CPUs =
> 14.6 Mpps).
>
> We can actually use XDP for (software) offloading the Linux routing
> table.  There are two methods we are experimenting with:
>
> (1) externally monitor route changes from userspace and update BPF-maps
> to reflect this. That approach is already accepted upstream[4][5].  I'm
> measuring 9,513,746 pps per CPU with that approach.
>
> (2) add a bpf helper to simply call fib_table_lookup() from the XDP
> hook.  These are still experimental patches (credit to David Ahern),
> and I've measured 9,350,160 pps with this approach on a single CPU.
> Using more CPUs we hit 14.6Mpps (only 3 CPUs were used in that test).

Neat. Perhaps trying XDP on the itty-bitty routers I usually work on
would be a win; quad ARM cores are increasingly common there.

>
> [4] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_user.c
> [5] https://github.com/torvalds/linux/blob/master/samples/bpf/xdp_router_ipv4_kern.c
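
For anyone following along, here's a heavily simplified, hypothetical
sketch of what approach (1) boils down to: userspace watches rtnetlink
for route changes and keeps a BPF map in sync, and the XDP program just
looks up the destination and redirects.  The real samples [4][5] use an
LPM trie and also rewrite the L2 headers; the exact-match map, names
and sizes below are mine, not theirs.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

/* destination IPv4 address -> egress ifindex, kept in sync from userspace */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 4096);
	__type(key, __u32);
	__type(value, __u32);
} route_map SEC(".maps");

SEC("xdp")
int xdp_mini_router(struct xdp_md *ctx)
{
	void *data_end = (void *)(long)ctx->data_end;
	void *data     = (void *)(long)ctx->data;
	struct ethhdr *eth = data;
	struct iphdr *iph  = data + sizeof(*eth);
	__u32 *ifindex;

	/* bounds check required by the verifier */
	if ((void *)(iph + 1) > data_end)
		return XDP_PASS;
	if (eth->h_proto != bpf_htons(ETH_P_IP))
		return XDP_PASS;

	ifindex = bpf_map_lookup_elem(&route_map, &iph->daddr);
	if (!ifindex)
		return XDP_PASS;   /* no entry: fall back to the normal stack */

	/* a real router must also decrement the TTL and rewrite the MACs */
	return bpf_redirect(*ifindex, 0);
}

char LICENSE[] SEC("license") = "GPL";

The rough tradeoff between the two approaches: (1) duplicates FIB state
into a BPF map but needs no new kernel helper, while (2) reuses the
kernel FIB directly at the cost of a helper call per packet.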

thx very much for the update.

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer
> _______________________________________________
> Bloat mailing list
> Bloat@...ts.bufferbloat.net
> https://lists.bufferbloat.net/listinfo/bloat



-- 

Dave Täht
CEO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-669-226-2619
