Message-ID: <7aef39ab-decf-a152-2115-c1c07c8722a0@gmx.net>
Date: Thu, 7 Dec 2017 19:50:47 +0100
From: Matthias Tafelmeier <matthias.tafelmeier@....net>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Dave Taht <dave@...t.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Joel Wirāmu Pauling <joel@...ertia.net>,
David Ahern <dsa@...ulusnetworks.com>,
Tariq Toukan <tariqt@...lanox.com>,
Björn Töpel <bjorn.topel@...el.com>
Subject: Re: [Bloat] Linux network is damn fast, need more use XDP (Was: DC
behaviors today)
That's the discussion I meant:
https://www.youtube.com/watch?v=vsjxgOpv1n8
My sincere apologies should I have twisted your words in any respect. On top
of that, I haven't rewatched it; I'm only putting it here for completeness' sake.
> You are mis-quoting me. I have not recommended tremendously increasing
> the RX/TX rings(queues) numbers. Actually, we should likely decrease
> number of RX-rings, per recommendation of Eric Dumazet[1], to increase
> the chance of packet aggregation/bulking during NAPI-loop. And use
> something like CPUMAP[2] to re-distribute load on CPUs.
>
> [1] https://www.netdevconf.org/2.1/papers/BusyPollingNextGen.pdf
> [2] https://git.kernel.org/torvalds/c/452606d6c9cd
That certainly holds for throughput optimizations. Allow me to qualify
defensively, though: driving down latency still depends quite a bit on
scaling out the rings, at least that's what experience tells me for the
NAPI-based approach. It should hold symmetrically for busy polling,
though there I am only theorizing.
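For concreteness, the two knobs being distinguished here, the number of RX/TX rings versus the depth of each ring, map to separate ethtool parameters. A dry-run sketch (the interface name "eth0" is a placeholder and the commands are only echoed; run the real thing as root on actual hardware):

```shell
# Placeholder interface name; substitute your NIC.
IFACE=eth0

# Number of RX/TX rings (channels): scale out for latency,
# scale in (per the recommendation quoted above) to favour
# NAPI aggregation/bulking.
echo "ethtool -l $IFACE"             # query current channel counts
echo "ethtool -L $IFACE combined 4"  # set the number of queue pairs

# Depth of each ring: frames/pages available per RX ring.
echo "ethtool -g $IFACE"             # query current ring sizes
echo "ethtool -G $IFACE rx 512"      # set the per-ring descriptor count
```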
>
> You might have heard/seen me talk about increasing the ring queue size.
> that is the frames/pages available per RX-ring queue[3][4]. I generally
> don't recommend increasing that too much, as it hurts cache-usage. The
> real reason it sometimes helps to increase the RX-ring size on the
> Intel based NICs is because they intermix page-recycling into their
> RX-ring, which I now added a counter for when it fails[5].
>
> [3] http://netoptimizer.blogspot.dk/2014/10/unlocked-10gbps-tx-wirespeed-smallest.html
> [4] http://netoptimizer.blogspot.dk/2014/06/pktgen-for-network-overload-testing.html
> [5] https://git.kernel.org/torvalds/c/86e23494222f3
>
Presumably, touching the ring length should have been obsolete ever since
DQL, or rather BQL, anyway.
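For reference, BQL state is visible per TX queue under sysfs; a dry-run sketch (paths assume an interface named "eth0", which is a placeholder):

```shell
IFACE=eth0
BQL=/sys/class/net/$IFACE/queues/tx-0/byte_queue_limits

# BQL dynamically computes a byte limit per TX queue, which is why
# manually enlarging rings to avoid starvation should be obsolete.
echo "cat $BQL/limit"      # current dynamically computed byte limit
echo "cat $BQL/inflight"   # bytes currently queued towards the NIC
echo "cat $BQL/limit_max"  # upper clamp, tunable by writing to it
```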
>> Further, it was coined back then that there are hardly any limits to
>> the number other than FPGA magic/physical HW, up to millions being
>> viable. May I ask where this ended up? Wouldn't that be key for
>> massive parallelization as well, with a queue (producer) and a CPU
>> (consumer), or vice versa, per flow at the extreme? Did this end up
>> in this SMART-NIC thingummy? The latter is rather targeted at XDP, no?
> I do have future plans for (wanting drivers to support) dynamically
> adding more RX-TX-queue-pairs. The general idea is to have NIC HW to
> filter packets per application into specific NIC queue number, which
> can be mapped directly into an application (and I want a queue-pair to
> allow the app to TX also).
> I actually imagine that we can do the application steering via
> XDP_REDIRECT. And by having application register user-pages, like
> AF_PACKET-V4, we can achieve zero-copy into userspace from XDP.
I understand: working on a sort of in-kernel 'virtual' TX-RX ring pair
per flow/application.
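Today's closest analogue of that per-application steering is ethtool n-tuple filtering into a specific RX queue; a dry-run sketch (interface name and port are made-up placeholders, commands only echoed):

```shell
IFACE=eth0

# Steer an application's flow (here: TCP dst port 8080, an invented
# example) into RX queue 3, which an XDP program attached on that
# queue's driver path could then redirect or map into userspace.
echo "ethtool -K $IFACE ntuple on"
echo "ethtool -N $IFACE flow-type tcp4 dst-port 8080 action 3"
echo "ethtool -n $IFACE"   # list the installed filters
```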
> A
> subtle trick here is that zero-copy only occurs if the RX-queue number
> match (XDP operating at driver ring level could know), meaning that NIC
> HW filter setup could happen async (but premapping userspace pages
> still have to happen upfront, before starting app/socket).
>
I see, a more sophisticated, flexible RPS then. That was overdue.
Very much appreciated, thanks!
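For comparison, plain software RPS is configured per RX queue through a sysfs CPU bitmask; a dry-run sketch (interface and mask are placeholders):

```shell
IFACE=eth0
MASK=f   # hex CPU bitmask: CPUs 0-3 (placeholder value)

# RPS fans packets from one RX queue out to the CPUs in the mask;
# the CPUMAP/XDP_REDIRECT approach discussed above promises the same
# redistribution earlier and more flexibly, at the driver level.
echo "echo $MASK > /sys/class/net/$IFACE/queues/rx-0/rps_cpus"
echo "cat /sys/class/net/$IFACE/queues/rx-0/rps_cpus"
```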
--
Besten Gruß
Matthias Tafelmeier