Date:   Thu, 1 Dec 2016 15:46:40 -0800
From:   Tom Herbert <tom@...bertland.com>
To:     Hannes Frederic Sowa <hannes@...essinduktion.org>
Cc:     Florian Westphal <fw@...len.de>,
        Linux Kernel Network Developers <netdev@...r.kernel.org>,
        Jesper Dangaard Brouer <jbrouer@...hat.com>
Subject: Re: Initial thoughts on TXDP

On Thu, Dec 1, 2016 at 2:47 PM, Hannes Frederic Sowa
<hannes@...essinduktion.org> wrote:
> Side note:
>
> On 01.12.2016 20:51, Tom Herbert wrote:
>>> > E.g. "mini-skb": Even if we assume that this provides a speedup
>>> > (where does that come from? should make no difference if a 32 or
>>> >  320 byte buffer gets allocated).
>>> >
>> It's the zeroing of three cache lines. I believe we talked about that
>> at netdev.
>
> Jesper and I played with that again very recently:
>
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_memset.c#L590
>
> In micro-benchmarks we saw a pretty good speedup from using plain
> movq's instead of the rep stosb generated by the gcc builtin. Probably
> the cost model for __builtin_memset in gcc is wrong?
>
> When Jesper is free we wanted to benchmark this and maybe come up with
> an arch-specific way of clearing if it turns out to really improve
> throughput.
>
> SIMD instructions seem even faster but the kernel_fpu_begin/end() kill
> all the benefits.
>
One nice direction of XDP is that it forces drivers to defer
allocating (and hence zero'ing) skbs. In the receive path I think we
can exploit this property deeper into the stack. The only time we
_really_ need to allocate an skbuff is when we need to put the packet
onto a
queue. All the other use cases are really just to pass a structure
containing a packet from function to function. For that purpose we
should be able to just pass a much smaller structure in a stack
argument and only allocate an skbuff when we need to enqueue. In cases
where we don't ever queue a packet we might never need to allocate any
skbuff-- this includes pure ACKs and packets that end up being dropped.
But even more than that, if a received packet generates a TX packet
(like a SYN causes a SYN-ACK) then we might even be able to just
recycle the received packet and avoid needing any skbuff allocation on
transmit (XDP_TX already does this in a limited context)--  this could
be a win to handle SYN attacks for instance. Also, since we don't
queue on the socket buffer for UDP it's conceivable we could avoid
skbuffs in an expedited UDP TX path.

Currently, nearly the whole stack depends on packets always being
passed in skbuffs; however, __skb_flow_dissect is an interesting
exception as it can handle packets passed either in an skbuff or as
just a void *-- so we know that this "dual mode" is at least possible.
Trying to retrain the whole stack to be able to handle both skbuffs
and raw pages is probably untenable at this point, but selectively
augmenting some critical performance functions for dual mode (ip_rcv,
tcp_rcv, udp_rcv functions for instance) might work.

Thanks,
Tom

> Bye,
> Hannes
>
