Message-ID: <1453985663.7627.15.camel@edumazet-glaptop2.roam.corp.google.com>
Date: Thu, 28 Jan 2016 04:54:23 -0800
From: Eric Dumazet <eric.dumazet@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Alexei Starovoitov <alexei.starovoitov@...il.com>,
John Fastabend <john.fastabend@...il.com>,
Tom Herbert <tom@...bertland.com>,
"Michael S. Tsirkin" <mst@...hat.com>,
David Miller <davem@...emloft.net>,
Or Gerlitz <gerlitz.or@...il.com>,
Eric Dumazet <edumazet@...gle.com>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Alexander Duyck <alexander.duyck@...il.com>,
Daniel Borkmann <borkmann@...earbox.net>,
Marek Majkowski <marek@...udflare.com>,
Hannes Frederic Sowa <hannes@...essinduktion.org>,
Florian Westphal <fw@...len.de>,
Paolo Abeni <pabeni@...hat.com>,
John Fastabend <john.r.fastabend@...el.com>,
Amir Vadai <amirva@...il.com>,
Daniel Borkmann <daniel@...earbox.net>,
Vladislav Yasevich <vyasevich@...il.com>
Subject: Re: Bypass at packet-page level (Was: Optimizing instruction-cache,
more packets at each stage)
On Thu, 2016-01-28 at 10:52 +0100, Jesper Dangaard Brouer wrote:
> I'm still in flux/undecided about how long we should delay the first
> touch of pkt-data, which happens when calling eth_type_trans(). Should
> it stay in the driver or not?
>
> In the extreme case, to optimize for RPS sending to remote CPUs, delay
> calling eth_type_trans() as long as possible.
>
> 1. In driver only start prefetch data to L2/L3 cache
> 2. Stack calls get_rps_cpu(), assuming skb_get_hash() has the HW hash
> 3. (Bulk) enqueue on remote_cpu->sd->input_pkt_queue
> 4. On remote CPU in process_backlog call eth_type_trans() on sd->input_pkt_queue
>
>
> On the other hand, if the HW desc can provide skb->protocol, and we can
> lazily eval skb->pkt_type, then it is okay to keep that responsibility
> in the driver (as the call to eth_type_trans() basically disappears).
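
For concreteness, a rough sketch of what step 4 above could look like on
the remote CPU. process_backlog_deferred(), and the assumption that the
driver left skb->protocol unset, are hypothetical, not existing kernel
code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Sketch only: a process_backlog() variant performing the deferred
 * eth_type_trans() on the consuming CPU, assuming the driver only
 * prefetched the headers (step 1) and never touched pkt-data itself.
 * Queue locking is omitted for brevity.
 */
static int process_backlog_deferred(struct napi_struct *napi, int quota)
{
	struct softnet_data *sd = container_of(napi, struct softnet_data,
					       backlog);
	struct sk_buff *skb;
	int work = 0;

	while (work < quota &&
	       (skb = __skb_dequeue(&sd->input_pkt_queue)) != NULL) {
		/* First touch of packet data happens here, so the
		 * header cache lines land in this CPU's L1/L2.
		 */
		skb->protocol = eth_type_trans(skb, skb->dev);
		__netif_receive_skb(skb);	/* internal RX entry point */
		work++;
	}
	return work;
}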
Delaying means GRO won't be able to recycle its super-hot skb (see
napi_get_frags()).
You might optimize packet reception in the router case (where the GRO
aggregation rate is poor), but you'll slow down GRO efficiency when
receiving nice GRO trains.
When we receive a train of 10 MSS, the driver keeps using the same
sk_buff, which stays very hot in its L1 cache.
(This was the original idea of build_skb(): to get nice cache locality
for the metadata, since each sk_buff spans 4 cache lines.)
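
The reuse pattern looks roughly like this in a driver RX path. This is a
simplified illustration, not taken from any driver; rx_poll_one() is a
made-up helper, but napi_get_frags()/napi_gro_frags() are the real API:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Simplified RX-poll step using the GRO frag API: the skb returned by
 * napi_get_frags() is recycled across merged packets of a train, so
 * its metadata cache lines stay hot in the polling CPU's L1.
 */
static int rx_poll_one(struct napi_struct *napi, struct page *page,
		       unsigned int offset, unsigned int len)
{
	struct sk_buff *skb = napi_get_frags(napi);

	if (unlikely(!skb))
		return -ENOMEM;

	/* Attach the received fragment; packet data is never copied. */
	skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags, page, offset, len);
	skb->len += len;
	skb->data_len += len;
	skb->truesize += PAGE_SIZE;

	/* On merge, GRO keeps napi->skb around for the next packet. */
	napi_gro_frags(napi);
	return 0;
}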
Most drivers still have no clue why it is important to allocate the skb
_after_ receiving the Ethernet frame and not in advance. (The lazy
drivers allocate ~1024 skbs to prefill their ~1024-slot RX ring.)
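
A minimal sketch of the allocate-after-receive pattern, assuming the NIC
has already DMA'ed the frame into a page fragment at data + NET_SKB_PAD;
rx_build_skb() is illustrative, not lifted from an actual driver:

#include <linux/etherdevice.h>
#include <linux/skbuff.h>

/* Wrap an already-received buffer in an sk_buff: build_skb() runs only
 * after the frame has landed in memory, so the skb metadata cache lines
 * are written right before they are used, instead of going cold in a
 * prefilled RX ring.
 */
static struct sk_buff *rx_build_skb(struct net_device *dev, void *data,
				    unsigned int frag_size, unsigned int len)
{
	struct sk_buff *skb = build_skb(data, frag_size);

	if (unlikely(!skb))
		return NULL;

	skb_reserve(skb, NET_SKB_PAD);	/* assumed headroom before the frame */
	skb_put(skb, len);
	skb->protocol = eth_type_trans(skb, dev);
	return skb;
}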