Date:   Thu, 12 Oct 2017 22:13:43 +0100
From:   Edward Cree <ecree@...arflare.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>,
        <netdev@...r.kernel.org>
CC:     <jakub.kicinski@...ronome.com>,
        "Michael S. Tsirkin" <mst@...hat.com>, <pavel.odintsov@...il.com>,
        Jason Wang <jasowang@...hat.com>, <mchan@...adcom.com>,
        John Fastabend <john.fastabend@...il.com>,
        <peter.waskiewicz.jr@...el.com>, <ast@...erby.dk>,
        Daniel Borkmann <borkmann@...earbox.net>,
        Alexei Starovoitov <alexei.starovoitov@...il.com>,
        Andy Gospodarek <andy@...yhouse.net>
Subject: Re: [net-next V7 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion
 and allocation

On 12/10/17 13:26, Jesper Dangaard Brouer wrote:
> This patch makes cpumap functional, by adding SKB allocation and
> invoking the network stack on the dequeuing CPU.
>
> For constructing the SKB on the remote CPU, the xdp_buff is converted
> into a struct xdp_pkt, which is mapped into the top headroom of the
> packet, to avoid allocating separate memory.  For now, struct xdp_pkt
> is just a cpumap-internal data structure, with info carried from
> enqueue to dequeue.

<snip>

> +struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> +				  struct xdp_pkt *xdp_pkt)
> +{
> +	unsigned int frame_size;
> +	void *pkt_data_start;
> +	struct sk_buff *skb;
> +
> +	/* build_skb() needs to place skb_shared_info after the SKB end,
> +	 * and also wants to know the memory "truesize".  Thus, we need
> +	 * to know the size of the memory frame backing the xdp_buff.
> +	 *
> +	 * XDP was designed to have PAGE_SIZE frames, but this
> +	 * assumption is no longer true with ixgbe and i40e.  It would
> +	 * be preferable to set frame_size to 2048 or 4096 depending on
> +	 * the driver:
> +	 *   frame_size = 2048;
> +	 *   frame_len  = frame_size - sizeof(*xdp_pkt);
> +	 *
> +	 * Instead, with the info available here, skb_shared_info is
> +	 * placed after the packet length.  This unfortunately fakes the
> +	 * truesize.  Another disadvantage of this approach is that with
> +	 * mixed-length packets the skb_shared_info is not at a fixed
> +	 * memory location, which is bad for cache-line hotness.
> +	 */
> +	frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
> +		SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +
> +	pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
> +	skb = build_skb(pkt_data_start, frame_size);
> +	if (!skb)
> +		return NULL;
> +
> +	skb_reserve(skb, xdp_pkt->headroom);
> +	__skb_put(skb, xdp_pkt->len);
> +	if (xdp_pkt->metasize)
> +		skb_metadata_set(skb, xdp_pkt->metasize);
> +
> +	/* Essential SKB info: protocol and skb->dev */
> +	skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
> +
> +	/* Optional SKB info, currently missing:
> +	 * - HW checksum info		(skb->ip_summed)
> +	 * - HW RX hash			(skb_set_hash)
> +	 * - RX ring dev queue index	(skb_record_rx_queue)
> +	 */
One possibility for dealing with these and related issues — also things
like the proper way to free an xdp_buff if SKB creation fails, which
might not be page_frag_free() for some drivers with unusual recycle-ring
implementations — is to have a new ndo for 'receiving' an xdp_pkt from a
cpumap redirect.
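Roughly, the shape would be (signature entirely made up, purely to
illustrate):

/* Hypothetical new struct net_device_ops member: build (or otherwise
 * dispose of) an SKB for an xdp_pkt this driver previously enqueued to
 * a cpumap.  Runs on the remote, dequeuing CPU.
 */
struct sk_buff *(*ndo_xdp_rx)(struct net_device *dev,
			      struct xdp_pkt *xdp_pkt);

and the cpumap dequeue loop would then do something like

	skb = dev->netdev_ops->ndo_xdp_rx(dev, xdp_pkt);
	if (skb)
		netif_receive_skb(skb);

leaving the driver, not cpumap, to decide how to free the page when it
returns NULL.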
Since you're always receiving from the same driver that enqueued it, even
the structure of the metadata stored in the top of the packet page
doesn't have to be standardised; instead, each driver can put there just
whatever happens to be needed for its ndo_xdp_rx routine.  (Though there
would probably be standard enqueue and dequeue functions that the
'common-case' drivers could use.)
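For example, a driver whose RX events carry checksum and hash info could
stash exactly the fields its own bottom half wants (layout invented for
illustration):

/* Hypothetical driver-private replacement for xdp_pkt, written into the
 * frame headroom at enqueue time and read back only by this driver's
 * ndo_xdp_rx.  Covers the "optional SKB info" the generic code above
 * can't fill in.
 */
struct mydrv_xdp_pkt {
	u16 len;
	u16 headroom;
	u16 rx_queue;	/* for skb_record_rx_queue() */
	u32 rx_hash;	/* for skb_set_hash() */
	u8  ip_summed;	/* HW checksum state for skb->ip_summed */
};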
In some cases, the driver could even just leave in the page the packet
prefix it got from the NIC, rather than reading it and then writing an
interpreted version back, thus minimising the number of packet-page
cachelines the 'bottom half' RX function has to touch (it would still
need to write in anything it got from the RX event, of course).
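Something along these lines (all names, offsets and the prefix layout
invented):

/* Hypothetical variant: the first bytes are the NIC's RX prefix, left
 * exactly as the hardware wrote it, followed only by the fields that
 * came from the RX event itself.
 */
struct mydrv_raw_xdp_pkt {
	u8  hw_prefix[16];	/* untouched NIC RX prefix */
	u16 len;		/* from the RX event */
	u16 rx_queue;		/* from the RX event */
};

static void mydrv_xdp_rx_finish(struct sk_buff *skb, void *page_start)
{
	struct mydrv_raw_xdp_pkt *meta = page_start;

	/* Parse the hash straight out of the raw HW prefix; the bottom
	 * half is the first thing to read these bytes.
	 */
	skb_set_hash(skb, get_unaligned_le32(meta->hw_prefix + 4),
		     PKT_HASH_TYPE_L4);
	skb_record_rx_queue(skb, meta->rx_queue);
}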
It shouldn't be much work, as many driver RX routines are already
structured this way — sfc, for instance, splits its RX path into
efx_rx_packet() and __efx_rx_packet(), as a software pipeline for
prefetching.
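The general shape (a generic sketch with invented types, not sfc's
actual code):

/* Stage 1: runs per RX event.  Start the prefetch and remember the
 * buffer; don't touch packet memory yet.
 */
static void mydrv_rx_packet(struct mydrv_channel *ch,
			    struct mydrv_rx_buf *buf)
{
	prefetch(buf->va);
	ch->pending_rx = buf;
}

/* Stage 2: runs a little later (e.g. when the next event arrives, or at
 * the end of the event batch), by which time the prefetch has hopefully
 * landed: parse the prefix, build and deliver the SKB.
 */
static void mydrv_rx_packet_finish(struct mydrv_channel *ch)
{
	struct mydrv_rx_buf *buf = ch->pending_rx;

	if (!buf)
		return;
	ch->pending_rx = NULL;
	/* ... headers should be cache-hot now; do the real work ... */
}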

-Ed
