Message-ID: <20171013111347.1472fcfb@redhat.com>
Date: Fri, 13 Oct 2017 11:13:47 +0200
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Edward Cree <ecree@...arflare.com>
Cc: <netdev@...r.kernel.org>, <jakub.kicinski@...ronome.com>,
"Michael S. Tsirkin" <mst@...hat.com>, <pavel.odintsov@...il.com>,
Jason Wang <jasowang@...hat.com>, <mchan@...adcom.com>,
John Fastabend <john.fastabend@...il.com>,
<peter.waskiewicz.jr@...el.com>, <ast@...erby.dk>,
Daniel Borkmann <borkmann@...earbox.net>,
Alexei Starovoitov <alexei.starovoitov@...il.com>,
Andy Gospodarek <andy@...yhouse.net>, brouer@...hat.com
Subject: Re: [net-next V7 PATCH 3/5] bpf: cpumap xdp_buff to skb conversion
and allocation
On Thu, 12 Oct 2017 22:13:43 +0100
Edward Cree <ecree@...arflare.com> wrote:
> On 12/10/17 13:26, Jesper Dangaard Brouer wrote:
> > This patch makes cpumap functional by adding SKB allocation and
> > invoking the network stack on the dequeuing CPU.
> >
> > For constructing the SKB on the remote CPU, the xdp_buff is converted
> > into a struct xdp_pkt, which is mapped into the top headroom of the
> > packet, to avoid allocating separate memory. For now, struct xdp_pkt
> > is just a cpumap-internal data structure, with info carried from
> > enqueue to dequeue.
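
The enqueue side of this conversion is not shown in the hunk below.
As a rough sketch, inferred from the xdp_pkt fields the dequeue code
uses -- the exact patch code may differ:

	/* cpumap-internal metadata, stored in the frame's top headroom */
	struct xdp_pkt {
		void *data;
		u16 len;
		u16 headroom;
		u16 metasize;
		struct net_device *dev_rx; /* device the packet arrived on */
	};

	static struct xdp_pkt *convert_to_xdp_pkt(struct xdp_buff *xdp)
	{
		struct xdp_pkt *xdp_pkt;
		int metasize;
		int headroom;

		/* Make sure the headroom can hold a struct xdp_pkt */
		headroom = xdp->data - xdp->data_hard_start;
		if (headroom < sizeof(*xdp_pkt))
			return NULL;

		metasize = xdp->data - xdp->data_meta;
		metasize = metasize > 0 ? metasize : 0;

		/* Store info in the top of the packet's headroom */
		xdp_pkt = xdp->data_hard_start;
		xdp_pkt->data = xdp->data;
		xdp_pkt->len  = xdp->data_end - xdp->data;
		xdp_pkt->headroom = headroom - sizeof(*xdp_pkt);
		xdp_pkt->metasize = metasize;

		return xdp_pkt;
	}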
>
> <snip>
>
> > +struct sk_buff *cpu_map_build_skb(struct bpf_cpu_map_entry *rcpu,
> > + struct xdp_pkt *xdp_pkt)
> > +{
> > + unsigned int frame_size;
> > + void *pkt_data_start;
> > + struct sk_buff *skb;
> > +
> > + /* build_skb needs to place skb_shared_info after the SKB end, and
> > + * also wants to know the memory "truesize". Thus, we need to
> > + * know the memory frame size backing the xdp_buff.
> > + *
> > + * XDP was designed to have PAGE_SIZE frames, but this
> > + * assumption is no longer true with ixgbe and i40e. It
> > + * would be preferred to set frame_size to 2048 or 4096
> > + * depending on the driver.
> > + * frame_size = 2048;
> > + * frame_len = frame_size - sizeof(*xdp_pkt);
> > + *
> > + * Instead, with the info available, skb_shared_info is placed
> > + * after the packet length. This, unfortunately, fakes the
> > + * truesize. Another disadvantage of this approach is that the
> > + * skb_shared_info is not at a fixed memory location with
> > + * mixed-length packets, which is bad for cache-line hotness.
> > + */
> > + frame_size = SKB_DATA_ALIGN(xdp_pkt->len) + xdp_pkt->headroom +
> > + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +
> > + pkt_data_start = xdp_pkt->data - xdp_pkt->headroom;
> > + skb = build_skb(pkt_data_start, frame_size);
> > + if (!skb)
> > + return NULL;
> > +
> > + skb_reserve(skb, xdp_pkt->headroom);
> > + __skb_put(skb, xdp_pkt->len);
> > + if (xdp_pkt->metasize)
> > + skb_metadata_set(skb, xdp_pkt->metasize);
> > +
> > + /* Essential SKB info: protocol and skb->dev */
> > + skb->protocol = eth_type_trans(skb, xdp_pkt->dev_rx);
> > +
> > + /* Optional SKB info, currently missing:
> > + * - HW checksum info (skb->ip_summed)
> > + * - HW RX hash (skb_set_hash)
> > + * - RX ring dev queue index (skb_record_rx_queue)
> > + */
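
To put numbers on the truesize point: take a 1000 byte packet with
128 bytes of remaining headroom, on x86_64 with 64 byte cache lines,
assuming sizeof(struct skb_shared_info) is 320 (config-dependent):

	frame_size = SKB_DATA_ALIGN(1000) + 128 + SKB_DATA_ALIGN(320)
	           = 1024 + 128 + 320 = 1472

build_skb() then places skb_shared_info at offset 1152 (128 + 1024)
into the frame, which moves with the packet length, while the real
backing frame is likely 2048 or 4096 bytes -- hence the faked
truesize.

If the HW info were carried in xdp_pkt, filling in the optional
fields would be the usual helpers, e.g. (rx_hash and rx_queue_index
are hypothetical fields, not part of this patch):

	skb->ip_summed = CHECKSUM_UNNECESSARY;
	skb_set_hash(skb, xdp_pkt->rx_hash, PKT_HASH_TYPE_L4);
	skb_record_rx_queue(skb, xdp_pkt->rx_queue_index);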
> One possibility for dealing with these and related issues — also things
> like the proper way to free an xdp_buff if SKB creation fails, which
> might not be page_frag_free() for some drivers with unusual recycle ring
> implementations — is to have a new ndo for 'receiving' an xdp_pkt from a
> cpumap redirect.
> Since you're always receiving from the same driver that enqueued it, even
> the structure of the metadata stored in the top of the packet page
> doesn't have to be standardised; instead, each driver can put there just
> whatever happens to be needed for its ndo_xdp_rx routine. (Though there
> would probably be standard enqueue and dequeue functions that the
> 'common-case' drivers could use.)
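
To make that concrete, a sketch -- nothing below exists in net-next;
the ndo_xdp_rx name is from the paragraph above, the signature and
the metadata struct are made up:

	/* Driver-private metadata in the packet page; layout is
	 * whatever this driver's own ndo_xdp_rx needs.
	 */
	struct mydrv_xdp_pkt {
		u16 len;
		u16 headroom;
		u32 rx_hash;	/* e.g. stashed from the RX event */
	};

	/* net_device_ops addition: on the remote CPU, build an SKB
	 * from a frame this same driver enqueued, or free the frame
	 * the driver's own way and return NULL.
	 */
	struct sk_buff *(*ndo_xdp_rx)(struct net_device *dev,
				      void *frame);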
> In some cases, the driver could even just leave in the page the packet
> prefix it got from the NIC, rather than reading it and then writing an
> interpreted version back, thus minimising the number of packet-page
> cachelines the 'bottom half' RX function has to touch (it would still
> need to write in anything it got from the RX event, of course).
> It shouldn't be much work as many driver RX routines are already
> structured this way — sfc, for instance, has a split into efx_rx_packet()
> and __efx_rx_packet(), as a software pipeline for prefetching.
This is all talking about future work. I'll be happy to discuss this
outside/after this patchset. I see nothing blocking these ideas.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer