netdev - Re: [PATCH net-next-2.6] net: introduce build

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110711105323.GB7532@rere.qmqm.pl>
Date:	Mon, 11 Jul 2011 12:53:24 +0200
From:	Michał Mirosław <mirq-linux@...e.qmqm.pl>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev@...r.kernel.org
Subject: Re: [PATCH net-next-2.6] net: introduce build_skb()

On Mon, Jul 11, 2011 at 07:46:46AM +0200, Eric Dumazet wrote:
> Le lundi 11 juillet 2011 à 02:52 +0200, Michał Mirosław a écrit :
> > Introduce __netdev_alloc_skb_aligned() to return skb with skb->data
> > aligned at specified 2^n multiple.
> > 
> > Signed-off-by: Michał Mirosław <mirq-linux@...e.qmqm.pl>
> Hi Michal
> 
> Could we synchronize our work to not introduce things that might
> disappear shortly ?

Sure. Are you saying that you'll convert all drivers to build_skb()? :-)

> Here is the RFC patch about build_skb() :
> 
> [PATCH] net: introduce build_skb()
> 
> One of the thing we discussed during netdev 2011 conference was the idea
> to change network drivers to allocate/populate their skb at RX
> completion time, right before feeding the skb to network stack.
> 
> Right now, we allocate skbs when populating the RX ring, and thats a
> waste of CPU cache, since allocating skb means a full memset() to clear
> the skb and its skb_shared_info portion. By the time NIC fills a frame
> in data buffer and host can get it, cpu probably threw away the cache
> lines from its caches, because of huge RX ring sizes.
> 
> So the deal would be to allocate only the data buffer for the NIC to
> populate its RX ring buffer. And use build_skb() at RX completion to
> attach a data buffer (now filled with an ethernet frame) to a new skb,
> initialize the skb_shared_info portion, and give the hot skb to network
> stack.
> 
> build_skb() is the function to allocate an skb, caller providing the
> data buffer that should be attached to it. Drivers are expected to call 
> skb_reserve() right after build_skb() to let skb->data points to the
> Ethernet frame (usually skipping NET_SKB_PAD and NET_IP_ALIGN)
[...]
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -234,6 +234,54 @@ nodata:
[...]
>  /**
> + * build_skb - build a network buffer
> + * @data: data buffer provider by caller
> + * @size: size of data buffer, not including skb_shared_info
> + *
> + * Allocate a new &sk_buff. Caller provides space holding head and
> + * skb_shared_info. Mostly used in driver RX path.
> + * The return is the buffer. On a failure the return is %NULL.
> + * Notes :
> + *  Before IO, driver allocates only data buffer where NIC put incoming frame
> + *  Driver SHOULD add room at head (NET_SKB_PAD) and
> + *  MUST add room tail (to hold skb_shared_info)
> + *  After IO, driver calls build_skb(), to get a hot skb instead of a cold one
> + *  before giving packet to stack. RX rings only contains data buffers, not
> + *  full skbs.
> + */
> +struct sk_buff *build_skb(void *data, unsigned int size)
> +{
> +	struct skb_shared_info *shinfo;
> +	struct sk_buff *skb;
> +
> +	skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
> +	if (!skb)
> +		return NULL;
> +
> +	size = SKB_DATA_ALIGN(size);
> +
> +	memset(skb, 0, offsetof(struct sk_buff, tail));
> +	skb->truesize = size + sizeof(struct sk_buff);
> +	atomic_set(&skb->users, 1);
> +	skb->head = data;
> +	skb->data = data;
> +	skb_reset_tail_pointer(skb);
> +	skb->end = skb->tail + size;
> +#ifdef NET_SKBUFF_DATA_USES_OFFSET
> +	skb->mac_header = ~0U;
> +#endif
> +
> +	/* make sure we initialize shinfo sequentially */
> +	shinfo = skb_shinfo(skb);
> +	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
> +	atomic_set(&shinfo->dataref, 1);
> +	kmemcheck_annotate_variable(shinfo->destructor_arg);
> +
> +	return skb;
> +}
> +EXPORT_SYMBOL(build_skb);

I like the idea. From driver writer perspective I would like to also
see a function that given max frame size and DMA aligment would allocate
the buffer for me.

In short:

 * rx_refill:

	[ptr, size] = alloc_rx_buffer(size, alignment, offset);
	[dma_addr] = map_buffer(ptr, size);
	append to rx buffer list

 * rx_poll:

	unmap_buffer(dma_addr, size);
	[skb] = build_skb(ptr, size);
	if (!skb)
		reuse_buffer
		return
	skb_reserve(skb, rx_offset);
	skb_put(skb, pkt_len);
	(indicate offloads)
	rx_skb(skb);
	call rx_refill

 * rx_poll with copybreak:

	sync_buffer_to_cpu(dma_addr, data_len);
	[skb, copied] = build_or_copy_skb(ptr, size, hw_rx_offset, pkt_len);
	if (copied || !skb)
		append to rx buffer list
	else
		unmap_buffer(dma_addr, size);
	if (!skb)
		return;
	(indicate offloads)
	rx_skb(skb);
	if (!copied)
		call rx_refill


For even less driver code this could happen:

 * rx_refill(ptr, dma_addr)
	[size/alignment stored in queue or net_device struct]

	if (is_rx_free_list_full)
		return -EBUSY;
	append to rx buffer list [ptr, dma_addr, size]
	return !is_rx_free_list_full;

 * rx_poll with copybreak:
	[copy threshold stored in queue or net_device struct]

	[skb] = finish_rx(ptr, dma_addr, size, hw_rx_offset, pkt_len);
	if (!skb)
		return;
	(fill in offloads: checksum/etc)
	rx_skb(skb);


This could be extended to handle frames spanning multiple buffers.

BTW, napi_get_frags() + napi_gro_frags() use similar idea of allocating
skb late.

Best Regards,
Michał Mirosław
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html