Date:	Thu, 27 Jun 2013 10:09:10 +0300
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	netdev <netdev@...r.kernel.org>
Subject: Re: [RFC] about "net: orphan frags on receive" insanity

On Wed, Jun 26, 2013 at 12:44:34PM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote:
> 
> > The point is we don't know the final destination of the packet
> > until it's going through the stack.
> > 
> > We don't want to trigger a copy for all data we get from tun:
> > we only want to do this if the data has a chance to get
> > queued somewhere indefinitely.
> 
> I think you missed my point.
> 
> I am pretty sure it should be done from netif_rx(), not from
> __netif_receive_skb_core()
> so that modern NIC devices do not have to pay this extra cost.

Yes, this is exactly what I thought you meant, but as I said,
this approach disables tun zero copy.
The point of the current code is to copy only if there
are any host protocols that consume the skb.
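
Just as a reminder of what the orphan actually does - roughly, quoting
include/linux/skbuff.h from memory, so don't take the exact code as gospel:

static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
{
	/* Only zerocopy skbs (userspace pages pinned by tun/vhost) pay
	 * anything here; for everyone else this is a no-op. */
	if (likely(!(skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)))
		return 0;
	return skb_copy_ubufs(skb, gfp_mask);
}

So the copy is only paid by zerocopy skbs, and only when they actually reach
a consumer that might hold on to them.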

> # size net/core/dev_*.o
>    text	   data	    bss	    dec	    hex	filename
>   41928	    963	    752	  43643	   aa7b	net/core/dev_before.o
>   41579	    963	    752	  43294	   a91e	net/core/dev_after.o

So we have a lot of code like this:
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                }
                pt_prev = NULL;
and inlining deliver_skb() at each of these call sites is what created all
this bloat.

But in practice deliver_skb() only gets called when a packet has multiple
consumers - the standard path is:
        if (pt_prev) {
                if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
                        goto drop;
                else
                        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        }
So I think calls to deliver_skb() are not all that common - we should not
force gcc to inline them so aggressively; let it make its own decisions.

Here's an alternative patch:


diff --git a/net/core/dev.c b/net/core/dev.c
index fc1e289..03cb51c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1642,9 +1642,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
-static inline int deliver_skb(struct sk_buff *skb,
-			      struct packet_type *pt_prev,
-			      struct net_device *orig_dev)
+static int deliver_skb(struct sk_buff *skb,
+		       struct packet_type *pt_prev,
+		       struct net_device *orig_dev)
 {
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 		return -ENOMEM;


This gives us more than half the gain of your patch
without breaking tun zero copy.

[mst@...in linux]$ size net/core/dev_orig.o 
   text    data     bss     dec     hex filename
  47357    1270     720   49347    c0c3 net/core/dev_orig.o
[mst@...in linux]$ size net/core/dev.o
   text    data     bss     dec     hex filename
  47105    1270     720   49095    bfc7 net/core/dev.o

Maybe we should also tag the if (pt_prev) checks before the deliver_skb()
calls as unlikely(); that would move this code even further from the hot
path, but it needs some testing.
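
Something along these lines (completely untested, and the pt_prev = NULL
placement is just illustrative):

                /* Hint that having a leftover consumer is the rare case,
                 * so gcc keeps the deliver_skb() call out of the hot path. */
                if (unlikely(pt_prev)) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = NULL;
                }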



> Untested patch :
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fc1e289..3730318 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb,
>  			      struct packet_type *pt_prev,
>  			      struct net_device *orig_dev)
>  {
> -	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -		return -ENOMEM;
>  	atomic_inc(&skb->users);
>  	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  }
> @@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb)

this is basically assuming tun will use netif_rx() forever, but I think
it's quite possible that we'll switch it to NAPI down the road.

>  	if (netpoll_rx(skb))
>  		return NET_RX_DROP;
>  
> +	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> +		return NET_RX_DROP;
> +
>  	net_timestamp_check(netdev_tstamp_prequeue, skb);
>  
>  	trace_netif_rx(skb);

And this chunk means all tun data is copied immediately, before it's even
passed to the net core - in effect, no zero copy transmit at all.
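
For context (quoting from memory, so take the exact code with a grain of
salt), tun's path into the stack looks roughly like this:

	/* drivers/net/tun.c, tun_get_user(): hand the skb to the host stack */
	netif_rx_ni(skb);

and netif_rx_ni() ends up in netif_rx(), so with the chunk above every
zerocopy skb from tun gets copied before the stack even looks at it.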

> @@ -3498,10 +3499,7 @@ ncls:
>  	}
>  
>  	if (pt_prev) {
> -		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -			goto drop;
> -		else
> -			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> +		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  	} else {
>  drop:
>  		atomic_long_inc(&skb->dev->rx_dropped);
>


