Date:	Sun, 20 Feb 2011 13:12:01 +0100
From:	Nicolas de Pesloüan 
	<nicolas.2p.debian@...il.com>
To:	Jiri Pirko <jpirko@...hat.com>
CC:	Jay Vosburgh <fubar@...ibm.com>,
	David Miller <davem@...emloft.net>, kaber@...sh.net,
	eric.dumazet@...il.com, netdev@...r.kernel.org,
	shemminger@...ux-foundation.org, andy@...yhouse.net,
	"Fischer, Anna" <anna.fischer@...com>
Subject: Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler

On 20/02/2011 11:36, Jiri Pirko wrote:
> Sat, Feb 19, 2011 at 09:27:37PM CET, nicolas.2p.debian@...il.com wrote:
>> On 19/02/2011 14:46, Jiri Pirko wrote:
>>> Sat, Feb 19, 2011 at 02:18:00PM CET, nicolas.2p.debian@...il.com wrote:
>>>> On 19/02/2011 12:28, Jiri Pirko wrote:
>>>>> Sat, Feb 19, 2011 at 12:08:31PM CET, jpirko@...hat.com wrote:
>>>>>> Sat, Feb 19, 2011 at 11:56:23AM CET, nicolas.2p.debian@...il.com wrote:
>>>>>>> On 19/02/2011 09:05, Jiri Pirko wrote:
>>>>>>>> This patch converts bonding to use rx_handler. Results in cleaner
>>>>>>>> __netif_receive_skb() with much less exceptions needed. Also
>>>>>>>> bond-specific work is moved into bond code.
>>>>>>>>
>>>>>>>> Signed-off-by: Jiri Pirko<jpirko@...hat.com>
>>>>>>>>
>>>>>>>> v1->v2:
>>>>>>>>          using skb_iif instead of new input_dev to remember original
>>>>>>>> 	device
>>>>>>>> v2->v3:
>>>>>>>> 	set orig_dev = skb->dev if skb_iif is set
>>>>>>>>
>>>>>>>
>>>>>>> Why do we need to let the rx_handlers call netif_rx() or __netif_receive_skb()?
>>>>>>>
>>>>>>> Bonding used to be handled with very little overhead, simply replacing
>>>>>>> skb->dev with skb->dev->master. Time has passed and we eventually
>>>>>>> added a lot of bonding-specific processing to __netif_receive_skb(),
>>>>>>> but the overhead remained very light.
>>>>>>>
>>>>>>> Calling netif_rx() (or __netif_receive_skb()) to allow nesting would probably lead to some overhead.
>>>>>>>
>>>>>>> Can't we, instead, loop inside __netif_receive_skb(), and deliver
>>>>>>> whatever needs to be delivered, to whoever needs it, inside the loop?
>>>>>>>
>>>>>>> rx_handler = rcu_dereference(skb->dev->rx_handler);
>>>>>>> while (rx_handler) {
>>>>>>> 	/* ...  */
>>>>>>> 	orig_dev = skb->dev;
>>>>>>> 	skb = rx_handler(skb);
>>>>>>> 	/* ... */
>>>>>>> 	rx_handler = (skb->dev != orig_dev) ? rcu_dereference(skb->dev->rx_handler) : NULL;
>>>>>>> }
>>>>>>>
>>>>>>> This would reduce the overhead, while still allowing nesting: vlan on
>>>>>>> top of bonding, bridge on top of bonding, ...
>>>>>>
>>>>>> I see your point. Makes sense to me. But the loop would have to include
>>>>>> at least processing of ptype_all too. I'm going to cook a follow-up
>>>>>> patch.
>>>>>>
>>>>>
>>>>> DRAFT (doesn't modify rx_handlers):
>>>>>
>>>>> diff --git a/net/core/dev.c b/net/core/dev.c
>>>>> index 4ebf7fe..e5dba47 100644
>>>>> --- a/net/core/dev.c
>>>>> +++ b/net/core/dev.c
>>>>> @@ -3115,6 +3115,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   {
>>>>>   	struct packet_type *ptype, *pt_prev;
>>>>>   	rx_handler_func_t *rx_handler;
>>>>> +	struct net_device *dev;
>>>>>   	struct net_device *orig_dev;
>>>>>   	struct net_device *null_or_dev;
>>>>>   	int ret = NET_RX_DROP;
>>>>> @@ -3129,7 +3130,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   	if (netpoll_receive_skb(skb))
>>>>>   		return NET_RX_DROP;
>>>>>
>>>>> -	__this_cpu_inc(softnet_data.processed);
>>>>> +	skb->skb_iif = skb->dev->ifindex;
>>>>> +	orig_dev = skb->dev;
>>>>
>>>> orig_dev should be set inside the loop, to reflect "previously
>>>> crossed device", while following the path:
>>>>
>>>> eth0 ->   bond0 ->   br0.
>>>>
>>>> First step inside loop:
>>>>
>>>> orig_dev = eth0
>>>> skb->dev = bond0 (at the end of the loop).
>>>>
>>>> Second step inside loop:
>>>>
>>>> orig_dev = bond0
>>>> skb->dev = br0 (at the end of the loop).
>>>>
>>>> This would allow for exact match delivery to bond0 if someone binds there.
>>>>
>>>>> +
>>>>>   	skb_reset_network_header(skb);
>>>>>   	skb_reset_transport_header(skb);
>>>>>   	skb->mac_len = skb->network_header - skb->mac_header;
>>>>> @@ -3138,12 +3141,9 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>
>>>>>   	rcu_read_lock();
>>>>>
>>>>> -	if (!skb->skb_iif) {
>>>>> -		skb->skb_iif = skb->dev->ifindex;
>>>>> -		orig_dev = skb->dev;
>>>>> -	} else {
>>>>> -		orig_dev = dev_get_by_index_rcu(dev_net(skb->dev), skb->skb_iif);
>>>>> -	}
>>>>
>>>> I like the fact that it removes the above part.
>>>>
>>>>> +another_round:
>>>>> +	__this_cpu_inc(softnet_data.processed);
>>>>> +	dev = skb->dev;
>>>>>
>>>>>   #ifdef CONFIG_NET_CLS_ACT
>>>>>   	if (skb->tc_verd & TC_NCLS) {
>>>>> @@ -3153,7 +3153,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>>   #endif
>>>>>
>>>>>   	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>> +		if (!ptype->dev || ptype->dev == dev) {
>>>>>   			if (pt_prev)
>>>>>   				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>>   			pt_prev = ptype;
>>>>
>>>> Inside the loop, we should only do exact match delivery, for
>>>> &ptype_all and for &ptype_base[ntohs(type) & PTYPE_HASH_MASK]:
>>>>
>>>>         list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>> -               if (!ptype->dev || ptype->dev == dev) {
>>>> +               if (ptype->dev == dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>>
>>>>         list_for_each_entry_rcu(ptype,
>>>>                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>>>                 if (ptype->type == type &&
>>>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>> +                   (ptype->dev == skb->dev)) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>> After leaving the loop, we can do wildcard delivery, if skb is not NULL.
>>>>
>>>>         list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>> -               if (!ptype->dev || ptype->dev == dev) {
>>>> +               if (!ptype->dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                }
>>>>         }
>>>>
>>>>
>>>>         list_for_each_entry_rcu(ptype,
>>>>                         &ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>>>> -               if (ptype->type == type &&
>>>> -                   (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>>>> +		if (ptype->type == type && !ptype->dev) {
>>>>                         if (pt_prev)
>>>>                                 ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>                         pt_prev = ptype;
>>>>                 }
>>>>         }
>>>>
>>>> This would reduce the number of tests inside the
>>>> list_for_each_entry_rcu() loops. And because we match only ptype->dev
>>>> == dev inside the loop and !ptype->dev outside the loop, this should
>>>> avoid duplicate delivery.
>>>
>>> Would you care to put this into patch so I can see the whole picture?
>>> Thanks.
>>
>> Here is what I have in mind. It is based on your previous DRAFT patch, and doesn't modify rx_handlers yet.
>>
>> Only compile-tested!
>>
>> I don't know if every piece is in the right place. I wonder what to
>> do with the CONFIG_NET_CLS_ACT part, which currently sits between the
>> ptype_all and ptype_base processing.
>>
>> Anyway, the general idea is there.
>>
>> 	Nicolas.
>>
>> net/core/dev.c |   70 ++++++++++++++++++++++++++++++++++++++++++++++++--------
>> 1 files changed, 60 insertions(+), 10 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index e5dba47..7e007a9 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -3117,7 +3117,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	rx_handler_func_t *rx_handler;
>> 	struct net_device *dev;
>> 	struct net_device *orig_dev;
>> -	struct net_device *null_or_dev;
>> 	int ret = NET_RX_DROP;
>> 	__be16 type;
>>
>> @@ -3130,9 +3129,6 @@ static int __netif_receive_skb(struct sk_buff *skb)
>> 	if (netpoll_receive_skb(skb))
>> 		return NET_RX_DROP;
>>
>> -	skb->skb_iif = skb->dev->ifindex;
>> -	orig_dev = skb->dev;
>> -
>> 	skb_reset_network_header(skb);
>> 	skb_reset_transport_header(skb);
>> 	skb->mac_len = skb->network_header - skb->mac_header;
>> @@ -3143,6 +3139,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>
>> another_round:
>> 	__this_cpu_inc(softnet_data.processed);
>> +	skb->skb_iif = skb->dev->ifindex;
>> +	orig_dev = skb->dev;
> orig_dev should be set at the end of the loop. Now you are going to have
> it always the same as dev and skb->dev.
>

Yes, you are right.

Thinking about all this, I wonder what the protocol handlers expect as the orig_dev value.

Let's imagine the following configuration: eth0 -> bond0 -> br0.

What does a protocol handler listening on br0 expect for orig_dev? bond0 or eth0? The current
implementation gives eth0, but I think bond0 would be the right value, for proper nesting.

>> 	dev = skb->dev;
>>
>> #ifdef CONFIG_NET_CLS_ACT
>> @@ -3152,8 +3150,13 @@ another_round:
>> 	}
>> #endif
>>
>> +	/*
>> +	 * Deliver to ptype_all protocol handlers that match current dev.
>> +	 * This happens before rx_handler is given a chance to change skb->dev.
>> +	 */
>> +
>> 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>> -		if (!ptype->dev || ptype->dev == dev) {
>> +		if (ptype->dev == dev) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>> @@ -3167,6 +3170,31 @@ another_round:
>> ncls:
>> #endif
>>
>> +	/*
>> +	 * Deliver to ptype_base protocol handlers that match current dev.
>> +	 * This happens before rx_handler is given a chance to change skb->dev.
>> +	 */
>> +
>> +	type = skb->protocol;
>> +	list_for_each_entry_rcu(ptype,
>> +			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>> +		if (ptype->type == type && ptype->dev == skb->dev) {
>> +			if (pt_prev)
>> +				ret = deliver_skb(skb, pt_prev, orig_dev);
>> +			pt_prev = ptype;
>> +		}
>> +	}
>
> I'm not sure it is ok to deliver ptype_base here. See comment above
> ptype_head() (I'm not sure I understand that correctly)

Anyway, all this is probably plain wrong: delivering the skb to protocol handlers while we are
still changing it is guaranteed to cause strange behavior.

If we want to be able to deliver the skb to different protocol handlers and give all of them the
right values for skb->dev and orig_dev (or previous_dev), we might end up copying the skb. I
hate the idea, but currently can't find a cleaner way to do it.

We first need to clarify what orig_dev should be, as stated above.

>> +
>> +	/*
>> +	 * Call rx_handler for current device.
>> +	 * If rx_handler returns NULL, skip wildcard protocol handler delivery.
>> +	 * Else, if skb->dev changed, restart the whole delivery process, to
>> +	 * allow for device nesting.
>> +	 *
>> +	 * Warning:
>> +	 * rx_handlers must kfree_skb(skb) if they return NULL.
> Well this is not true. They can return NULL and call netif_rx as they
> have before. No changes necessary I believe.

I don't really know. This needs to be double checked, anyway.

>> +	 */
>> +
>> 	rx_handler = rcu_dereference(dev->rx_handler);
>> 	if (rx_handler) {
>> 		if (pt_prev) {
>> @@ -3176,10 +3204,15 @@ ncls:
>> 		skb = rx_handler(skb);
>> 		if (!skb)
>> 			goto out;
>> -		if (dev != skb->dev)
>> +		if (skb->dev != dev)
>> 			goto another_round;
>> 	}
>>
>> +	/*
>> +	 * FIXME: The part below should use rx_handler instead of being hard
>> +	 * coded here.
> I'm not sure it is doable atm. For bridge and bond it should not be a
> problem, but for macvlan it is possible to have macvlans and vlans
> on the same dev. This possibility should persist.
> /me scratches head on the idea to have multiple rx_handlers although it
> was his original idea....

I think your original proposal of having several rx_handlers per device was right.

At the time you introduced the rx_handler system, only bridge and macvlan used it. Even if
combining bridge and macvlan on the same base device might be pointless, the same is not true for
every possible rx_handler configuration.

Now that we want to move bonding and vlan to the rx_handler system, it becomes obvious that we
need several rx_handlers per device. At the very least, vlan should properly mix with bridge. And
who knows what the fifth rx_handler will be...

>> +	 */
>> +
>> 	if (vlan_tx_tag_present(skb)) {
>> 		if (pt_prev) {
>> 			ret = deliver_skb(skb, pt_prev, orig_dev);
>> @@ -3192,16 +3225,33 @@ ncls:
>> 			goto out;
>> 	}
>>
>> +	/*
>> +	 * FIXME: Can't this be moved into the rx_handler for bonding,
>> +	 * or into a future rx_handler for vlan?
> This hook is something I do not like at all :/ But anyway if should be in vlan
> part I think.

Yes, and in order for the future rx_handler for vlan to handle it properly, it needs to know the
device just below it, not the original physical device. Hence my question about the exact meaning
of orig_dev...

	Nicolas.

>> +	 */
>> +
>> 	vlan_on_bond_hook(skb);
>>
>> -	/* deliver only exact match when indicated */
>> -	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
>> +	/*
>> +	 * Deliver to wildcard ptype_all protocol handlers.
>> +	 */
>> +
>> +	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>> +		if (!ptype->dev) {
>> +			if (pt_prev)
>> +				ret = deliver_skb(skb, pt_prev, orig_dev);
>> +			pt_prev = ptype;
>> +		}
>> +	}
>> +
>> +	/*
>> +	 * Deliver to wildcard ptype_base protocol handlers.
>> +	 */
>>
>> 	type = skb->protocol;
>> 	list_for_each_entry_rcu(ptype,
>> 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
>> -		if (ptype->type == type &&
>> -		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
>> +		if (ptype->type == type && !ptype->dev) {
>> 			if (pt_prev)
>> 				ret = deliver_skb(skb, pt_prev, orig_dev);
>> 			pt_prev = ptype;
>> --
>> 1.7.2.3
>>
>>
>>