netdev - Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4FEAA149.5010306@redhat.com>
Date:	Wed, 27 Jun 2012 13:59:37 +0800
From:	Jason Wang <jasowang@...hat.com>
To:	"Michael S. Tsirkin" <mst@...hat.com>
CC:	habanero@...ux.vnet.ibm.com, netdev@...r.kernel.org,
	linux-kernel@...r.kernel.org, krkumar2@...ibm.com,
	tahm@...ux.vnet.ibm.com, akong@...hat.com, davem@...emloft.net,
	shemminger@...tta.com, mashirle@...ibm.com,
	Eric Dumazet <eric.dumazet@...il.com>
Subject: Re: [net-next RFC V3 PATCH 4/6] tuntap: multiqueue support

On 06/26/2012 07:54 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 26, 2012 at 01:52:57PM +0800, Jason Wang wrote:
>> On 06/25/2012 04:25 PM, Michael S. Tsirkin wrote:
>>> On Mon, Jun 25, 2012 at 02:10:18PM +0800, Jason Wang wrote:
>>>> This patch adds multiqueue support for tap device. This is done by abstracting
>>>> each queue as a file/socket and allowing multiple sockets to be attached to the
>>>> tuntap device (an array of tun_file were stored in the tun_struct). Userspace
>>>> could write and read from those files to do the parallel packet
>>>> sending/receiving.
>>>>
>>>> Unlike the previous single queue implementation, the socket and device were
>>>> loosely coupled, each of them were allowed to go away first. In order to let the
>>>> tx path lockless, netif_tx_loch_bh() is replaced by RCU/NETIF_F_LLTX to
>>>> synchronize between data path and system call.
>>> Don't use LLTX/RCU. It's not worth it.
>>> Use something like netif_set_real_num_tx_queues.
>>>
>> For LLTX, maybe it's better to convert it to alloc_netdev_mq() to
>> let the kernel see all queues and make the queue stopping and
>> per-queue stats eaiser.
>> RCU is used to handle the attaching/detaching when tun/tap is
>> sending and receiving packets which looks reasonalbe for me.
> Yes but do we have to allow this? How about we always ask
> userspace to attach to all active queues?

Attaching/detaching is a method to active/deactive a queue, if all 
queues were kept attached, then we need other method or flag to mark the 
queue as activateddeactived and still need to synchronize with data path.
>> Not
>> sure netif_set_real_num_tx_queues() can help in this situation.
> Check it out.
>
>>>> The tx queue selecting is first based on the recorded rxq index of an skb, it
>>>> there's no such one, then choosing based on rx hashing (skb_get_rxhash()).
>>>>
>>>> Signed-off-by: Jason Wang<jasowang@...hat.com>
>>> Interestingly macvtap switched to hashing first:
>>> ef0002b577b52941fb147128f30bd1ecfdd3ff6d
>>> (the commit log is corrupted but see what it
>>> does in the patch).
>>> Any idea why?
>>>
>>>> ---
>>>>   drivers/net/tun.c |  371 +++++++++++++++++++++++++++++++++--------------------
>>>>   1 files changed, 232 insertions(+), 139 deletions(-)
>>>>
>>>> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
>>>> index 8233b0a..5c26757 100644
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -107,6 +107,8 @@ struct tap_filter {
>>>>   	unsigned char	addr[FLT_EXACT_COUNT][ETH_ALEN];
>>>>   };
>>>>
>>>> +#define MAX_TAP_QUEUES (NR_CPUS<   16 ? NR_CPUS : 16)
>>> Why the limit? I am guessing you copied this from macvtap?
>>> This is problematic for a number of reasons:
>>> 	- will not play well with migration
>>> 	- will not work well for a large guest
>>>
>>> Yes, macvtap needs to be fixed too.
>>>
>>> I am guessing what it is trying to prevent is queueing
>>> up a huge number of packets?
>>> So just divide the default tx queue limit by the # of queues.
>> Not sure,
>> another reasons I can guess:
>> - to prevent storing a large array of pointers in tun_struct or macvlan_dev.
> OK so with the limit of e.g. 1024 we'd allocate at most
> 2 pages of memory. This doesn't look too bad. 1024 is probably a
> high enough limit: modern hypervisors seem to support on the order
> of 100-200 CPUs so this leaves us some breathing space
> if we want to match a queue per guest CPU.
> Of course we need to limit the packets per queue
> in such a setup more aggressively. 1000 packets * 1000 queues
> * 64K per packet is too much.
>
>> - it may not be suitable to allow the number of virtqueues greater
>> than the number of physical queues in the card
> Maybe for macvtap, here we have no idea which card we
> are working with and how many queues it has.
>
>>> And by the way, for MQ applications maybe we can finally
>>> ignore tx queue altogether and limit the total number
>>> of bytes queued?
>>> To avoid regressions we can make it large like 64M/# queues.
>>> Could be a separate patch I think, and for a single queue
>>> might need a compatible mode though I am not sure.
>> Could you explain more about this?
>> Did you mean to have a total
>> sndbuf for all sockets that attached to tun/tap?
> Consider that we currently limit the # of
> packets queued at tun for xmit to userspace.
> Some limit is needed but # of packets sounds
> very silly - limiting the total memory
> might be more reasonable.
>
> In case of multiqueue, we really care about
> total # of packets or total memory, but a simple
> approximation could be to divide the allocation
> between active queues equally.

A possible method is to divce the TUN_READQ_SIZE by #queues, but make it 
at least to be equal to the vring size (256).
>
> qdisc also queues some packets, that logic is
> using # of packets anyway. So either make that
> 1000/# queues, or even set to 0 as Eric once
> suggested.
>
>>>> +
>>>>   struct tun_file {
>>>>   	struct sock sk;
>>>>   	struct socket socket;
>>>> @@ -114,16 +116,18 @@ struct tun_file {
>>>>   	int vnet_hdr_sz;
>>>>   	struct tap_filter txflt;
>>>>   	atomic_t count;
>>>> -	struct tun_struct *tun;
>>>> +	struct tun_struct __rcu *tun;
>>>>   	struct net *net;
>>>>   	struct fasync_struct *fasync;
>>>>   	unsigned int flags;
>>>> +	u16 queue_index;
>>>>   };
>>>>
>>>>   struct tun_sock;
>>>>
>>>>   struct tun_struct {
>>>> -	struct tun_file		*tfile;
>>>> +	struct tun_file		*tfiles[MAX_TAP_QUEUES];
>>>> +	unsigned int            numqueues;
>>>>   	unsigned int 		flags;
>>>>   	uid_t			owner;
>>>>   	gid_t			group;
>>>> @@ -138,80 +142,159 @@ struct tun_struct {
>>>>   #endif
>>>>   };
>>>>
>>>> -static int tun_attach(struct tun_struct *tun, struct file *file)
>>>> +static DEFINE_SPINLOCK(tun_lock);
>>>> +
>>>> +/*
>>>> + * tun_get_queue(): calculate the queue index
>>>> + *     - if skbs comes from mq nics, we can just borrow
>>>> + *     - if not, calculate from the hash
>>>> + */
>>>> +static struct tun_file *tun_get_queue(struct net_device *dev,
>>>> +				      struct sk_buff *skb)
>>>>   {
>>>> -	struct tun_file *tfile = file->private_data;
>>>> -	int err;
>>>> +	struct tun_struct *tun = netdev_priv(dev);
>>>> +	struct tun_file *tfile = NULL;
>>>> +	int numqueues = tun->numqueues;
>>>> +	__u32 rxq;
>>>>
>>>> -	ASSERT_RTNL();
>>>> +	BUG_ON(!rcu_read_lock_held());
>>>>
>>>> -	netif_tx_lock_bh(tun->dev);
>>>> +	if (!numqueues)
>>>> +		goto out;
>>>>
>>>> -	err = -EINVAL;
>>>> -	if (tfile->tun)
>>>> +	if (numqueues == 1) {
>>>> +		tfile = rcu_dereference(tun->tfiles[0]);
>>> Instead of hacks like this, you can ask for an MQ
>>> flag to be set in SETIFF. Then you won't need to
>>> handle attach/detach at random times.
>> Consier user switch between a sq guest to mq guest, qemu would
>> attach or detach the fd which could not be expceted in kernel.
> Can't userspace keep it attached always, just deactivate MQ?
>
>>> And most of the scary num_queues checks can go away.
>> Even we has a MQ flag, userspace could still just attach one queue
>> to the device.
> I think we allow too much flexibility if we let
> userspace detach a random queue.

The point is to let tun/tap has the same flexibility as macvtap. Macvtap 
allows add/delete queues at any time and it's very easy to add 
detach/attach to macvtap. So we can easily use almost the same ioctls to 
active/deactive a queue at any time for both tap and macvtap.
> Maybe only allow attaching/detaching with MQ off?
> If userspace wants to attach/detach, clear MQ first?

Maybe I didn't understand the point here but I didn't advantages except 
more times of ioctl().
> Alternatively, attach/detach all queues in one ioctl?

Yes, it can be same one.
>
>>> You can then also ask userspace about the max # of queues
>>> to expect if you want to save some memory.
>>>
>> Yes, good suggestion.
>>>>   		goto out;
>>>> +	}
>>>>
>>>> -	err = -EBUSY;
>>>> -	if (tun->tfile)
>>>> +	if (likely(skb_rx_queue_recorded(skb))) {
>>>> +		rxq = skb_get_rx_queue(skb);
>>>> +
>>>> +		while (unlikely(rxq>= numqueues))
>>>> +			rxq -= numqueues;
>>>> +
>>>> +		tfile = rcu_dereference(tun->tfiles[rxq]);
>>>>   		goto out;
>>>> +	}
>>>>
>>>> -	err = 0;
>>>> -	tfile->tun = tun;
>>>> -	tun->tfile = tfile;
>>>> -	netif_carrier_on(tun->dev);
>>>> -	dev_hold(tun->dev);
>>>> -	sock_hold(&tfile->sk);
>>>> -	atomic_inc(&tfile->count);
>>>> +	/* Check if we can use flow to select a queue */
>>>> +	rxq = skb_get_rxhash(skb);
>>>> +	if (rxq) {
>>>> +		u32 idx = ((u64)rxq * numqueues)>>   32;
>>> This completely confuses me. What's the logic here?
>>> How do we even know it's in range?
>>>
>> rxq is a u32, so the result should be less than numqueues.
> Aha. So the point is to use multiply+shift instead of %?
> Please add a comment.
>

Yes sure.
>>>> +		tfile = rcu_dereference(tun->tfiles[idx]);
>>>> +		goto out;
>>>> +	}
>>>>
>>>> +	tfile = rcu_dereference(tun->tfiles[0]);
>>>>   out:
>>>> -	netif_tx_unlock_bh(tun->dev);
>>>> -	return err;
>>>> +	return tfile;
>>>>   }
>>>>
>>>> -static void __tun_detach(struct tun_struct *tun)
>>>> +static int tun_detach(struct tun_file *tfile, bool clean)
>>>>   {
>>>> -	struct tun_file *tfile = tun->tfile;
>>>> -	/* Detach from net device */
>>>> -	netif_tx_lock_bh(tun->dev);
>>>> -	netif_carrier_off(tun->dev);
>>>> -	tun->tfile = NULL;
>>>> -	netif_tx_unlock_bh(tun->dev);
>>>> -
>>>> -	/* Drop read queue */
>>>> -	skb_queue_purge(&tfile->socket.sk->sk_receive_queue);
>>>> -
>>>> -	/* Drop the extra count on the net device */
>>>> -	dev_put(tun->dev);
>>>> -}
>>>> +	struct tun_struct *tun;
>>>> +	struct net_device *dev = NULL;
>>>> +	bool destroy = false;
>>>>
>>>> -static void tun_detach(struct tun_struct *tun)
>>>> -{
>>>> -	rtnl_lock();
>>>> -	__tun_detach(tun);
>>>> -	rtnl_unlock();
>>>> -}
>>>> +	spin_lock(&tun_lock);
>>>>
>>>> -static struct tun_struct *__tun_get(struct tun_file *tfile)
>>>> -{
>>>> -	struct tun_struct *tun = NULL;
>>>> +	tun = rcu_dereference_protected(tfile->tun,
>>>> +					lockdep_is_held(&tun_lock));
>>>> +	if (tun) {
>>>> +		u16 index = tfile->queue_index;
>>>> +		BUG_ON(index>= tun->numqueues);
>>>> +		dev = tun->dev;
>>>> +
>>>> +		rcu_assign_pointer(tun->tfiles[index],
>>>> +				   tun->tfiles[tun->numqueues - 1]);
>>>> +		tun->tfiles[index]->queue_index = index;
>>>> +		rcu_assign_pointer(tfile->tun, NULL);
>>>> +		--tun->numqueues;
>>>> +		sock_put(&tfile->sk);
>>>>
>>>> -	if (atomic_inc_not_zero(&tfile->count))
>>>> -		tun = tfile->tun;
>>>> +		if (tun->numqueues == 0&&   !(tun->flags&   TUN_PERSIST))
>>>> +			destroy = true;
>>> Please don't use flags like that. Use dedicated labels and goto there on error.
>> ok.
>>>> +	}
>>>>
>>>> -	return tun;
>>>> +	spin_unlock(&tun_lock);
>>>> +
>>>> +	synchronize_rcu();
>>>> +	if (clean)
>>>> +		sock_put(&tfile->sk);
>>>> +
>>>> +	if (destroy) {
>>>> +		rtnl_lock();
>>>> +		if (dev->reg_state == NETREG_REGISTERED)
>>>> +			unregister_netdevice(dev);
>>>> +		rtnl_unlock();
>>>> +	}
>>>> +
>>>> +	return 0;
>>>>   }
>>>>
>>>> -static struct tun_struct *tun_get(struct file *file)
>>>> +static void tun_detach_all(struct net_device *dev)
>>>>   {
>>>> -	return __tun_get(file->private_data);
>>>> +	struct tun_struct *tun = netdev_priv(dev);
>>>> +	struct tun_file *tfile, *tfile_list[MAX_TAP_QUEUES];
>>>> +	int i, j = 0;
>>>> +
>>>> +	spin_lock(&tun_lock);
>>>> +
>>>> +	for (i = 0; i<   MAX_TAP_QUEUES&&   tun->numqueues; i++) {
>>>> +		tfile = rcu_dereference_protected(tun->tfiles[i],
>>>> +						lockdep_is_held(&tun_lock));
>>>> +		BUG_ON(!tfile);
>>>> +		wake_up_all(&tfile->wq.wait);
>>>> +		tfile_list[j++] = tfile;
>>>> +		rcu_assign_pointer(tfile->tun, NULL);
>>>> +		--tun->numqueues;
>>>> +	}
>>>> +	BUG_ON(tun->numqueues != 0);
>>>> +	/* guarantee that any future tun_attach will fail */
>>>> +	tun->numqueues = MAX_TAP_QUEUES;
>>>> +	spin_unlock(&tun_lock);
>>>> +
>>>> +	synchronize_rcu();
>>>> +	for (--j; j>= 0; j--)
>>>> +		sock_put(&tfile_list[j]->sk);
>>>>   }
>>>>
>>>> -static void tun_put(struct tun_struct *tun)
>>>> +static int tun_attach(struct tun_struct *tun, struct file *file)
>>>>   {
>>>> -	struct tun_file *tfile = tun->tfile;
>>>> +	struct tun_file *tfile = file->private_data;
>>>> +	int err;
>>>> +
>>>> +	ASSERT_RTNL();
>>>> +
>>>> +	spin_lock(&tun_lock);
>>>>
>>>> -	if (atomic_dec_and_test(&tfile->count))
>>>> -		tun_detach(tfile->tun);
>>>> +	err = -EINVAL;
>>>> +	if (rcu_dereference_protected(tfile->tun, lockdep_is_held(&tun_lock)))
>>>> +		goto out;
>>>> +
>>>> +	err = -EBUSY;
>>>> +	if (!(tun->flags&   TUN_TAP_MQ)&&   tun->numqueues == 1)
>>>> +		goto out;
>>>> +
>>>> +	if (tun->numqueues == MAX_TAP_QUEUES)
>>>> +		goto out;
>>>> +
>>>> +	err = 0;
>>>> +	tfile->queue_index = tun->numqueues;
>>>> +	rcu_assign_pointer(tfile->tun, tun);
>>>> +	rcu_assign_pointer(tun->tfiles[tun->numqueues], tfile);
>>>> +	sock_hold(&tfile->sk);
>>>> +	tun->numqueues++;
>>>> +
>>>> +	if (tun->numqueues == 1)
>>>> +		netif_carrier_on(tun->dev);
>>>> +
>>>> +	/* device is allowed to go away first, so no need to hold extra
>>>> +	 * refcnt. */
>>>> +
>>>> +out:
>>>> +	spin_unlock(&tun_lock);
>>>> +	return err;
>>>>   }
>>>>
>>>>   /* TAP filtering */
>>>> @@ -331,16 +414,7 @@ static const struct ethtool_ops tun_ethtool_ops;
>>>>   /* Net device detach from fd. */
>>>>   static void tun_net_uninit(struct net_device *dev)
>>>>   {
>>>> -	struct tun_struct *tun = netdev_priv(dev);
>>>> -	struct tun_file *tfile = tun->tfile;
>>>> -
>>>> -	/* Inform the methods they need to stop using the dev.
>>>> -	 */
>>>> -	if (tfile) {
>>>> -		wake_up_all(&tfile->wq.wait);
>>>> -		if (atomic_dec_and_test(&tfile->count))
>>>> -			__tun_detach(tun);
>>>> -	}
>>>> +	tun_detach_all(dev);
>>>>   }
>>>>
>>>>   /* Net device open. */
>>>> @@ -360,10 +434,10 @@ static int tun_net_close(struct net_device *dev)
>>>>   /* Net device start xmit */
>>>>   static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>   {
>>>> -	struct tun_struct *tun = netdev_priv(dev);
>>>> -	struct tun_file *tfile = tun->tfile;
>>>> +	struct tun_file *tfile = NULL;
>>>>
>>>> -	tun_debug(KERN_INFO, tun, "tun_net_xmit %d\n", skb->len);
>>>> +	rcu_read_lock();
>>>> +	tfile = tun_get_queue(dev, skb);
>>>>
>>>>   	/* Drop packet if interface is not attached */
>>>>   	if (!tfile)
>>>> @@ -381,7 +455,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>
>>>>   	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
>>>>   	>= dev->tx_queue_len) {
>>>> -		if (!(tun->flags&   TUN_ONE_QUEUE)) {
>>>> +		if (!(tfile->flags&   TUN_ONE_QUEUE)&&
>>> Which patch moved flags from tun to tfile?
>> Patch 1 cache the tun->flags in tfile, but it seems this may let the
>> flags out of sync. So we'd better to use the one in tun_struct.
>>>> +		    !(tfile->flags&   TUN_TAP_MQ)) {
>>>>   			/* Normal queueing mode. */
>>>>   			/* Packet scheduler handles dropping of further packets. */
>>>>   			netif_stop_queue(dev);
>>>> @@ -390,7 +465,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>   			 * error is more appropriate. */
>>>>   			dev->stats.tx_fifo_errors++;
>>>>   		} else {
>>>> -			/* Single queue mode.
>>>> +			/* Single queue mode or multi queue mode.
>>>>   			 * Driver handles dropping of all packets itself. */
>>> Please don't do this. Stop the queue on overrun as appropriate.
>>> ONE_QUEUE is a legacy hack.
>>>
>>> BTW we really should stop queue before we start dropping packets,
>>> but that can be a separate patch.
>> The problem here is the using of NETIF_F_LLTX. Kernel could only see
>> one queue even for a multiqueue tun/tap. If we use
>> netif_stop_queue(), all other queues would be stopped also.
> Another reason not to use LLTX?

Yes.
>>>>   			goto drop;
>>>>   		}
>>>> @@ -408,9 +483,11 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>>>   	wake_up_interruptible_poll(&tfile->wq.wait, POLLIN |
>>>>   				   POLLRDNORM | POLLRDBAND);
>>>> +	rcu_read_unlock();
>>>>   	return NETDEV_TX_OK;
>>>>
>>>>   drop:
>>>> +	rcu_read_unlock();
>>>>   	dev->stats.tx_dropped++;
>>>>   	kfree_skb(skb);
>>>>   	return NETDEV_TX_OK;
>>>> @@ -527,16 +604,22 @@ static void tun_net_init(struct net_device *dev)
>>>>   static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>>>   {
>>>>   	struct tun_file *tfile = file->private_data;
>>>> -	struct tun_struct *tun = __tun_get(tfile);
>>>> +	struct tun_struct *tun = NULL;
>>>>   	struct sock *sk;
>>>>   	unsigned int mask = 0;
>>>>
>>>> -	if (!tun)
>>>> +	if (!tfile)
>>>>   		return POLLERR;
>>>>
>>>> -	sk = tfile->socket.sk;
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>> +	if (!tun) {
>>>> +		rcu_read_unlock();
>>>> +		return POLLERR;
>>>> +	}
>>>> +	rcu_read_unlock();
>>>>
>>>> -	tun_debug(KERN_INFO, tun, "tun_chr_poll\n");
>>>> +	sk =&tfile->sk;
>>>>
>>>>   	poll_wait(file,&tfile->wq.wait, wait);
>>>>
>>>> @@ -548,10 +631,12 @@ static unsigned int tun_chr_poll(struct file *file, poll_table * wait)
>>>>   	     sock_writeable(sk)))
>>>>   		mask |= POLLOUT | POLLWRNORM;
>>>>
>>>> -	if (tun->dev->reg_state != NETREG_REGISTERED)
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>> +	if (!tun || tun->dev->reg_state != NETREG_REGISTERED)
>>>>   		mask = POLLERR;
>>>> +	rcu_read_unlock();
>>>>
>>>> -	tun_put(tun);
>>>>   	return mask;
>>>>   }
>>>>
>>>> @@ -708,9 +793,12 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>>>   		skb_shinfo(skb)->gso_segs = 0;
>>>>   	}
>>>>
>>>> -	tun = __tun_get(tfile);
>>>> -	if (!tun)
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>> +	if (!tun) {
>>>> +		rcu_read_unlock();
>>>>   		return -EBADFD;
>>>> +	}
>>>>
>>>>   	switch (tfile->flags&   TUN_TYPE_MASK) {
>>>>   	case TUN_TUN_DEV:
>>>> @@ -720,26 +808,30 @@ static ssize_t tun_get_user(struct tun_file *tfile,
>>>>   		skb->protocol = eth_type_trans(skb, tun->dev);
>>>>   		break;
>>>>   	}
>>>> -
>>>> -	netif_rx_ni(skb);
>>>>   	tun->dev->stats.rx_packets++;
>>>>   	tun->dev->stats.rx_bytes += len;
>>>> -	tun_put(tun);
>>>> +	rcu_read_unlock();
>>>> +
>>>> +	netif_rx_ni(skb);
>>>> +
>>>>   	return count;
>>>>
>>>>   err_free:
>>>>   	count = -EINVAL;
>>>>   	kfree_skb(skb);
>>>>   err:
>>>> -	tun = __tun_get(tfile);
>>>> -	if (!tun)
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>> +	if (!tun) {
>>>> +		rcu_read_unlock();
>>>>   		return -EBADFD;
>>>> +	}
>>>>
>>>>   	if (drop)
>>>>   		tun->dev->stats.rx_dropped++;
>>>>   	if (error)
>>>>   		tun->dev->stats.rx_frame_errors++;
>>>> -	tun_put(tun);
>>>> +	rcu_read_unlock();
>>>>   	return count;
>>>>   }
>>>>
>>>> @@ -833,12 +925,13 @@ static ssize_t tun_put_user(struct tun_file *tfile,
>>>>   	skb_copy_datagram_const_iovec(skb, 0, iv, total, len);
>>>>   	total += skb->len;
>>>>
>>>> -	tun = __tun_get(tfile);
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>>   	if (tun) {
>>>>   		tun->dev->stats.tx_packets++;
>>>>   		tun->dev->stats.tx_bytes += len;
>>>> -		tun_put(tun);
>>>>   	}
>>>> +	rcu_read_unlock();
>>>>
>>>>   	return total;
>>>>   }
>>>> @@ -869,28 +962,31 @@ static ssize_t tun_do_read(struct tun_file *tfile,
>>>>   				break;
>>>>   			}
>>>>
>>>> -			tun = __tun_get(tfile);
>>>> +			rcu_read_lock();
>>>> +			tun = rcu_dereference(tfile->tun);
>>>>   			if (!tun) {
>>>> -				ret = -EIO;
>>>> +				ret = -EBADFD;
>>> BADFD is for when you get passed something like -1 fd.
>>> Here fd is OK, it's just in a bad state so you can not do IO.
>>>
>> Sure.
>>>> +				rcu_read_unlock();
>>>>   				break;
>>>>   			}
>>>>   			if (tun->dev->reg_state != NETREG_REGISTERED) {
>>>>   				ret = -EIO;
>>>> -				tun_put(tun);
>>>> +				rcu_read_unlock();
>>>>   				break;
>>>>   			}
>>>> -			tun_put(tun);
>>>> +			rcu_read_unlock();
>>>>
>>>>   			/* Nothing to read, let's sleep */
>>>>   			schedule();
>>>>   			continue;
>>>>   		}
>>>>
>>>> -		tun = __tun_get(tfile);
>>>> +		rcu_read_lock();
>>>> +		tun = rcu_dereference(tfile->tun);
>>>>   		if (tun) {
>>>>   			netif_wake_queue(tun->dev);
>>>> -			tun_put(tun);
>>>>   		}
>>>> +		rcu_read_unlock();
>>>>
>>>>   		ret = tun_put_user(tfile, skb, iv, len);
>>>>   		kfree_skb(skb);
>>>> @@ -1038,6 +1134,9 @@ static int tun_flags(struct tun_struct *tun)
>>>>   	if (tun->flags&   TUN_VNET_HDR)
>>>>   		flags |= IFF_VNET_HDR;
>>>>
>>>> +	if (tun->flags&   TUN_TAP_MQ)
>>>> +		flags |= IFF_MULTI_QUEUE;
>>>> +
>>>>   	return flags;
>>>>   }
>>>>
>>>> @@ -1097,8 +1196,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>   		err = tun_attach(tun, file);
>>>>   		if (err<   0)
>>>>   			return err;
>>>> -	}
>>>> -	else {
>>>> +	} else {
>>>>   		char *name;
>>>>   		unsigned long flags = 0;
>>>>
>>>> @@ -1142,6 +1240,8 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>   		dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>>>   			TUN_USER_FEATURES;
>>>>   		dev->features = dev->hw_features;
>>>> +		if (ifr->ifr_flags&   IFF_MULTI_QUEUE)
>>>> +			dev->features |= NETIF_F_LLTX;
>>>>
>>>>   		err = register_netdevice(tun->dev);
>>>>   		if (err<   0)
>>>> @@ -1154,7 +1254,7 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>
>>>>   		err = tun_attach(tun, file);
>>>>   		if (err<   0)
>>>> -			goto failed;
>>>> +			goto err_free_dev;
>>>>   	}
>>>>
>>>>   	tun_debug(KERN_INFO, tun, "tun_set_iff\n");
>>>> @@ -1174,6 +1274,11 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>   	else
>>>>   		tun->flags&= ~TUN_VNET_HDR;
>>>>
>>>> +	if (ifr->ifr_flags&   IFF_MULTI_QUEUE)
>>>> +		tun->flags |= TUN_TAP_MQ;
>>>> +	else
>>>> +		tun->flags&= ~TUN_TAP_MQ;
>>>> +
>>>>   	/* Cache flags from tun device */
>>>>   	tfile->flags = tun->flags;
>>>>   	/* Make sure persistent devices do not get stuck in
>>>> @@ -1187,7 +1292,6 @@ static int tun_set_iff(struct net *net, struct file *file, struct ifreq *ifr)
>>>>
>>>>   err_free_dev:
>>>>   	free_netdev(dev);
>>>> -failed:
>>>>   	return err;
>>>>   }
>>>>
>>>> @@ -1264,38 +1368,40 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>   				(unsigned int __user*)argp);
>>>>   	}
>>>>
>>>> -	rtnl_lock();
>>>> -
>>>> -	tun = __tun_get(tfile);
>>>> -	if (cmd == TUNSETIFF&&   !tun) {
>>>> +	ret = 0;
>>>> +	if (cmd == TUNSETIFF) {
>>>> +		rtnl_lock();
>>>>   		ifr.ifr_name[IFNAMSIZ-1] = '\0';
>>>> -
>>>>   		ret = tun_set_iff(tfile->net, file,&ifr);
>>>> -
>>>> +		rtnl_unlock();
>>>>   		if (ret)
>>>> -			goto unlock;
>>>> -
>>>> +			return ret;
>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>> -			ret = -EFAULT;
>>>> -		goto unlock;
>>>> +			return -EFAULT;
>>>> +		return ret;
>>>>   	}
>>>>
>>>> +	rtnl_lock();
>>>> +
>>>> +	rcu_read_lock();
>>>> +
>>>>   	ret = -EBADFD;
>>>> +	tun = rcu_dereference(tfile->tun);
>>>>   	if (!tun)
>>>>   		goto unlock;
>>>> +	else
>>>> +		ret = 0;
>>>>
>>>> -	tun_debug(KERN_INFO, tun, "tun_chr_ioctl cmd %d\n", cmd);
>>>> -
>>>> -	ret = 0;
>>>>   	switch (cmd) {
>>>>   	case TUNGETIFF:
>>>>   		ret = tun_get_iff(current->nsproxy->net_ns, tun,&ifr);
>>>> +		rcu_read_unlock();
>>>>   		if (ret)
>>>> -			break;
>>>> +			goto out;
>>>>
>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>>   			ret = -EFAULT;
>>>> -		break;
>>>> +		goto out;
>>>>
>>>>   	case TUNSETNOCSUM:
>>>>   		/* Disable/Enable checksum */
>>>> @@ -1357,9 +1463,10 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>   		/* Get hw address */
>>>>   		memcpy(ifr.ifr_hwaddr.sa_data, tun->dev->dev_addr, ETH_ALEN);
>>>>   		ifr.ifr_hwaddr.sa_family = tun->dev->type;
>>>> +		rcu_read_unlock();
>>>>   		if (copy_to_user(argp,&ifr, ifreq_len))
>>>>   			ret = -EFAULT;
>>>> -		break;
>>>> +		goto out;
>>>>
>>>>   	case SIOCSIFHWADDR:
>>>>   		/* Set hw address */
>>>> @@ -1375,9 +1482,9 @@ static long __tun_chr_ioctl(struct file *file, unsigned int cmd,
>>>>   	}
>>>>
>>>>   unlock:
>>>> +	rcu_read_unlock();
>>>> +out:
>>>>   	rtnl_unlock();
>>>> -	if (tun)
>>>> -		tun_put(tun);
>>>>   	return ret;
>>>>   }
>>>>
>>>> @@ -1517,6 +1624,11 @@ out:
>>>>   	return ret;
>>>>   }
>>>>
>>>> +static void tun_sock_destruct(struct sock *sk)
>>>> +{
>>>> +	skb_queue_purge(&sk->sk_receive_queue);
>>>> +}
>>>> +
>>>>   static int tun_chr_open(struct inode *inode, struct file * file)
>>>>   {
>>>>   	struct net *net = current->nsproxy->net_ns;
>>>> @@ -1540,6 +1652,7 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>>>   	sock_init_data(&tfile->socket,&tfile->sk);
>>>>
>>>>   	tfile->sk.sk_write_space = tun_sock_write_space;
>>>> +	tfile->sk.sk_destruct = tun_sock_destruct;
>>>>   	tfile->sk.sk_sndbuf = INT_MAX;
>>>>   	file->private_data = tfile;
>>>>
>>>> @@ -1549,31 +1662,8 @@ static int tun_chr_open(struct inode *inode, struct file * file)
>>>>   static int tun_chr_close(struct inode *inode, struct file *file)
>>>>   {
>>>>   	struct tun_file *tfile = file->private_data;
>>>> -	struct tun_struct *tun;
>>>> -
>>>> -	tun = __tun_get(tfile);
>>>> -	if (tun) {
>>>> -		struct net_device *dev = tun->dev;
>>>> -
>>>> -		tun_debug(KERN_INFO, tun, "tun_chr_close\n");
>>>> -
>>>> -		__tun_detach(tun);
>>>> -
>>>> -		/* If desirable, unregister the netdevice. */
>>>> -		if (!(tun->flags&   TUN_PERSIST)) {
>>>> -			rtnl_lock();
>>>> -			if (dev->reg_state == NETREG_REGISTERED)
>>>> -				unregister_netdevice(dev);
>>>> -			rtnl_unlock();
>>>> -		}
>>>>
>>>> -		/* drop the reference that netdevice holds */
>>>> -		sock_put(&tfile->sk);
>>>> -
>>>> -	}
>>>> -
>>>> -	/* drop the reference that file holds */
>>>> -	sock_put(&tfile->sk);
>>>> +	tun_detach(tfile, true);
>>>>
>>>>   	return 0;
>>>>   }
>>>> @@ -1700,14 +1790,17 @@ static void tun_cleanup(void)
>>>>    * holding a reference to the file for as long as the socket is in use. */
>>>>   struct socket *tun_get_socket(struct file *file)
>>>>   {
>>>> -	struct tun_struct *tun;
>>>> +	struct tun_struct *tun = NULL;
>>>>   	struct tun_file *tfile = file->private_data;
>>>>   	if (file->f_op !=&tun_fops)
>>>>   		return ERR_PTR(-EINVAL);
>>>> -	tun = tun_get(file);
>>>> -	if (!tun)
>>>> +	rcu_read_lock();
>>>> +	tun = rcu_dereference(tfile->tun);
>>>> +	if (!tun) {
>>>> +		rcu_read_unlock();
>>>>   		return ERR_PTR(-EBADFD);
>>>> -	tun_put(tun);
>>>> +	}
>>>> +	rcu_read_unlock();
>>>>   	return&tfile->socket;
>>>>   }
>>>>   EXPORT_SYMBOL_GPL(tun_get_socket);

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html