[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date: Mon, 9 Sep 2013 16:15:11 -0700
From: Francesco Ruggeri <fruggeri@...stanetworks.com>
To: "David S. Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>,
Jiri Pirko <jiri@...nulli.us>,
Alexander Duyck <alexander.h.duyck@...el.com>,
Cong Wang <amwang@...hat.com>, netdev@...r.kernel.org
Cc: Francesco Ruggeri <fruggeri@...stanetworks.com>
Subject: [PATCH 1/1] net: race condition when removing virtual net_device
There is a race condition when removing a net_device while its net namespace
is also being removed.
This can result in a crash or other unpredictable behavior.
This is a sample scenario with veth, but the same applies to other virtual
devices such as vlan and macvlan.
veth pair v0-v1 is created, with v0 in namespace ns0 and v1 in ns1.
Process p0 deletes v0. v1 is also unregistered (in veth_dellink), so when p0
gets to netdev_run_todo both v0 and v1 are in net_todo_list, and they have both
been unlisted from their respective namespaces. If all references to v1 have not
already been dropped then netdev_run_todo/netdev_wait_allrefs will call netdev
notifiers for v1, releasing the rtnl lock between calls.
Now process p1 removes namespace ns1. v1 will not prevent this from happening
(in default_device_exit_batch) since it was already unlisted by p0.
Next time p0 invokes the notifiers for v1 any notifiers that use dev_net(v1)
will get a pointer to a namespace that has been or is being removed.
Similar scenarios apply with v1 as a vlan or macvlan interface and v0 as its
real device.
We hit this problem in 3.4 with sequence fib_netdev_event, fib_disable_ip,
rt_cache_flush, rt_cache_invalidate, inetpeer_invalidate_tree. That sequence no
longer applies in later kernels, but unless we can guarantee that no
NETDEV_UNREGISTER or NETDEV_UNREGISTER_FINAL handler accesses a net_device's
dev_net(dev) then there is a vulnerability (this happens for example with
NETDEV_UNREGISTER_FINAL in dst_dev_event/dst_ifdown).
Commit 0115e8e3 later made things better by reducing the chances of this
happening, but the underlying problem still seems to be there.
I would like to get some feedback on this patch.
The idea is to take a reference to the loopback_dev of a net_device's
namespace (which is always the last net_device to be removed when a namespace
is destroyed) when the net_device is unlisted, and release it when the
net_device is disposed of.
To avoid deadlocks in cleanup_net all loopback_devs are queued at the end
of net_todo_list.
Tested on Linux 3.4.
Signed-off-by: Francesco Ruggeri <fruggeri@...stanetworks.com>
---
net/core/dev.c | 30 +++++++++++++++++++++++++++++-
1 files changed, 29 insertions(+), 1 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 26755dd..da2fd78 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -225,10 +225,14 @@ static void list_netdevice(struct net_device *dev)
}
/* Device list removal
+ * Take a reference to dev_net(dev)->loopback_dev, so dev_net(dev)
+ * will not be freed until we are done with dev.
* caller must respect a RCU grace period before freeing/reusing dev
*/
static void unlist_netdevice(struct net_device *dev)
{
+ struct net_device *loopback_dev = dev_net(dev)->loopback_dev;
+
ASSERT_RTNL();
/* Unlink dev from the device chain */
@@ -238,9 +242,23 @@ static void unlist_netdevice(struct net_device *dev)
hlist_del_rcu(&dev->index_hlist);
write_unlock_bh(&dev_base_lock);
+ if (dev != loopback_dev)
+ dev_hold(loopback_dev);
+
dev_base_seq_inc(dev_net(dev));
}
+/**
+ * Called when a net_device that has been previously unlisted from a net
+ * namespace is disposed of.
+ */
+static inline void unlist_netdevice_done(struct net_device *dev)
+{
+ struct net_device *loopback_dev = dev_net(dev)->loopback_dev;
+ if (dev != loopback_dev)
+ dev_put(loopback_dev);
+}
+
/*
* Our notifier list
*/
@@ -5009,10 +5027,17 @@ static int dev_new_index(struct net *net)
/* Delayed registration/unregisteration */
static LIST_HEAD(net_todo_list);
+static struct list_head *first_loopback_dev = &net_todo_list;
static void net_set_todo(struct net_device *dev)
{
- list_add_tail(&dev->todo_list, &net_todo_list);
+ /* All loopback_devs go to end of net_todo_list. */
+ if (dev == dev_net(dev)->loopback_dev) {
+ list_add_tail(&dev->todo_list, &net_todo_list);
+ if (first_loopback_dev == &net_todo_list)
+ first_loopback_dev = &dev->todo_list;
+ } else
+ list_add_tail(&dev->todo_list, first_loopback_dev);
}
static void rollback_registered_many(struct list_head *head)
@@ -5641,6 +5666,7 @@ void netdev_run_todo(void)
/* Snapshot list, allow later requests */
list_replace_init(&net_todo_list, &list);
+ first_loopback_dev = &net_todo_list;
__rtnl_unlock();
@@ -5670,6 +5696,7 @@ void netdev_run_todo(void)
on_each_cpu(flush_backlog, dev, 1);
netdev_wait_allrefs(dev);
+ unlist_netdevice_done(dev);
/* paranoia */
BUG_ON(netdev_refcnt_read(dev));
@@ -6076,6 +6103,7 @@ int dev_change_net_namespace(struct net_device *dev, struct net *net, const char
kobject_uevent(&dev->dev.kobj, KOBJ_REMOVE);
/* Actually switch the network namespace */
+ unlist_netdevice_done(dev);
dev_net_set(dev, net);
/* If there is an ifindex conflict assign a new one */
--
1.7.4.4
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists