Message-ID: <CAHS8izNmrrO=q4vqGJ+mAQg52s3KqBXjdbrf=AgCUpHpS4oB7w@mail.gmail.com>
Date: Wed, 7 May 2025 11:28:14 -0700
From: Mina Almasry <almasrymina@...gle.com>
To: Taehee Yoo <ap420073@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, andrew+netdev@...n.ch, horms@...nel.org, sdf@...ichev.me,
netdev@...r.kernel.org, asml.silence@...il.com, dw@...idwei.uk,
skhawaja@...gle.com, willemb@...gle.com, jdamato@...tly.com
Subject: Re: [PATCH net v2] net: devmem: fix kernel panic when socket close
after module unload
On Tue, May 6, 2025 at 7:09 AM Taehee Yoo <ap420073@...il.com> wrote:
>
> Kernel panic occurs when a devmem TCP socket is closed after NIC module
> is unloaded.
>
> These are the devmem TCP unregistration scenarios; the number indicates
> the order of each step.
> (a)socket close
Let's call this "netlink socket close", to differentiate between the
netlink and data sockets.
> (b)pp destroy
> (c)uninstall result
> 1 2 3 OK
> 1 3 2 (d)Impossible
> 2 1 3 OK
> 3 1 2 (e)Kernel panic
> 2 3 1 (d)Impossible
> 3 2 1 (d)Impossible
>
> (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
> closed.
> (b) page_pool_destroy() is called when the interface is down.
> (c) mp_ops->uninstall() is called when an interface is unregistered.
> (d) There is no scenario in which mp_ops->uninstall() is called before
> page_pool_destroy(), because unregister_netdevice_many_notify()
> closes interfaces first and then calls mp_ops->uninstall().
> (e) netdev_nl_sock_priv_destroy() accesses struct net_device to acquire
> netdev_lock().
> But if the interface module has already been removed, the net_device
> pointer is invalid, which causes a kernel panic.
>
> In summary, there are only 3 possible scenarios.
> A. sk close -> pp destroy -> uninstall.
> B. pp destroy -> sk close -> uninstall.
> C. pp destroy -> uninstall -> sk close.
>
> Case C is a kernel panic scenario.
>
> In order to fix this problem, make mp_dmabuf_devmem_uninstall() set
> binding->dev to NULL, which indicates that the bound net_device was
> unregistered.
>
> Then make netdev_nl_sock_priv_destroy() not acquire netdev_lock()
> if binding->dev is NULL.
>
> This inverts the socket/netdev lock order as shown below:
> netdev_lock();
> mutex_lock(&priv->lock);
> mutex_unlock(&priv->lock);
> netdev_unlock();
>
> Because of this inverted lock ordering, mp_dmabuf_devmem_uninstall()
> now acquires the socket lock as well.
>
> Tests:
> Scenario A:
> ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> -v 7 -t 1 -q 1 &
> pid=$!
> sleep 10
> kill $pid
> ip link set $interface down
> modprobe -rv $module
>
> Scenario B:
> ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> -v 7 -t 1 -q 1 &
> pid=$!
> sleep 10
> ip link set $interface down
> kill $pid
> modprobe -rv $module
>
> Scenario C:
> ./ncdevmem -s 192.168.1.4 -c 192.168.1.2 -f $interface -l -p 8000 \
> -v 7 -t 1 -q 1 &
> pid=$!
> sleep 10
> modprobe -rv $module
> sleep 5
> kill $pid
>
> Splat looks like:
> Oops: general protection fault, probably for non-canonical address 0xdffffc001fffa9f7: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN NOPTI
> KASAN: probably user-memory-access in range [0x00000000fffd4fb8-0x00000000fffd4fbf]
> CPU: 0 UID: 0 PID: 2041 Comm: ncdevmem Tainted: G B W 6.15.0-rc1+ #2 PREEMPT(undef) 0947ec89efa0fd68838b78e36aa1617e97ff5d7f
> Tainted: [B]=BAD_PAGE, [W]=WARN
> RIP: 0010:__mutex_lock (./include/linux/sched.h:2244 kernel/locking/mutex.c:400 kernel/locking/mutex.c:443 kernel/locking/mutex.c:605 kernel/locking/mutex.c:746)
> Code: ea 03 80 3c 02 00 0f 85 4f 13 00 00 49 8b 1e 48 83 e3 f8 74 6a 48 b8 00 00 00 00 00 fc ff df 48 8d 7b 34 48 89 fa 48 c1 ea 03 <0f> b6 f
> RSP: 0018:ffff88826f7ef730 EFLAGS: 00010203
> RAX: dffffc0000000000 RBX: 00000000fffd4f88 RCX: ffffffffaa9bc811
> RDX: 000000001fffa9f7 RSI: 0000000000000008 RDI: 00000000fffd4fbc
> RBP: ffff88826f7ef8b0 R08: 0000000000000000 R09: ffffed103e6aa1a4
> R10: 0000000000000007 R11: ffff88826f7ef442 R12: fffffbfff669f65e
> R13: ffff88812a830040 R14: ffff8881f3550d20 R15: 00000000fffd4f88
> FS: 0000000000000000(0000) GS:ffff888866c05000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000563bed0cb288 CR3: 00000001a7c98000 CR4: 00000000007506f0
> PKRU: 55555554
> Call Trace:
> <TASK>
> ...
> netdev_nl_sock_priv_destroy (net/core/netdev-genl.c:953 (discriminator 3))
> genl_release (net/netlink/genetlink.c:653 net/netlink/genetlink.c:694 net/netlink/genetlink.c:705)
> ...
> netlink_release (net/netlink/af_netlink.c:737)
> ...
> __sock_release (net/socket.c:647)
> sock_close (net/socket.c:1393)
>
> Fixes: 1d22d3060b9b ("net: drop rtnl_lock for queue_mgmt operations")
> Signed-off-by: Taehee Yoo <ap420073@...il.com>
> ---
>
> v2:
> - Fix commit message.
> - Correct Fixes tag.
> - Inverse locking order.
> - Do not put a reference count of binding in
> mp_dmabuf_devmem_uninstall().
>
> To test this patch, the driver-side implementation of devmem TCP [1]
> needs to be applied first.
>
> [1] https://lore.kernel.org/netdev/20250415052458.1260575-1-ap420073@gmail.com/T/#u
>
> net/core/devmem.c | 6 ++++++
> net/core/devmem.h | 3 +++
> net/core/netdev-genl.c | 27 ++++++++++++++++++---------
> 3 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/net/core/devmem.c b/net/core/devmem.c
> index 6e27a47d0493..636c1e82b8da 100644
> --- a/net/core/devmem.c
> +++ b/net/core/devmem.c
> @@ -167,6 +167,7 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
>
> struct net_devmem_dmabuf_binding *
> net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> + struct netdev_nl_sock *priv,
> struct netlink_ext_ack *extack)
> {
> struct net_devmem_dmabuf_binding *binding;
> @@ -189,6 +190,7 @@ net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> }
>
> binding->dev = dev;
> + binding->priv = priv;
>
> err = xa_alloc_cyclic(&net_devmem_dmabuf_bindings, &binding->id,
> binding, xa_limit_32b, &id_alloc_next,
> @@ -376,12 +378,16 @@ static void mp_dmabuf_devmem_uninstall(void *mp_priv,
> struct netdev_rx_queue *bound_rxq;
> unsigned long xa_idx;
>
> + mutex_lock(&binding->priv->lock);
> xa_for_each(&binding->bound_rxqs, xa_idx, bound_rxq) {
> if (bound_rxq == rxq) {
> xa_erase(&binding->bound_rxqs, xa_idx);
> + if (xa_empty(&binding->bound_rxqs))
> + binding->dev = NULL;
priv->lock is meant to protect priv->bindings only. To be honest, I
find using priv->lock to protect both priv->bindings and binding->dev
extremely confusing. I think it may be contributing to the convoluted
lock ordering in netdev_nl_sock_priv_destroy. It also means the
binding needs to hold a reference to the netlink socket that owns it,
which is mentally confusing to me. The binding should, abstractly
speaking, not need to know anything about the netlink socket that
holds it. Also, AFAICT this may be buggy: the binding may outlive the
netlink socket. For example, when the netlink socket is closed but the
binding is kept alive because there are references to the netmems in
it, a UAF may happen.
Instead of this, if you need to protect concurrent access to
binding->dev, either:
1. create a new spinlock, binding->dev_lock, which protects access to
binding->dev, or
2. use RCU protection to locklessly set binding->dev, or
3. use some cmpxchg logic to locklessly set/query binding->dev and
detect if it got modified in a racing thread.
But please do not multiplex priv->lock to protect both priv->bindings
and binding->dev.
> break;
> }
> }
> + mutex_unlock(&binding->priv->lock);
> }
>
> static const struct memory_provider_ops dmabuf_devmem_ops = {
> diff --git a/net/core/devmem.h b/net/core/devmem.h
> index 7fc158d52729..afd6320b2c9b 100644
> --- a/net/core/devmem.h
> +++ b/net/core/devmem.h
> @@ -11,6 +11,7 @@
> #define _NET_DEVMEM_H
>
> #include <net/netmem.h>
> +#include <net/netdev_netlink.h>
>
> struct netlink_ext_ack;
>
> @@ -20,6 +21,7 @@ struct net_devmem_dmabuf_binding {
> struct sg_table *sgt;
> struct net_device *dev;
> struct gen_pool *chunk_pool;
> + struct netdev_nl_sock *priv;
>
> /* The user holds a ref (via the netlink API) for as long as they want
> * the binding to remain alive. Each page pool using this binding holds
> @@ -63,6 +65,7 @@ struct dmabuf_genpool_chunk_owner {
> void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding);
> struct net_devmem_dmabuf_binding *
> net_devmem_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd,
> + struct netdev_nl_sock *priv,
> struct netlink_ext_ack *extack);
> void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding);
> int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx,
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 230743bdbb14..b8bb73574276 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -859,13 +859,11 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> goto err_genlmsg_free;
> }
>
> - mutex_lock(&priv->lock);
> -
> err = 0;
> netdev = netdev_get_by_index_lock(genl_info_net(info), ifindex);
> if (!netdev) {
> err = -ENODEV;
> - goto err_unlock_sock;
> + goto err_genlmsg_free;
> }
> if (!netif_device_present(netdev))
> err = -ENODEV;
> @@ -877,10 +875,11 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> goto err_unlock;
> }
>
> - binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, info->extack);
> + mutex_lock(&priv->lock);
> + binding = net_devmem_bind_dmabuf(netdev, dmabuf_fd, priv, info->extack);
I'm not sure about this either: priv->lock should protect access to
priv->bindings only. Why is it taken before net_devmem_bind_dmabuf()?
Can it be locked around the access to priv->bindings only?
> if (IS_ERR(binding)) {
> err = PTR_ERR(binding);
> - goto err_unlock;
> + goto err_unlock_sock;
> }
>
> nla_for_each_attr_type(attr, NETDEV_A_DMABUF_QUEUES,
> @@ -921,18 +920,17 @@ int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info)
> if (err)
> goto err_unbind;
>
> - netdev_unlock(netdev);
> -
> mutex_unlock(&priv->lock);
> + netdev_unlock(netdev);
>
> return 0;
>
> err_unbind:
> net_devmem_unbind_dmabuf(binding);
> -err_unlock:
> - netdev_unlock(netdev);
> err_unlock_sock:
> mutex_unlock(&priv->lock);
> +err_unlock:
> + netdev_unlock(netdev);
> err_genlmsg_free:
> nlmsg_free(rsp);
> return err;
> @@ -948,14 +946,25 @@ void netdev_nl_sock_priv_destroy(struct netdev_nl_sock *priv)
> {
> struct net_devmem_dmabuf_binding *binding;
> struct net_devmem_dmabuf_binding *temp;
> + netdevice_tracker dev_tracker;
> struct net_device *dev;
>
> mutex_lock(&priv->lock);
> list_for_each_entry_safe(binding, temp, &priv->bindings, list) {
> dev = binding->dev;
> + if (!dev) {
> + net_devmem_unbind_dmabuf(binding);
> + continue;
> + }
> + netdev_hold(dev, &dev_tracker, GFP_KERNEL);
> + mutex_unlock(&priv->lock);
> netdev_lock(dev);
> + mutex_lock(&priv->lock);
> net_devmem_unbind_dmabuf(binding);
> + mutex_unlock(&priv->lock);
> netdev_unlock(dev);
> + netdev_put(dev, &dev_tracker);
> + mutex_lock(&priv->lock);
> }
> mutex_unlock(&priv->lock);
> }
> --
> 2.34.1
>
--
Thanks,
Mina