[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z_6Yg5UjNRLQA9mH@mini-arch>
Date: Tue, 15 Apr 2025 10:33:55 -0700
From: Stanislav Fomichev <stfomichev@...il.com>
To: Taehee Yoo <ap420073@...il.com>
Cc: davem@...emloft.net, kuba@...nel.org, pabeni@...hat.com,
edumazet@...gle.com, andrew+netdev@...n.ch, horms@...nel.org,
almasrymina@...gle.com, asml.silence@...il.com, dw@...idwei.uk,
sdf@...ichev.me, skhawaja@...gle.com, simona.vetter@...ll.ch,
kaiyuanz@...gle.com, netdev@...r.kernel.org
Subject: Re: [PATCH net] net: devmem: fix kernel panic when socket close
after module unload
On 04/15, Taehee Yoo wrote:
> Kernel panic occurs when a devmem TCP socket is closed after NIC module
> is unloaded.
>
> This is Devmem TCP unregistration scenarios. number is an order.
> (a)socket close (b)pp destroy (c)uninstall result
> 1 2 3 OK
> 1 3 2 (d)Impossible
> 2 1 3 OK
> 3 1 2 (e)Kernel panic
> 2 3 1 (d)Impossible
> 3 2 1 (d)Impossible
>
> (a) netdev_nl_sock_priv_destroy() is called when devmem TCP socket is
> closed.
> (b) page_pool_destroy() is called when the interface is down.
> (c) mp_ops->uninstall() is called when an interface is unregistered.
> (d) There is no scenario in mp_ops->uninstall() is called before
> page_pool_destroy().
> Because unregister_netdevice_many_notify() closes interfaces first
> and then calls mp_ops->uninstall().
> (e) netdev_nl_sock_priv_destroy() accesses struct net_device.
> But if the interface module has already been removed, net_device
> pointer is invalid, so it causes kernel panic.
>
> In summary, there are only 3 possible scenarios.
> A. sk close -> pp destroy -> uninstall.
> B. pp destroy -> sk close -> uninstall.
> C. pp destroy -> uninstall -> sk close.
>
> Case C is a kernel panic scenario.
>
> In order to fix this problem, it makes netdev_nl_sock_priv_destroy() do
> nothing if a module is already removed.
> The mp_ops->uninstall() handles these instead.
>
> The netdev_nl_sock_priv_destroy() iterates binding->list and releases
> them all with net_devmem_unbind_dmabuf().
> The net_devmem_unbind_dmabuf() has the below steps.
> 1. Delete binding from a list.
[..]
> 2. Call _net_mp_close_rxq() for all rxq's bound to a binding.
This should not happen in the (C) case, right? mp_dmabuf_devmem_uninstall
removes the rxq from the xa. Can you explain more on why specifically
the crash occurs? Are we missing some check in netdev_nl_sock_priv_destroy
that makes sure the device is still alive?
Powered by blists - more mailing lists