[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bb595640-0597-4d18-a9e1-f6eb8e6bb50e@I-love.SAKURA.ne.jp>
Date: Fri, 22 Aug 2025 19:23:48 +0900
From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To: Oleksij Rempel <o.rempel@...gutronix.de>
Cc: Robin van der Gracht <robin@...tonic.nl>, kernel@...gutronix.de,
Oliver Hartkopp <socketcan@...tkopp.net>,
Marc Kleine-Budde <mkl@...gutronix.de>, linux-can@...r.kernel.org,
LKML <linux-kernel@...r.kernel.org>,
Network Development <netdev@...r.kernel.org>
Subject: Re: can/j1939: hung inside rtnl_dellink()
(Adding netdev ML to ask for hints from different network protocols...)
On 2025/08/22 18:51, Oleksij Rempel wrote:
> Hello Tetsuo,
>
> On Sat, Aug 16, 2025 at 03:51:54PM +0900, Tetsuo Handa wrote:
>> Hello.
>>
>> I made a minimized C reproducer for
>>
>> unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
>>
>> problem at https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 , and
>> obtained some data using debug printk() patch. It seems that the cause is
>> net/can/j1939/ does not handle NETDEV_UNREGISTER notification
>> while net/can/j1939/ can directly call rtnl_dellink() via sendmsg().
>
> Sorry for long delay and than you for your investigation!
>
>> The minimized C reproducer is shown below.
> ....
>
>> Therefore, I guess that either
>>
>> j1939_netdev_notify() is handling NETDEV_UNREGISTER notification
Oops. I wanted to write
j1939_netdev_notify() is *not* handling NETDEV_UNREGISTER notification
>>
>> or
>>
>> rtnl_dellink() can be called via sendmsg() despite the j1939 socket
>> are in use
>>
>> is wrong. How to fix this problem?
>
> I assume the first variant is correct. Can you please test following change:
> --- a/net/can/j1939/main.c
> +++ b/net/can/j1939/main.c
> @@ -370,6 +370,7 @@
> goto notify_done;
>
> switch (msg) {
> + case NETDEV_UNREGISTER:
> case NETDEV_DOWN:
> j1939_cancel_active_session(priv, NULL);
> j1939_sk_netdev_event_netdown(priv);
>
Such change is not sufficient.
As far as I tested, the only way that can drop the refcount to 1 is to
call j1939_sk_release() (which involves sock_put()) on all j1939 sockets
(i.e. something like shown below).
diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h
@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv);
/* notify/alert all j1939 sockets bound to ifindex */
void j1939_sk_netdev_event_netdown(struct j1939_priv *priv);
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv);
int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk);
void j1939_tp_init(struct j1939_priv *priv);
diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..e568b5928a39 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c
@@ -377,6 +377,11 @@ static int j1939_netdev_notify(struct notifier_block *nb,
j1939_sk_netdev_event_netdown(priv);
j1939_ecu_unmap_all(priv);
break;
+ case NETDEV_UNREGISTER:
+ pr_info("NETDEV_UNREGISTER notification on %px start\n", ndev);
+ j1939_sk_netdev_event_unregister(priv);
+ pr_info("NETDEV_UNREGISTER notification on %px end\n", ndev);
+ break;
}
j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 3d8b588822f9..4e53a1b10907 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1300,6 +1313,24 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
read_unlock_bh(&priv->j1939_socks_lock);
}
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
+{
+ struct j1939_sock *jsk;
+ struct socket sock = { };
+
+ rescan:
+ read_lock_bh(&priv->j1939_socks_lock);
+ list_for_each_entry(jsk, &priv->j1939_socks, list) {
+ read_unlock_bh(&priv->j1939_socks_lock);
+ pr_info("Releasing %px\n", &jsk->sk);
+ sock.sk = &jsk->sk;
+ //sock_hold(&jsk->sk);
+ j1939_sk_release(&sock);
+ goto rescan;
+ }
+ read_unlock_bh(&priv->j1939_socks_lock);
+}
+
static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd,
unsigned long arg)
{
Of course, calling sk_j1939_sk_release() upon NETDEV_UNREGISTER event
causes refcount underflow bug. But calling sock_hold() before calling
j1939_sk_release() prevents the refcount from dropping to 1. :-(
I think we need to somehow make it possible to logically close j1939
sockets without actually closing. Maybe something like
"struct in_device"->dead flag which is set by inetdev_destroy() upon
NETDEV_UNREGISTER event is needed by j1939 sockets...
My build environment is very slow (testing on VMWare on a Windows PC).
Running my simplified reproducer on your build environment would be
much faster.
Powered by blists - more mailing lists