lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <bb595640-0597-4d18-a9e1-f6eb8e6bb50e@I-love.SAKURA.ne.jp>
Date: Fri, 22 Aug 2025 19:23:48 +0900
From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To: Oleksij Rempel <o.rempel@...gutronix.de>
Cc: Robin van der Gracht <robin@...tonic.nl>, kernel@...gutronix.de,
        Oliver Hartkopp <socketcan@...tkopp.net>,
        Marc Kleine-Budde <mkl@...gutronix.de>, linux-can@...r.kernel.org,
        LKML <linux-kernel@...r.kernel.org>,
        Network Development <netdev@...r.kernel.org>
Subject: Re: can/j1939: hung inside rtnl_dellink()

(Adding netdev ML to ask for hints from different network protocols...)

On 2025/08/22 18:51, Oleksij Rempel wrote:
> Hello Tetsuo,
> 
> On Sat, Aug 16, 2025 at 03:51:54PM +0900, Tetsuo Handa wrote:
>> Hello.
>>
>> I made a minimized C reproducer for
>>
>>   unregister_netdevice: waiting for vcan0 to become free. Usage count = 2
>>
>> problem at https://syzkaller.appspot.com/bug?extid=881d65229ca4f9ae8c84 , and
>> obtained some data using debug printk() patch. It seems that the cause is
>> net/can/j1939/ does not handle NETDEV_UNREGISTER notification
>> while net/can/j1939/ can directly call rtnl_dellink() via sendmsg().
> 
> Sorry for long delay and than you for your investigation!
> 
>> The minimized C reproducer is shown below.
> ....
> 
>> Therefore, I guess that either
>>
>>   j1939_netdev_notify() is handling NETDEV_UNREGISTER notification

Oops. I wanted to write

     j1939_netdev_notify() is *not* handling NETDEV_UNREGISTER notification

>>
>> or
>>
>>   rtnl_dellink() can be called via sendmsg() despite the j1939 socket
>>   are in use
>>
>> is wrong. How to fix this problem?
> 
> I assume the first variant is correct. Can you please test following change:
> --- a/net/can/j1939/main.c
> +++ b/net/can/j1939/main.c
> @@ -370,6 +370,7 @@
>  		goto notify_done;
>  
>  	switch (msg) {
> +	case NETDEV_UNREGISTER:
>  	case NETDEV_DOWN:
>  		j1939_cancel_active_session(priv, NULL);
>  		j1939_sk_netdev_event_netdown(priv);
> 

Such change is not sufficient.

As far as I tested, the only way that can drop the refcount to 1 is to
call j1939_sk_release() (which involves sock_put()) on all j1939 sockets
(i.e. something like shown below).


diff --git a/net/can/j1939/j1939-priv.h b/net/can/j1939/j1939-priv.h
index 31a93cae5111..81f58924b4ac 100644
--- a/net/can/j1939/j1939-priv.h
+++ b/net/can/j1939/j1939-priv.h
@@ -212,6 +212,7 @@ void j1939_priv_get(struct j1939_priv *priv);

 /* notify/alert all j1939 sockets bound to ifindex */
 void j1939_sk_netdev_event_netdown(struct j1939_priv *priv);
+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv);
 int j1939_cancel_active_session(struct j1939_priv *priv, struct sock *sk);
 void j1939_tp_init(struct j1939_priv *priv);

diff --git a/net/can/j1939/main.c b/net/can/j1939/main.c
index 7e8a20f2fc42..e568b5928a39 100644
--- a/net/can/j1939/main.c
+++ b/net/can/j1939/main.c
@@ -377,6 +377,11 @@ static int j1939_netdev_notify(struct notifier_block *nb,
                j1939_sk_netdev_event_netdown(priv);
                j1939_ecu_unmap_all(priv);
                break;
+       case NETDEV_UNREGISTER:
+               pr_info("NETDEV_UNREGISTER notification on %px start\n", ndev);
+               j1939_sk_netdev_event_unregister(priv);
+               pr_info("NETDEV_UNREGISTER notification on %px end\n", ndev);
+               break;
        }

        j1939_priv_put(priv);
diff --git a/net/can/j1939/socket.c b/net/can/j1939/socket.c
index 3d8b588822f9..4e53a1b10907 100644
--- a/net/can/j1939/socket.c
+++ b/net/can/j1939/socket.c
@@ -1300,6 +1313,24 @@ void j1939_sk_netdev_event_netdown(struct j1939_priv *priv)
        read_unlock_bh(&priv->j1939_socks_lock);
 }

+void j1939_sk_netdev_event_unregister(struct j1939_priv *priv)
+{
+       struct j1939_sock *jsk;
+       struct socket sock = { };
+
+ rescan:
+       read_lock_bh(&priv->j1939_socks_lock);
+       list_for_each_entry(jsk, &priv->j1939_socks, list) {
+               read_unlock_bh(&priv->j1939_socks_lock);
+               pr_info("Releasing %px\n", &jsk->sk);
+               sock.sk = &jsk->sk;
+               //sock_hold(&jsk->sk);
+               j1939_sk_release(&sock);
+               goto rescan;
+       }
+       read_unlock_bh(&priv->j1939_socks_lock);
+}
+
 static int j1939_sk_no_ioctlcmd(struct socket *sock, unsigned int cmd,
                                unsigned long arg)
 {


Of course, calling sk_j1939_sk_release() upon NETDEV_UNREGISTER event
causes refcount underflow bug. But calling sock_hold() before calling
j1939_sk_release() prevents the refcount from dropping to 1. :-(

I think we need to somehow make it possible to logically close j1939
sockets without actually closing. Maybe something like 
"struct in_device"->dead flag which is set by inetdev_destroy() upon
NETDEV_UNREGISTER event is needed by j1939 sockets...

My build environment is very slow (testing on VMWare on a Windows PC).
Running my simplified reproducer on your build environment would be
much faster.


Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ