[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250401202810.81863-1-kuniyu@amazon.com>
Date: Tue, 1 Apr 2025 13:26:54 -0700
From: Kuniyuki Iwashima <kuniyu@...zon.com>
To: <wangzhaolong1@...wei.com>
CC: <edumazet@...gle.com>, <ematsumiya@...e.de>, <kuniyu@...zon.com>,
<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
<netdev@...r.kernel.org>, <smfrench@...il.com>, <zhangchangzhong@...wei.com>
Subject: Re: Fwd: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
(corrected netdev list)
From: Wang Zhaolong <wangzhaolong1@...wei.com>
Date: Tue, 1 Apr 2025 21:54:47 +0800
> Hi.
>
> My colleagues and I have been investigating the issue addressed by this patch
> and have discovered some significant concerns that require broader discussion.
>
> ### Socket Leak Issue
>
> After testing this patch extensively, I've confirmed it introduces a socket leak
> when TCP connections don't complete proper termination (e.g., when FIN packets
> are dropped). The leak manifests as a continuous increase in TCP slab usage:
>
> I've documented this issue with a reproducer in Bugzilla:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=219972#c0
>
> The key issue appears to stem from the interaction between the SMB client and TCP
> socket lifecycle management:
>
> 1. Removing `sk->sk_net_refcnt = 1` causes TCP timers to be terminated early in
> `tcp_close()` via the `inet_csk_clear_xmit_timers_sync()` call when
> `!sk->sk_net_refcnt`
> 2. This early timer termination prevents proper reference counting resolution
> when connections don't complete the 4-way TCP termination handshake
> 3. The resulting socket references are never fully released, leading to a leak
>
> #### Timeline of Related Changes
>
> 1. v4.2-rc1 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets")
> - Added `sk_net_refcnt` field to distinguish user sockets (=1) from kernel sockets (=0)
> - Kernel sockets don't hold netns references, which can lead to potential UAF issues
>
> 2. v6.9-rc2 151c9c724d05: ("tcp: properly terminate timers for kernel sockets")
> - Modified `tcp_close()` to check `sk->sk_net_refcnt` and explicitly terminate timers for kernel sockets (=0)
> - This prevents UAF when netns is destroyed before socket timers complete
> - **Key change**: If `!sk->sk_net_refcnt`, call `inet_csk_clear_xmit_timers_sync()`
>
> 3. v6.12-rc7 ef7134c7fc48: ("smb: client: Fix use-after-free of network namespace")
> - Fixed netns UAF in CIFS by manually setting `sk->sk_net_refcnt = 1`
> - Also called `maybe_get_net()` to maintain netns references
> - This effectively made kernel sockets behave like user sockets for reference counting
>
> 4. v6.13-rc4 e9f2517a3e18: ("smb: client: fix TCP timers deadlock after rmmod")
> - Problem commit: Removed `sk->sk_net_refcnt = 1` setting
> - Changed to using explicit `get_net()/put_net()` at CIFS layer
> - This change leads to socket leaks because timers are terminated early
>
> ### Lockdep Warning Analysis
>
> I've also investigated the lockdep warning mentioned in the patch. My analysis
> suggests it may be a false positive rather than an actual deadlock.
Note that the 'deadlock' in the description is simply wrong as I mentioned
before. The 'deadlock' means a lock which belongs to an unloaded module,
and not actual deadlock like circular locking etc.
https://lore.kernel.org/netdev/20241219084100.33837-1-kuniyu@amazon.com/
> The crash
> actually occurs in the lockdep subsystem itself (null pointer dereference in
> `check_wait_context()`), not in the CIFS or networking code directly.
>
> The procedure for the null pointer dereference is as follows:
>
> When lockdep is enabled, the lock class "slock-AF_INET-CIFS" is set when a socket
> connection is established.
>
> ```
> cifs_do_mount
> cifs_mount
> mount_get_conns
> cifs_get_tcp_session
> ip_connect
> generic_ip_connect
> cifs_reclassify_socket4
> sock_lock_init_class_and_name(sk, "slock-AF_INET-CIFS",
> lockdep_init_map
> lockdep_init_map_wait
> lockdep_init_map_type
> lockdep_init_map_type
> register_lock_class
> __set_bit(class - lock_classes, lock_classes_in_use);
> ```
>
> When the module is unloaded, the lock class is cleaned up.
>
> ```
> free_mod_mem
> lockdep_free_key_range
> __lockdep_free_key_range
> zap_class
> __clear_bit(class - lock_classes, lock_classes_in_use);
> ```
>
> After the module is uninstalled and the network connection is restored, the
> timer is woken up.
>
> ```
> run_timer_softirq
> run_timer_base
> __run_timers
> call_timer_fn
> tcp_write_timer
> bh_lock_sock
> spin_lock(&((__sk)->sk_lock.slock))
> _raw_spin_lock
> lock_acquire
> __lock_acquire
> check_wait_context
> hlock_class
> if (!test_bit(class_idx, lock_classes_in_use)) {
> return NULL;
> hlock_class(next)->wait_type_inner; // Null pointer dereference
> ```
>
> The problem lies within lockdep, as Kuniyuki says:
>
> > I tried the repro and confirmed it triggers null deref.
> >
> > It happens in LOCKDEP internal, so for me it looks like a problem in
> > LOCKDEP rather than CIFS or TCP.
> >
> > I think LOCKDEP should hold a module reference and prevent related
> > modules from being unloaded.
>
> Regarding the deadlock issue, it is clear that the locks mentioned in the deadlock warning
> do not belong to the same lock instance. A deadlock should not occur.
>
> ### Discussion Points
>
> 1. API Design Question: Is this fundamentally an issue with how CIFS uses the socket
> API, or is it a networking layer issue that should handle socket cleanup differently?
>
> 2. Approach to Resolution: Would it be better to:
> - Revert to the original solution (setting `sk->sk_net_refcnt = 1`) from ef7134c7fc48?
> - Work with networking subsystem maintainers on a more comprehensive solution that
> handles socket cleanup properly?
Could you test this patch with e9f2517a3e18 reverted ?
---8<---
diff --git a/include/net/sock.h b/include/net/sock.h
index 8daf1b3b12c6..1e15a1209ea6 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -547,6 +547,10 @@ struct sock {
struct rcu_head sk_rcu;
netns_tracker ns_tracker;
struct xarray sk_user_frags;
+
+#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE)
+ struct module *sk_owner;
+#endif
};
struct sock_bh_locked {
@@ -1583,6 +1587,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
sk_mem_reclaim(sk);
}
+#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE)
+static inline void sk_set_owner(struct sock *sk, struct module *owner)
+{
+ sk->sk_owner = module_get(owner);
+}
+#else
+#define sk_set_owner(sk, owner)
+#endif
+
/*
* Macro so as to not evaluate some arguments when
* lockdep is not enabled.
@@ -1592,6 +1605,7 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
*/
#define sock_lock_init_class_and_name(sk, sname, skey, name, key) \
do { \
+ sk_set_owner(sk, THIS_MODULE); \
sk->sk_lock.owned = 0; \
init_waitqueue_head(&sk->sk_lock.wq); \
spin_lock_init(&(sk)->sk_lock.slock); \
diff --git a/net/core/sock.c b/net/core/sock.c
index 323892066def..2d91a92ed26d 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2324,6 +2324,11 @@ static void __sk_destruct(struct rcu_head *head)
__netns_tracker_free(net, &sk->ns_tracker, false);
net_passive_dec(net);
}
+
+#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(MODULE)
+ if (sk->sk_owner)
+ module_put(sk->sk_owner);
+#endif
sk_prot_free(sk->sk_prot_creator, sk);
}
---8<---
>
> 3. CVE Process Question: Given that CVE-2024-54680 appears to "fix" a non-existent issue
> while introducing an actual vulnerability, what's the appropriate way to address this?
>
> What's the best path forward?
>
> Best regards,
> Wang Zhaolong
>
> > Adding fsdevel and networking in case any thoughts on this fix for
> > network/namespaces refcount issue (that causes rmmod UAF).
> >
> > Any opinions on Enzo's proposed Fix?
> >
> > ---------- Forwarded message ---------
> > From: Steve French <smfrench@...il.com>
> > Date: Tue, Dec 17, 2024 at 9:24 PM
> > Subject: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
> > To: CIFS <linux-cifs@...r.kernel.org>
> > Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, Enzo Matsumiya <ematsumiya@...e.de>
> >
> >
> > Enzo had an interesting patch, that seems to fix an important problem.
> >
> > Here was his repro scenario:
> >
> > tw:~ # mount.cifs -o credentials=/root/wincreds,echo_interval=10
> > //someserver/target1 /mnt/test
> > tw:~ # ls /mnt/test
> > abc dir1 dir3 target1_file.txt tsub
> > tw:~ # iptables -A INPUT -s someserver -j DROP
> >
> > Trigger reconnect and wait for 3*echo_interval:
> >
> > tw:~ # cat /mnt/test/target1_file.txt
> > cat: /mnt/test/target1_file.txt: Host is down
> >
> > Then umount and rmmod. Note that rmmod might take several iterations
> > until it properly tears down everything, so make sure you see the "not
> > loaded" message before proceeding:
> >
> > tw:~ # umount /mnt/*; rmmod cifs
> > umount: /mnt/az: not mounted.
> > umount: /mnt/dfs: not mounted.
> > umount: /mnt/local: not mounted.
> > umount: /mnt/scratch: not mounted.
> > rmmod: ERROR: Module cifs is in use
> > ...
> > tw:~ # rmmod cifs
> > rmmod: ERROR: Module cifs is not currently loaded
> >
> > Then kickoff the TCP internals:
> > tw:~ # iptables -F
> >
> > Gets the lockdep warning (requires CONFIG_LOCKDEP=y) + a NULL deref
> > later on.
> >
> >
> > Any thoughts on his patch? See below (and attached)
> >
> > Commit ef7134c7fc48 ("smb: client: Fix use-after-free of network
> > namespace.")
> > fixed a netns UAF by manually enabled socket refcounting
> > (sk->sk_net_refcnt=1 and sock_inuse_add(net, 1)).
> >
> > The reason the patch worked for that bug was because we now hold
> > references to the netns (get_net_track() gets a ref internally)
> > and they're properly released (internally, on __sk_destruct()),
> > but only because sk->sk_net_refcnt was set.
> >
> > Problem:
> > (this happens regardless of CONFIG_NET_NS_REFCNT_TRACKER and regardless
> > if init_net or other)
> >
> > Setting sk->sk_net_refcnt=1 *manually* and *after* socket creation is not
> > only out of cifs scope, but also technically wrong -- it's set conditionally
> > based on user (=1) vs kernel (=0) sockets. And net/ implementations
> > seem to base their user vs kernel space operations on it.
> >
> > e.g. upon TCP socket close, the TCP timers are not cleared because
> > sk->sk_net_refcnt=1:
> > (cf. commit 151c9c724d05 ("tcp: properly terminate timers for
> > kernel sockets"))
> >
> > net/ipv4/tcp.c:
> > void tcp_close(struct sock *sk, long timeout)
> > {
> > lock_sock(sk);
> > __tcp_close(sk, timeout);
> > release_sock(sk);
> > if (!sk->sk_net_refcnt)
> > inet_csk_clear_xmit_timers_sync(sk);
> > sock_put(sk);
> > }
> >
> > Which will throw a lockdep warning and then, as expected, deadlock on
> > tcp_write_timer().
> >
> > A way to reproduce this is by running the reproducer from ef7134c7fc48
> > and then 'rmmod cifs'. A few seconds later, the deadlock/lockdep
> > warning shows up.
> >
> > Fix:
> > We shouldn't mess with socket internals ourselves, so do not set
> > sk_net_refcnt manually.
> >
> > Also change __sock_create() to sock_create_kern() for explicitness.
> >
> > As for non-init_net network namespaces, we deal with it the best way
> > we can -- hold an extra netns reference for server->ssocket and drop it
> > when it's released. This ensures that the netns still exists whenever
> > we need to create/destroy server->ssocket, but is not directly tied to
> > it.
> >
> > Fixes: ef7134c7fc48 ("smb: client: Fix use-after-free of network
> > namespace.")
> >
> >
> > --
> > Thanks,
> >
> > Steve
Powered by blists - more mailing lists