[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac39f5a1-664a-4812-bb50-ceb9771d1d66@huawei.com>
Date: Tue, 1 Apr 2025 21:54:47 +0800
From: Wang Zhaolong <wangzhaolong1@...wei.com>
To: Steve French <smfrench@...il.com>, linux-fsdevel
<linux-fsdevel@...r.kernel.org>, <linux-net@...r.kernel.org>, LKML
<linux-kernel@...r.kernel.org>, Enzo Matsumiya <ematsumiya@...e.de>,
<kuniyu@...zon.com>, <edumazet@...gle.com>
CC: <zhangchangzhong@...wei.com>
Subject: Re: Fwd: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
Hi.
My colleagues and I have been investigating the issue addressed by this patch
and have discovered some significant concerns that require broader discussion.
### Socket Leak Issue
After testing this patch extensively, I've confirmed it introduces a socket leak
when TCP connections don't complete proper termination (e.g., when FIN packets
are dropped). The leak manifests as a continuous increase in TCP slab usage:
I've documented this issue with a reproducer in Bugzilla:
https://bugzilla.kernel.org/show_bug.cgi?id=219972#c0
The key issue appears to stem from the interaction between the SMB client and TCP
socket lifecycle management:
1. Removing `sk->sk_net_refcnt = 1` causes TCP timers to be terminated early in
`tcp_close()` via the `inet_csk_clear_xmit_timers_sync()` call when
`!sk->sk_net_refcnt`
2. This early timer termination prevents proper reference counting resolution
when connections don't complete the 4-way TCP termination handshake
3. The resulting socket references are never fully released, leading to a leak
#### Timeline of Related Changes
1. v4.2-rc1 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets")
- Added `sk_net_refcnt` field to distinguish user sockets (=1) from kernel sockets (=0)
- Kernel sockets don't hold netns references, which can lead to potential UAF issues
2. v6.9-rc2 151c9c724d05: ("tcp: properly terminate timers for kernel sockets")
- Modified `tcp_close()` to check `sk->sk_net_refcnt` and explicitly terminate timers for kernel sockets (=0)
- This prevents UAF when netns is destroyed before socket timers complete
- **Key change**: If `!sk->sk_net_refcnt`, call `inet_csk_clear_xmit_timers_sync()`
3. v6.12-rc7 ef7134c7fc48: ("smb: client: Fix use-after-free of network namespace")
- Fixed netns UAF in CIFS by manually setting `sk->sk_net_refcnt = 1`
- Also called `maybe_get_net()` to maintain netns references
- This effectively made kernel sockets behave like user sockets for reference counting
4. v6.13-rc4 e9f2517a3e18: ("smb: client: fix TCP timers deadlock after rmmod")
- Problem commit: Removed `sk->sk_net_refcnt = 1` setting
- Changed to using explicit `get_net()/put_net()` at CIFS layer
- This change leads to socket leaks because timers are terminated early
### Lockdep Warning Analysis
I've also investigated the lockdep warning mentioned in the patch. My analysis
suggests it may be a false positive rather than an actual deadlock. The crash
actually occurs in the lockdep subsystem itself (null pointer dereference in
`check_wait_context()`), not in the CIFS or networking code directly.
The procedure for the null pointer dereference is as follows:
When lockdep is enabled, the lock class "slock-AF_INET-CIFS" is set when a socket
connection is established.
```
cifs_do_mount
cifs_mount
mount_get_conns
cifs_get_tcp_session
ip_connect
generic_ip_connect
cifs_reclassify_socket4
sock_lock_init_class_and_name(sk, "slock-AF_INET-CIFS",
lockdep_init_map
lockdep_init_map_wait
lockdep_init_map_type
lockdep_init_map_type
register_lock_class
__set_bit(class - lock_classes, lock_classes_in_use);
```
When the module is unloaded, the lock class is cleaned up.
```
free_mod_mem
lockdep_free_key_range
__lockdep_free_key_range
zap_class
__clear_bit(class - lock_classes, lock_classes_in_use);
```
After the module is uninstalled and the network connection is restored, the
timer is woken up.
```
run_timer_softirq
run_timer_base
__run_timers
call_timer_fn
tcp_write_timer
bh_lock_sock
spin_lock(&((__sk)->sk_lock.slock))
_raw_spin_lock
lock_acquire
__lock_acquire
check_wait_context
hlock_class
if (!test_bit(class_idx, lock_classes_in_use)) {
return NULL;
hlock_class(next)->wait_type_inner; // Null pointer dereference
```
The problem lies within lockdep, as Kuniyuki says:
> I tried the repro and confirmed it triggers null deref.
>
> It happens in LOCKDEP internal, so for me it looks like a problem in
> LOCKDEP rather than CIFS or TCP.
>
> I think LOCKDEP should hold a module reference and prevent related
> modules from being unloaded.
Regarding the deadlock issue, it is clear that the locks mentioned in the deadlock warning
do not belong to the same lock instance. A deadlock should not occur.
### Discussion Points
1. API Design Question: Is this fundamentally an issue with how CIFS uses the socket
API, or is it a networking layer issue that should handle socket cleanup differently?
2. Approach to Resolution: Would it be better to:
- Revert to the original solution (setting `sk->sk_net_refcnt = 1`) from ef7134c7fc48?
- Work with networking subsystem maintainers on a more comprehensive solution that
handles socket cleanup properly?
3. CVE Process Question: Given that CVE-2024-54680 appears to "fix" a non-existent issue
while introducing an actual vulnerability, what's the appropriate way to address this?
What's the best path forward?
Best regards,
Wang Zhaolong
> Adding fsdevel and networking in case any thoughts on this fix for
> network/namespaces refcount issue (that causes rmmod UAF).
>
> Any opinions on Enzo's proposed Fix?
>
> ---------- Forwarded message ---------
> From: Steve French <smfrench@...il.com>
> Date: Tue, Dec 17, 2024 at 9:24 PM
> Subject: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
> To: CIFS <linux-cifs@...r.kernel.org>
> Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, Enzo Matsumiya <ematsumiya@...e.de>
>
>
> Enzo had an interesting patch, that seems to fix an important problem.
>
> Here was his repro scenario:
>
> tw:~ # mount.cifs -o credentials=/root/wincreds,echo_interval=10
> //someserver/target1 /mnt/test
> tw:~ # ls /mnt/test
> abc dir1 dir3 target1_file.txt tsub
> tw:~ # iptables -A INPUT -s someserver -j DROP
>
> Trigger reconnect and wait for 3*echo_interval:
>
> tw:~ # cat /mnt/test/target1_file.txt
> cat: /mnt/test/target1_file.txt: Host is down
>
> Then umount and rmmod. Note that rmmod might take several iterations
> until it properly tears down everything, so make sure you see the "not
> loaded" message before proceeding:
>
> tw:~ # umount /mnt/*; rmmod cifs
> umount: /mnt/az: not mounted.
> umount: /mnt/dfs: not mounted.
> umount: /mnt/local: not mounted.
> umount: /mnt/scratch: not mounted.
> rmmod: ERROR: Module cifs is in use
> ...
> tw:~ # rmmod cifs
> rmmod: ERROR: Module cifs is not currently loaded
>
> Then kickoff the TCP internals:
> tw:~ # iptables -F
>
> Gets the lockdep warning (requires CONFIG_LOCKDEP=y) + a NULL deref
> later on.
>
>
> Any thoughts on his patch? See below (and attached)
>
> Commit ef7134c7fc48 ("smb: client: Fix use-after-free of network
> namespace.")
> fixed a netns UAF by manually enabled socket refcounting
> (sk->sk_net_refcnt=1 and sock_inuse_add(net, 1)).
>
> The reason the patch worked for that bug was because we now hold
> references to the netns (get_net_track() gets a ref internally)
> and they're properly released (internally, on __sk_destruct()),
> but only because sk->sk_net_refcnt was set.
>
> Problem:
> (this happens regardless of CONFIG_NET_NS_REFCNT_TRACKER and regardless
> if init_net or other)
>
> Setting sk->sk_net_refcnt=1 *manually* and *after* socket creation is not
> only out of cifs scope, but also technically wrong -- it's set conditionally
> based on user (=1) vs kernel (=0) sockets. And net/ implementations
> seem to base their user vs kernel space operations on it.
>
> e.g. upon TCP socket close, the TCP timers are not cleared because
> sk->sk_net_refcnt=1:
> (cf. commit 151c9c724d05 ("tcp: properly terminate timers for
> kernel sockets"))
>
> net/ipv4/tcp.c:
> void tcp_close(struct sock *sk, long timeout)
> {
> lock_sock(sk);
> __tcp_close(sk, timeout);
> release_sock(sk);
> if (!sk->sk_net_refcnt)
> inet_csk_clear_xmit_timers_sync(sk);
> sock_put(sk);
> }
>
> Which will throw a lockdep warning and then, as expected, deadlock on
> tcp_write_timer().
>
> A way to reproduce this is by running the reproducer from ef7134c7fc48
> and then 'rmmod cifs'. A few seconds later, the deadlock/lockdep
> warning shows up.
>
> Fix:
> We shouldn't mess with socket internals ourselves, so do not set
> sk_net_refcnt manually.
>
> Also change __sock_create() to sock_create_kern() for explicitness.
>
> As for non-init_net network namespaces, we deal with it the best way
> we can -- hold an extra netns reference for server->ssocket and drop it
> when it's released. This ensures that the netns still exists whenever
> we need to create/destroy server->ssocket, but is not directly tied to
> it.
>
> Fixes: ef7134c7fc48 ("smb: client: Fix use-after-free of network
> namespace.")
>
>
> --
> Thanks,
>
> Steve
>
>
Powered by blists - more mailing lists