linux-kernel - Re: Fwd: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac39f5a1-664a-4812-bb50-ceb9771d1d66@huawei.com>
Date: Tue, 1 Apr 2025 21:54:47 +0800
From: Wang Zhaolong <wangzhaolong1@...wei.com>
To: Steve French <smfrench@...il.com>, linux-fsdevel
	<linux-fsdevel@...r.kernel.org>, <linux-net@...r.kernel.org>, LKML
	<linux-kernel@...r.kernel.org>, Enzo Matsumiya <ematsumiya@...e.de>,
	<kuniyu@...zon.com>, <edumazet@...gle.com>
CC: <zhangchangzhong@...wei.com>
Subject: Re: Fwd: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod

Hi.

My colleagues and I have been investigating the issue addressed by this patch
and have discovered some significant concerns that require broader discussion.

### Socket Leak Issue

After testing this patch extensively, I've confirmed it introduces a socket leak
when TCP connections don't complete proper termination (e.g., when FIN packets
are dropped). The leak manifests as a continuous increase in TCP slab usage:

I've documented this issue with a reproducer in Bugzilla:

https://bugzilla.kernel.org/show_bug.cgi?id=219972#c0

The key issue appears to stem from the interaction between the SMB client and TCP
socket lifecycle management:

1. Removing `sk->sk_net_refcnt = 1` causes TCP timers to be terminated early in
    `tcp_close()` via the `inet_csk_clear_xmit_timers_sync()` call when
    `!sk->sk_net_refcnt`
2. This early timer termination prevents proper reference counting resolution
    when connections don't complete the 4-way TCP termination handshake
3. The resulting socket references are never fully released, leading to a leak

#### Timeline of Related Changes

1. v4.2-rc1 26abe14379f8 ("net: Modify sk_alloc to not reference count the netns of kernel sockets")
    - Added `sk_net_refcnt` field to distinguish user sockets (=1) from kernel sockets (=0)
    - Kernel sockets don't hold netns references, which can lead to potential UAF issues

2. v6.9-rc2 151c9c724d05: ("tcp: properly terminate timers for kernel sockets")
    - Modified `tcp_close()` to check `sk->sk_net_refcnt` and explicitly terminate timers for kernel sockets (=0)
    - This prevents UAF when netns is destroyed before socket timers complete
    - **Key change**: If `!sk->sk_net_refcnt`, call `inet_csk_clear_xmit_timers_sync()`

3. v6.12-rc7 ef7134c7fc48: ("smb: client: Fix use-after-free of network namespace")
    - Fixed netns UAF in CIFS by manually setting `sk->sk_net_refcnt = 1`
    - Also called `maybe_get_net()` to maintain netns references
    - This effectively made kernel sockets behave like user sockets for reference counting

4. v6.13-rc4 e9f2517a3e18: ("smb: client: fix TCP timers deadlock after rmmod")
    - Problem commit: Removed `sk->sk_net_refcnt = 1` setting
    - Changed to using explicit `get_net()/put_net()` at CIFS layer
    - This change leads to socket leaks because timers are terminated early

### Lockdep Warning Analysis

I've also investigated the lockdep warning mentioned in the patch. My analysis
suggests it may be a false positive rather than an actual deadlock. The crash
actually occurs in the lockdep subsystem itself (null pointer dereference in
`check_wait_context()`), not in the CIFS or networking code directly.

The procedure for the null pointer dereference is as follows:

When lockdep is enabled, the lock class "slock-AF_INET-CIFS" is set when a socket
connection is established.

```
cifs_do_mount
   cifs_mount
     mount_get_conns
       cifs_get_tcp_session
         ip_connect
           generic_ip_connect
             cifs_reclassify_socket4
               sock_lock_init_class_and_name(sk, "slock-AF_INET-CIFS",
                 lockdep_init_map
                   lockdep_init_map_wait
                     lockdep_init_map_type
                       lockdep_init_map_type
                         register_lock_class
                           __set_bit(class - lock_classes, lock_classes_in_use);
```

When the module is unloaded, the lock class is cleaned up.

```
free_mod_mem
   lockdep_free_key_range
     __lockdep_free_key_range
       zap_class
         __clear_bit(class - lock_classes, lock_classes_in_use);
```

After the module is uninstalled and the network connection is restored, the
timer is woken up.

```
run_timer_softirq
   run_timer_base
     __run_timers
       call_timer_fn
         tcp_write_timer
           bh_lock_sock
             spin_lock(&((__sk)->sk_lock.slock))
               _raw_spin_lock
                 lock_acquire
                   __lock_acquire
                     check_wait_context
                       hlock_class
                        if (!test_bit(class_idx, lock_classes_in_use)) {
                           return NULL;
                       hlock_class(next)->wait_type_inner; // Null pointer dereference
```

The problem lies within lockdep, as Kuniyuki says:

> I tried the repro and confirmed it triggers null deref.
> 
> It happens in LOCKDEP internal, so for me it looks like a problem in
> LOCKDEP rather than CIFS or TCP.
> 
> I think LOCKDEP should hold a module reference and prevent related
> modules from being unloaded.

Regarding the deadlock issue, it is clear that the locks mentioned in the deadlock warning
do not belong to the same lock instance. A deadlock should not occur.

### Discussion Points

1. API Design Question: Is this fundamentally an issue with how CIFS uses the socket
    API, or is it a networking layer issue that should handle socket cleanup differently?

2. Approach to Resolution: Would it be better to:
    - Revert to the original solution (setting `sk->sk_net_refcnt = 1`) from ef7134c7fc48?
    - Work with networking subsystem maintainers on a more comprehensive solution that
      handles socket cleanup properly?

3.  CVE Process Question: Given that CVE-2024-54680 appears to "fix" a non-existent issue
     while introducing an actual vulnerability, what's the appropriate way to address this?

What's the best path forward?

Best regards,
Wang Zhaolong

> Adding fsdevel and networking in case any thoughts on this fix for
> network/namespaces refcount issue (that causes rmmod UAF).
> 
> Any opinions on Enzo's proposed Fix?
> 
> ---------- Forwarded message ---------
> From: Steve French <smfrench@...il.com>
> Date: Tue, Dec 17, 2024 at 9:24 PM
> Subject: [PATCH][SMB3 client] fix TCP timers deadlock after rmmod
> To: CIFS <linux-cifs@...r.kernel.org>
> Cc: Kuniyuki Iwashima <kuniyu@...zon.com>, Enzo Matsumiya <ematsumiya@...e.de>
> 
> 
> Enzo had an interesting patch, that seems to fix an important problem.
> 
> Here was his repro scenario:
> 
>       tw:~ # mount.cifs -o credentials=/root/wincreds,echo_interval=10
> //someserver/target1 /mnt/test
>       tw:~ # ls /mnt/test
>       abc  dir1  dir3  target1_file.txt  tsub
>       tw:~ # iptables -A INPUT -s someserver -j DROP
> 
> Trigger reconnect and wait for 3*echo_interval:
> 
>       tw:~ # cat /mnt/test/target1_file.txt
>       cat: /mnt/test/target1_file.txt: Host is down
> 
> Then umount and rmmod.  Note that rmmod might take several iterations
> until it properly tears down everything, so make sure you see the "not
> loaded" message before proceeding:
> 
>       tw:~ # umount /mnt/*; rmmod cifs
>       umount: /mnt/az: not mounted.
>       umount: /mnt/dfs: not mounted.
>       umount: /mnt/local: not mounted.
>       umount: /mnt/scratch: not mounted.
>       rmmod: ERROR: Module cifs is in use
>       ...
>       tw:~ # rmmod cifs
>       rmmod: ERROR: Module cifs is not currently loaded
> 
> Then kickoff the TCP internals:
>       tw:~ # iptables -F
> 
> Gets the lockdep warning (requires CONFIG_LOCKDEP=y) + a NULL deref
> later on.
> 
> 
> Any thoughts on his patch?  See below (and attached)
> 
>      Commit ef7134c7fc48 ("smb: client: Fix use-after-free of network
> namespace.")
>      fixed a netns UAF by manually enabled socket refcounting
>      (sk->sk_net_refcnt=1 and sock_inuse_add(net, 1)).
> 
>      The reason the patch worked for that bug was because we now hold
>      references to the netns (get_net_track() gets a ref internally)
>      and they're properly released (internally, on __sk_destruct()),
>      but only because sk->sk_net_refcnt was set.
> 
>      Problem:
>      (this happens regardless of CONFIG_NET_NS_REFCNT_TRACKER and regardless
>      if init_net or other)
> 
>      Setting sk->sk_net_refcnt=1 *manually* and *after* socket creation is not
>      only out of cifs scope, but also technically wrong -- it's set conditionally
>      based on user (=1) vs kernel (=0) sockets.  And net/ implementations
>      seem to base their user vs kernel space operations on it.
> 
>      e.g. upon TCP socket close, the TCP timers are not cleared because
>      sk->sk_net_refcnt=1:
>      (cf. commit 151c9c724d05 ("tcp: properly terminate timers for
> kernel sockets"))
> 
>      net/ipv4/tcp.c:
>          void tcp_close(struct sock *sk, long timeout)
>          {
>              lock_sock(sk);
>              __tcp_close(sk, timeout);
>              release_sock(sk);
>              if (!sk->sk_net_refcnt)
>                      inet_csk_clear_xmit_timers_sync(sk);
>              sock_put(sk);
>          }
> 
>      Which will throw a lockdep warning and then, as expected, deadlock on
>      tcp_write_timer().
> 
>      A way to reproduce this is by running the reproducer from ef7134c7fc48
>      and then 'rmmod cifs'.  A few seconds later, the deadlock/lockdep
>      warning shows up.
> 
>      Fix:
>      We shouldn't mess with socket internals ourselves, so do not set
>      sk_net_refcnt manually.
> 
>      Also change __sock_create() to sock_create_kern() for explicitness.
> 
>      As for non-init_net network namespaces, we deal with it the best way
>      we can -- hold an extra netns reference for server->ssocket and drop it
>      when it's released.  This ensures that the netns still exists whenever
>      we need to create/destroy server->ssocket, but is not directly tied to
>      it.
> 
>      Fixes: ef7134c7fc48 ("smb: client: Fix use-after-free of network
> namespace.")
> 
> 
> --
> Thanks,
> 
> Steve
> 
>