Message-Id: <20250804134006.3609555-1-wangzhaolong@huaweicloud.com>
Date: Mon,  4 Aug 2025 21:40:02 +0800
From: Wang Zhaolong <wangzhaolong@...weicloud.com>
To: sfrench@...ba.org,
	pshilov@...rosoft.com
Cc: linux-cifs@...r.kernel.org,
	samba-technical@...ts.samba.org,
	linux-kernel@...r.kernel.org,
	chengzhihao1@...wei.com,
	wangzhaolong@...weicloud.com,
	yi.zhang@...wei.com,
	yangerkun@...wei.com
Subject: [PATCH 0/4] Fix mid_q_entry memory leaks in SMB client

I've been investigating a pretty nasty memory leak in the SMB client. When
compound requests get interrupted by signals, we end up with mid_q_entry
structures and server buffers that never get freed[1].

User foreground process                    cifsd
cifs_readdir
 open_cached_dir
  cifs_send_recv
   compound_send_recv
    smb2_setup_request
     smb2_mid_entry_alloc
      smb2_get_mid_entry
       smb2_mid_entry_alloc
        mempool_alloc // alloc mid
        kref_init(&temp->refcount); // refcount = 1
     mid[0]->callback = cifs_compound_callback;
     mid[1]->callback = cifs_compound_last_callback;
     smb_send_rqst
     rc = wait_for_response
      wait_event_state TASK_KILLABLE
                                  cifs_demultiplex_thread
                                    allocate_buffers
                                      server->bigbuf = cifs_buf_get()
                                    standard_receive3
                                      ->find_mid()
                                        smb2_find_mid
                                          __smb2_find_mid
                                           kref_get(&mid->refcount) // +1
                                      cifs_handle_standard
                                        handle_mid
                                         /* bigbuf will also leak */
                                         mid->resp_buf = server->bigbuf
                                         server->bigbuf = NULL;
                                         dequeue_mid
                                     /* in for loop */
                                    mids[0]->callback
                                      cifs_compound_callback
    /* Signal interrupts wait: rc = -ERESTARTSYS */
    /* if (... || midQ[i]->mid_state == MID_RESPONSE_RECEIVED) */
    midQ[0]->callback = cifs_cancelled_callback;
    cancelled_mid[i] = true;
                                       /* The change comes too late */
                                       mid->mid_state = MID_RESPONSE_READY
                                    release_mid  // -1
    /* cancelled_mid[i] == true causes mid won't be released
       in compound_send_recv cleanup */
    /* cifs_cancelled_callback won't be executed to release mid */

The core issue is a race condition where cifs_cancelled_callback never
gets a chance to run, so cleanup never happens. I've spent quite a bit
of time trying to understand how to fix this safely.
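The leak itself is a refcount imbalance: the mid starts at refcount 1 (kref_init), the demultiplex thread takes +1 in __smb2_find_mid and drops it in release_mid, and the waiter's final put normally happens either in compound_send_recv cleanup or in cifs_cancelled_callback. In the racy window neither of those last two runs. A minimal standalone model (illustrative userspace C, not the kernel code) of that imbalance:

```c
#include <assert.h>

/* Toy refcount model of the leak. mid_put() frees the object (and,
 * in the real code, the server->bigbuf hanging off it) only when the
 * count reaches zero; if the final put is skipped, the mid leaks. */
struct mid { int refcount; int freed; };

static void mid_get(struct mid *m) { m->refcount++; }

static void mid_put(struct mid *m)
{
	if (--m->refcount == 0)
		m->freed = 1;	/* stands in for mempool_free + cifs_buf_release */
}
```

On the normal path the puts balance the gets and the mid is freed; on the racy path, cancelled_mid[i] == true suppresses the cleanup put while cifs_cancelled_callback never runs, so the count is stuck at 1 forever.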

Honestly, my first instinct was to just patch the callback assignment
logic directly. But the more I dug into it, the more I realized that
the current locking scheme makes this really tricky to do safely. We
have one big lock protecting multiple different things, and trying to
fix the race condition directly felt like playing with fire.

I kept running into scenarios where a "simple" fix could introduce
deadlocks or new race conditions. After looking at this from different
angles, I came to the conclusion that I needed to refactor the locking
first to create a safe foundation for the actual fix.

Patches 1-3 are foundational refactoring. These three patches rename
locks for clarity, separate counter protection from queue operations,
and replace the confusing mid_flags bitmask with explicit boolean
fields. Basically, they untangle the current locking mess so I can
implement the real fix without breaking anything.
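To make the mid_flags change concrete, here is the shape of that kind of refactoring in a hedged sketch (the "after" field names are illustrative, not necessarily the ones used in the patches): a bitmask whose bits encode unrelated states becomes explicit booleans, so each state can be read, written, and reasoned about under its own locking rules without bit arithmetic.

```c
#include <stdbool.h>

/* Before: several unrelated states packed into one bitmask. */
#define MID_DELETED        (1 << 0)
#define MID_WAIT_CANCELLED (1 << 1)

struct mid_old {
	unsigned int mid_flags;
};

/* After: explicit booleans, one per state. Field names here are
 * illustrative of the technique, not copied from the patches. */
struct mid_new {
	bool deleted_from_q;	/* removed from the pending-mid queue */
	bool wait_cancelled;	/* waiter gave up on the response */
};
```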

The 4th patch in the series is where the real fix happens. With
the previous refactoring in place, I could safely add a lock to each
mid_q_entry and implement atomic callback execution. This eliminates
the race condition that was causing the leaks.
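The essence of the fix can be modeled in userspace (pthread mutex standing in for the per-mid lock; all names here are illustrative, not the kernel code): both the receive path and the cancellation path must take the per-mid lock before deciding who runs the callback, so exactly one side executes it and cleanup always happens, regardless of which side wins the race.

```c
#include <pthread.h>
#include <stdbool.h>

/* Illustrative model of atomic callback execution under a per-mid lock. */
struct mid_entry {
	pthread_mutex_t lock;	/* per-mid lock */
	bool callback_ran;	/* set once either callback has executed */
	void (*callback)(struct mid_entry *);
	int freed;		/* counts cleanup; must end at exactly 1 */
};

static void normal_callback(struct mid_entry *mid)    { mid->freed++; }
static void cancelled_callback(struct mid_entry *mid) { mid->freed++; }

/* Receive path: run the callback atomically under the per-mid lock. */
static void mid_execute_callback(struct mid_entry *mid)
{
	pthread_mutex_lock(&mid->lock);
	if (!mid->callback_ran) {
		mid->callback_ran = true;
		mid->callback(mid);
	}
	pthread_mutex_unlock(&mid->lock);
}

/* Waiter path after -ERESTARTSYS: swap in the cancelled callback and run
 * it, but only if the receive path has not already run the original one.
 * The check and the swap happen under the same lock, closing the window. */
static void mid_cancel(struct mid_entry *mid)
{
	pthread_mutex_lock(&mid->lock);
	if (!mid->callback_ran) {
		mid->callback = cancelled_callback;
		mid->callback_ran = true;
		mid->callback(mid);
	}
	pthread_mutex_unlock(&mid->lock);
}
```

Whichever side takes the lock first runs its callback; the other side observes callback_ran and does nothing, so cleanup happens exactly once in both orderings.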

In summary, my approach to the fix is to use finer-grained locking to
avoid the race condition, though the implementation ended up involving
more changes than I initially hoped for. If
there's a simpler or more elegant way to fix this race condition that
I've missed, I'd love to hear about it. I've tried to be thorough in
my analysis, but I know there are folks with more experience in this
codebase who might see a better path.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=220404 [1]

Wang Zhaolong (4):
  smb: client: rename server mid_lock to mid_queue_lock
  smb: client: add mid_counter_lock to protect the mid counter
  smb: client: eliminate mid_flags field
  smb: client: fix mid_q_entry memory leak with per-mid locking

 fs/smb/client/cifs_debug.c    | 12 ++++--
 fs/smb/client/cifsglob.h      | 23 +++++++-----
 fs/smb/client/connect.c       | 57 +++++++++++++++++-----------
 fs/smb/client/smb1ops.c       | 23 ++++++++----
 fs/smb/client/smb2ops.c       | 69 ++++++++++++++++++----------------
 fs/smb/client/smb2transport.c |  5 ++-
 fs/smb/client/transport.c     | 71 +++++++++++++++++++----------------
 7 files changed, 150 insertions(+), 110 deletions(-)

-- 
2.39.2

