Message-ID: <a51bcd2d-d1c6-4516-90c1-f6c50ce01f9f@nvidia.com>
Date: Tue, 16 Dec 2025 15:59:32 +0200
From: Michael Gur <michaelgur@...dia.com>
To: wujing <realwujing@...com>, jgg@...pe.ca
Cc: leon@...nel.org, linux-kernel@...r.kernel.org,
linux-rdma@...r.kernel.org, yuanql9@...natelecom.cn
Subject: Re: [PATCH] IB/core: Fix ABBA deadlock in rdma_dev_exit_net

On 12/16/2025 11:59 AM, wujing wrote:
> Hi Jason,
>
> You're right that the locks aren't nested in rdma_dev_exit_net() - it does release
> rdma_nets_rwsem before acquiring devices_rwsem. However, this is still an ABBA deadlock,
> just not the trivial nested kind. The issue is caused by **rwsem writer priority**
> and lock ordering inconsistency.
>
> Here's the actual deadlock scenario:
>
> **Thread A (rdma_dev_exit_net - cleanup_net workqueue):**
> ```
> down_write(&rdma_nets_rwsem); // Acquired
> xa_store(&rdma_nets, ...);
> up_write(&rdma_nets_rwsem); // Released
> down_read(&devices_rwsem); // Waiting here <-- BLOCKED
> ```
>
> **Thread B (rdma_dev_init_net - stress-ng-clone):**
> ```
> down_read(&devices_rwsem); // Acquired
> down_read(&rdma_nets_rwsem); // Waiting here <-- BLOCKED
> ```
>
> The deadlock happens because:
>
> 1. Thread A releases rdma_nets_rwsem as a **writer**
> 2. Thread B (and many others) are waiting to acquire rdma_nets_rwsem as **readers**
> 3. Thread A then tries to acquire devices_rwsem as a reader
> 4. BUT: rwsem gives priority to pending writers over new readers
> 5. Since Thread A was a pending writer on rdma_nets_rwsem, Thread B's read request is blocked
> 6. Thread B holds devices_rwsem, which Thread A needs
> 7. Thread A holds the "writer priority slot" on rdma_nets_rwsem, which Thread B needs
>
Why would Thread A still hold any writer priority after calling up_write()?

The kernel log is also not consistent with this analysis: the thread
running rdma_dev_exit_net() is stuck on the down_write(), not on the
down_read().

Maybe what we have is a thread running some net namespace operation
while holding rdma_nets_rwsem and starving all other threads.
I'm not sure how many devices and namespaces we would need for it to
block for this long, but I'd assume it's possible when running
stress testing.