Message-ID: <CAM_iQpXXdpdKvVY4G=y8=R4TsYE0ovac=OCNfiaMmD=Rgn2utQ@mail.gmail.com>
Date: Sat, 27 Jun 2020 11:55:21 -0700
From: Cong Wang <xiyou.wangcong@...il.com>
To: Sean Tranchetti <stranche@...eaurora.org>
Cc: David Miller <davem@...emloft.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
Pravin B Shelar <pshelar@....org>,
Subash Abhinov Kasiviswanathan <subashab@...eaurora.org>
Subject: Re: [PATCH net] genetlink: take netlink table lock when (un)registering
On Fri, Jun 26, 2020 at 5:32 PM Sean Tranchetti <stranche@...eaurora.org> wrote:
>
> A potential deadlock can occur during registering or unregistering a new
> generic netlink family between the main nl_table_lock and the cb_lock where
> each thread wants the lock held by the other, as demonstrated below.
>
> 1) Thread 1 is performing a netlink_bind() operation on a socket. As part
> of this call, it will call netlink_lock_table(), incrementing the
> nl_table_users count to 1.
> 2) Thread 2 is registering (or unregistering) a genl_family via the
> genl_(un)register_family() API. The cb_lock semaphore will be taken for
> writing.
> 3) Thread 1 will call genl_bind() as part of the bind operation to handle
> subscribing to GENL multicast groups at the request of the user. It will
> attempt to take the cb_lock semaphore for reading, but it will fail and
> be scheduled away, waiting for Thread 2 to finish the write.
> 4) Thread 2 will call netlink_table_grab() during the (un)registration
> call. However, as Thread 1 has incremented nl_table_users, it will not
> be able to proceed, and both threads will be stuck waiting for the
> other.
>
> To avoid this scenario, the locks should be acquired in the same order by
> both threads. Since both the register and unregister functions need to take
> the nl_table_lock in their processing, it makes sense to explicitly acquire
> them before they lock the genl_mutex and the cb_lock. In unregistering, no
> other change is needed aside from this locking change.
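
To make the inversion in steps 1-4 concrete, here is a minimal user-space
sketch in C. It is not kernel code and does not use any kernel API: the enum
labels NL_TABLE and CB_LOCK, struct lock_graph, and the helpers graph_init(),
record_nested() and has_inversion() are all hypothetical names invented for
illustration. It only records "lock B was requested while lock A was held"
and checks whether those orderings form a cycle, which is the condition for
the deadlock described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical labels for the two locks discussed in this thread. */
enum lock_id { NL_TABLE, CB_LOCK, NUM_LOCKS };

/* order[a][b] is true when some code path requests b while holding a. */
struct lock_graph {
    bool order[NUM_LOCKS][NUM_LOCKS];
};

static void graph_init(struct lock_graph *g)
{
    memset(g, 0, sizeof(*g));
}

/* Record that 'wanted' is requested while 'held' is already taken. */
static void record_nested(struct lock_graph *g, enum lock_id held,
                          enum lock_id wanted)
{
    assert((int)held >= 0 && (int)held < NUM_LOCKS);
    assert((int)wanted >= 0 && (int)wanted < NUM_LOCKS);
    g->order[held][wanted] = true;
}

/*
 * Return true if the recorded orderings contain a cycle, i.e. some lock
 * can (transitively) be requested while it is already being waited on.
 * Uses a transitive closure over the small ordering matrix.
 */
static bool has_inversion(const struct lock_graph *g)
{
    bool reach[NUM_LOCKS][NUM_LOCKS];
    int i, j, k;

    memcpy(reach, g->order, sizeof(reach));
    for (k = 0; k < NUM_LOCKS; k++)
        for (i = 0; i < NUM_LOCKS; i++)
            for (j = 0; j < NUM_LOCKS; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = true;

    for (i = 0; i < NUM_LOCKS; i++)
        if (reach[i][i])
            return true;
    return false;
}
```

Recording only the bind path, record_nested(&g, NL_TABLE, CB_LOCK), leaves
has_inversion() false. Adding the (un)registration path,
record_nested(&g, CB_LOCK, NL_TABLE), makes it true. The sketch says nothing
about whether a given ordering is legal in the kernel (for example, whether a
sleeping lock may be taken in that context); it only shows the cycle.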

As the kernel test robot reported, you cannot call genl_lock_all() while
holding netlink_table_grab(), which is effectively a write lock.

It seems to me that genl_bind() can simply be removed, since nothing
in-tree uses family->mcast_bind(). Can you test the attached patch?
It seems sufficient to fix this deadlock.

Thanks.
View attachment "genetlink-mcast-bind.diff" of type "text/x-patch" (2716 bytes)