lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <e87f2ab2-d8c3-403a-a35e-46ebf063b626@intel.com>
Date: Thu, 6 Feb 2025 09:18:57 -0800
From: Jacob Keller <jacob.e.keller@...el.com>
To: Jakub Kicinski <kuba@...nel.org>
CC: "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: iAVF circular lock dependency due to netdev_lock



On 2/5/2025 8:32 PM, Jakub Kicinski wrote:
> On Wed, 5 Feb 2025 18:27:40 -0800 Jacob Keller wrote:
>>> Not sure either, the locking in this driver is quite odd. Do you know
>>> why it's registering the netdev from a workqueue, and what the entry
>>> points to the driver are?
>>
>> Yes, the locking in iAVF has been problematic for years :(
>>
>> We register the netdevice from a work queue because we are waiting on
>> messages from the PF over virtchnl. I don't fully understand the
>> motivation behind the way the initialization was moved into a work
>> queue, but this appears to be the historical reasoning from examining
>> commit logs.
>>
>>> Normally before the netdev is registered it can't get called, so all 
>>> the locking is moot. But IDK if we need to protect from some FW
>>> interactions, maybe?  
>>
>> We had a lot of issues with locking and pain getting to the state we are
>> in today. I think part of the challenge is that the VF is communicating
>> asynchronously over virtchnl queue messages to the PF for setup.
>>
>> Its a mess :( I could re-order the locks so we go "RTNL -> crit_lock ->
>> netdev_lock" but that will only work as long as no flow from the kernel
>> does something like "RTNL -> netdev_lock -> <driver callback that takes
>> crit lock>" which seems unlikely :(
>>
>> Its a mess and I don't quite know how to dig out of it.
> 
> Stanislav suggested off list that we could add a _locked() version of
> register_netdevice(). I'm worried that it's just digging a deeper hole.
> We'd cover with the lock parts of core which weren't covered before.
> 

Right. We could add one, but I think that gets pretty tricky as well.

> Maybe the saving grace is that the driver appears to never transition
> out of the registered state. And netdev lock only protects the core.
> So we could elide taking the netdev lock if device is not registered
> yet? We'd still need to convince lockdep that its okay to take the
> netdev lock under crit lock once.
> 
> Code below is incomplete but hopefully it will illustrate. 
> Key unanswered question is how to explain this to lockdep..
> 

This might work but it seems also quite ugly, especially if we have to
make lockdep happy with it somehow.

One other option would be to simply release the crit_lock the same way
we are already releasing the dev lock just before calling register. That
leaves a possible gap where other code could lock and then do something
in the window where things are unlocked.

I'm investigating whether thats even possible given the work queue
design and flow during initialization of the driver. Hopefully I'll have
an answer soon.

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ