netdev - Re: [PATCH net v3] net: ethtool: do runtime PM outside RTNL

Message-ID: <9bcfb259-1249-4efc-b581-056fb0a1c144@gmail.com>
Date: Thu, 4 Jan 2024 10:05:12 +0100
From: Heiner Kallweit <hkallweit1@...il.com>
To: Stanislaw Gruszka <stanislaw.gruszka@...ux.intel.com>,
 Jakub Kicinski <kuba@...nel.org>
Cc: Johannes Berg <johannes@...solutions.net>, netdev@...r.kernel.org,
 Johannes Berg <johannes.berg@...el.com>, Marc MERLIN <marc@...lins.org>,
 Przemek Kitszel <przemyslaw.kitszel@...el.com>
Subject: Re: [PATCH net v3] net: ethtool: do runtime PM outside RTNL

On 04.01.2024 09:25, Stanislaw Gruszka wrote:
> On Wed, Jan 03, 2024 at 03:34:05PM -0800, Jakub Kicinski wrote:
>> On Wed, 3 Jan 2024 11:30:17 +0100 Stanislaw Gruszka wrote:
>>>> I was really, really hoping that this would serve as a motivation
>>>> for Intel to sort out the igb/igc implementation. The flow AFAICT
>>>> is ndo_open() starts the NIC, then calls pm_sus, which shuts the NIC
>>>> back down immediately (!?) then it schedules a link check from a work
>>>
>>> It's not like that. pm_runtime_put() in igc_open() does not disable the
>>> device. It calls the runtime_idle callback, which checks whether there
>>> is link; if there is not, it schedules a device suspend in 5 seconds,
>>> otherwise the device stays running.
>>
>> Hm, I missed the 5 sec delay there. Next question for me is - how does
>> it not deadlock in the open?
>>
>> igc_open()
>>   __igc_open(resuming=false)
>>     if (!resuming)
>>       pm_runtime_get_sync(&pdev->dev);
>>
>> igc_resume()
>>   rtnl_lock()
> 
> If the device was not suspended, pm_runtime_get_sync() will increase
> the dev->power.usage_count counter and cancel the pending rpm suspend
> request, if any. There is a race condition though, more about that
> below.
> 
> If the device was suspended, we could not get to igc_open(), since it
> was marked as detached and would fail the netif_device_present() check
> in __dev_open(). That was the behaviour before bd869245a3dc.
> 
> There is a small race window between igc_open() and the scheduled
> runtime suspend, if dev_open() is done at the same time as
> dev->power.suspend_timer expires:
> 
> open:					pm_suspend_timer_fn:
> 
> rtnl_lock()
> 					rpm_suspend()
> 					  igc_runtime_suspend()
> 					   __igc_shutdown()
> 					     rtnl_lock()
> 
> __igc_open()
>   pm_runtime_get_sync():
>     waits for the rpm suspend callback to finish
> 
> This needs to be addressed, but it is not something that happens
> all the time. To trigger it, someone has to remove the cable and
> run ip link set up exactly 5 seconds later.
> 
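The interleaving quoted above can be modelled in userspace as a tiny
wait-for check. This is only a sketch, not kernel code: struct rpm_model
and the helper names are hypothetical, and the real locking is reduced
to two flags. It assumes exactly what the diagram states: the open path
holds RTNL while pm_runtime_get_sync() waits for the suspend callback,
and the suspend callback needs RTNL to finish.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state; none of these names exist in the kernel. */
struct rpm_model {
	bool rtnl_held_by_open;	/* open path has taken rtnl_lock() */
	bool suspend_in_flight;	/* runtime suspend callback has started */
};

/* The suspend callback finishes only if it can take RTNL itself. */
static bool suspend_can_progress(const struct rpm_model *m)
{
	return m->suspend_in_flight && !m->rtnl_held_by_open;
}

/* pm_runtime_get_sync() in open returns only once no callback is running. */
static bool open_can_progress(const struct rpm_model *m)
{
	return !m->suspend_in_flight;
}

/* Deadlock in this model: neither side can make progress. */
static bool is_deadlocked(const struct rpm_model *m)
{
	return !suspend_can_progress(m) && !open_can_progress(m);
}

/* Walks the three reachable cases of the model. */
static void rpm_model_selfcheck(void)
{
	struct rpm_model race = { .rtnl_held_by_open = true,
				  .suspend_in_flight = true };
	struct rpm_model no_suspend = { .rtnl_held_by_open = true,
					.suspend_in_flight = false };
	struct rpm_model no_open = { .rtnl_held_by_open = false,
				     .suspend_in_flight = true };

	assert(is_deadlocked(&race));		/* the window in the diagram */
	assert(!is_deadlocked(&no_suspend));	/* timer did not fire */
	assert(!is_deadlocked(&no_open));	/* nobody holds RTNL */
}
```

Only the combination "open holds RTNL" plus "suspend callback already
started" gets stuck in this model, which matches the narrow window
described above.
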
For me the main question is the following. In igc_resume() you have

	rtnl_lock();
	if (!err && netif_running(netdev))
		err = __igc_open(netdev, true);

	if (!err)
		netif_device_attach(netdev);
	rtnl_unlock();

Why is the global rtnl_lock() needed here? The netdev is in the detached
state, which protects against e.g. userspace activity; see all the
netif_device_present() checks in net core.
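
To illustrate the point about the detached state, here is a minimal
userspace sketch of that gating. It is not the net core code: struct
model_netdev and the model_* helpers are hypothetical stand-ins, the
-ENODEV return value is an assumption of the model, and it follows the
pre-bd869245a3dc behaviour described by Stanislaw, where open simply
fails while the device is detached.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant bits of a net_device. */
struct model_netdev {
	bool present;	/* what netif_device_present() would report */
	bool running;	/* what netif_running() would report */
};

/* Suspend path: mark the device as detached. */
static void model_device_detach(struct model_netdev *dev)
{
	dev->present = false;
}

/* Resume path: mark the device as attached again. */
static void model_device_attach(struct model_netdev *dev)
{
	dev->present = true;
}

/* Open request from userspace: refused while the device is detached. */
static int model_dev_open(struct model_netdev *dev)
{
	if (!dev->present)
		return -ENODEV;	/* assumed error code for this model */
	dev->running = true;
	return 0;
}

/* Suspend, try to open, resume, open again. */
static void model_demo(void)
{
	struct model_netdev dev = { .present = true, .running = false };

	model_device_detach(&dev);
	assert(model_dev_open(&dev) == -ENODEV);	/* blocked while detached */
	assert(!dev.running);

	model_device_attach(&dev);
	assert(model_dev_open(&dev) == 0);		/* allowed once attached */
	assert(dev.running);
}
```

In this model the open request never reaches the driver between detach
and attach, which is the kind of protection the question above refers
to.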

> Regards
> Stanislaw