lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1396111046.2898.88.camel@deadeye.wl.decadent.org.uk>
Date:	Sat, 29 Mar 2014 16:37:26 +0000
From:	Ben Hutchings <ben@...adent.org.uk>
To:	Russell King - ARM Linux <linux@....linux.org.uk>
Cc:	netdev@...r.kernel.org
Subject: Re: Proper suspend/resume flow

On Wed, 2014-03-26 at 11:58 +0000, Russell King - ARM Linux wrote:
> I've been looking at the Freescale FEC driver recently, and I've noticed
> that it's suspend path looks like this:
> 
>         if (netif_running(ndev)) {
>                 netif_device_detach(ndev);
>                 fec_stop(ndev);
>         }
> 
> I believe this to be unsafe, because if the device is active, then we may
> have packets in the transmit ring.  It may be possible for the ring to be
> cleaned between netif_device_detach() and fec_stop(), resulting in a call
> to netif_wake_queue(), which doesn't seem to take account of whether the
> device has been detached.
>
> I wondered if this was just limited to the freescale driver, and I found
> a similar pattern in e1000e - __e1000e_shutdown() calls netif_device_detach()
> as the very first thing, before doing anything else.  This seems to be a
> common pattern.  8139cp.c does this, the second call seems to be rather
> redundant:
> 
>         netif_device_detach (dev);
>         netif_stop_queue (dev);
> 
> Tulip on the other hand shuts the hardware down and cleans the transmit
> ring first before calling netif_device_detach(), thus ensuring that there
> won't be any transmit completions before this call.  It doesn't stop the
> queue before this, so presumably in a SMP environment, it is possible for
> packets to get queued after the transmit ring has been cleaned.
> 
> So, what is the correct approach to suspending a net device?  When should
> netif_device_detach() be called?
[...]

My advice here is based on Solarflare's extensive stress-testing of the
sfc driver control path and the race conditions it uncovered.  It did
not cover suspend/resume, but it does cover MTU and ring size changes
which involve the same sort of stop/start while the device is still
marked as running.

- Given the way the watchdog works, I think netif_device_detach() must
be called first if you are going to stop TX queues in a control
operation other than ndo_stop (commits e4abce853849, 29c69a488264).

- The driver must check netif_device_present() before calling
netif_wake_queue().  Perhaps netif_wake_queue() itself should check
that, if many drivers need to do it.

- Calls to netif_device_detach() may need to be synchronised with the TX
scheduler (commits c2f3b8e3a44b, 35205b211c8d).  To support this, the
function efx_device_detach_sync() maybe should be turned into a general
netif_device_detach_sync() (and then possibly needs to disable IRQs).

Ben.

-- 
Ben Hutchings
[W]e found...that it wasn't as easy to get programs right as we had thought.
... I realized that a large part of my life from then on was going to be spent
in finding mistakes in my own programs. - Maurice Wilkes, 1949

Download attachment "signature.asc" of type "application/pgp-signature" (812 bytes)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ