Date:	Thu, 6 Jun 2013 13:48:21 -0700
From:	Tejun Heo <tj@...nel.org>
To:	greearb@...delatech.com
Cc:	linux-kernel@...r.kernel.org, eric.dumazet@...il.com
Subject: Re: [PATCH v2] Fix lockup related to stop_machine being stuck in __do_softirq.

On Thu, Jun 06, 2013 at 09:10:06AM -0700, greearb@...delatech.com wrote:
> From: Ben Greear <greearb@...delatech.com>
> 
> The stop machine logic can lock up if all but one of
> the migration threads make it through the disable-irq
> step and the one remaining thread gets stuck in
> __do_softirq.  The reason __do_softirq can hang is
> that it has a bail-out based on jiffies timeout, but
> in the lockup case, jiffies itself is not incremented.
> 
> To work around this, re-add the max_restart counter in __do_softirq
> and stop processing softirqs after 10 restarts.
> 
> Thanks to Tejun Heo and Rusty Russell and others for
> helping me track this down.
> 
> This was introduced in 3.9 by commit:  c10d73671ad30f5469
> (softirq:  reduce latencies).
> 
> It may be worth looking into ath9k to see if it has issues with
> its irq handler at a later date.
> 
> The hang stack traces look something like this:

Oops, you already posted the second version.  :)

>  /*
> - * We restart softirq processing for at most 2 ms,
> - * and if need_resched() is not set.
> + * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
> + * but break the loop if need_resched() is set or after 2 ms.
>   *
>   * These limits have been established via experimentation.
>   * The two things to balance is latency against fairness -
> @@ -204,6 +204,7 @@ EXPORT_SYMBOL(local_bh_enable_ip);
>   * should not be able to lock up the box.
>   */
>  #define MAX_SOFTIRQ_TIME  msecs_to_jiffies(2)
> +#define MAX_SOFTIRQ_RESTART 10

As I wrote before, a brief explanation of why both limits are necessary
would be nice.  Something like - "the time limit keeps softirq handling
from introducing excessive latency, and the loop limit protects against
softirq runaway, which may happen during stop_machine - see XXX".
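
To illustrate why the two limits are not redundant, here is a minimal
user-space sketch.  It is not the kernel code: struct sim, handle_pending(),
softirq_loop(), simulate(), the tick-based clock and the safety_cap
parameter are all hypothetical names made up for this model.  Only
MAX_SOFTIRQ_RESTART and the shape of the restart condition come from the
patch above.  The model assumes softirqs are always pending again after
each pass, which is the runaway case.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SOFTIRQ_TIME_TICKS 2   /* stand-in for msecs_to_jiffies(2) */
#define MAX_SOFTIRQ_RESTART    10  /* value taken from the patch */

/* Hypothetical model state; none of this exists in the kernel. */
struct sim {
	unsigned long ticks;   /* stand-in for jiffies */
	bool clock_running;    /* false models jiffies not being incremented */
	bool need_resched;     /* stand-in for need_resched() */
};

/* One pass of pending-softirq handling; costs one tick if the clock runs. */
static void handle_pending(struct sim *s)
{
	if (s->clock_running)
		s->ticks++;
}

/*
 * Returns the number of passes made before the loop gives up.
 * safety_cap exists only so the demo terminates when no limit fires.
 */
static int softirq_loop(struct sim *s, bool use_restart_limit, int safety_cap)
{
	unsigned long end = s->ticks + MAX_SOFTIRQ_TIME_TICKS;
	int max_restart = MAX_SOFTIRQ_RESTART;
	int passes = 0;

	assert(safety_cap > 0);

	for (;;) {
		handle_pending(s);
		passes++;
		if (passes >= safety_cap)
			break;  /* demo only: the real loop would spin here */

		/* Assumption: new softirqs are always pending (runaway). */
		bool time_ok = s->ticks < end;
		bool restart_ok = !use_restart_limit || --max_restart > 0;

		if (time_ok && !s->need_resched && restart_ok)
			continue;  /* corresponds to "goto restart" */
		break;
	}
	return passes;
}

/* Convenience wrapper: fresh state, 1000-pass safety cap. */
static int simulate(bool clock_running, bool use_restart_limit)
{
	struct sim s = { 0, clock_running, false };

	return softirq_loop(&s, use_restart_limit, 1000);
}
```

In this model simulate(false, false) only stops at the artificial safety
cap, because a frozen clock never reaches the time limit, while
simulate(false, true) stops after MAX_SOFTIRQ_RESTART passes.  With the
clock running, the time limit ends the loop first in both cases.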

Please cc Linus and also cc stable@...r.kernel.org.  We definitely
want this backported.

Thanks a lot!

-- 
tejun