Message-ID: <20130605211157.GK10693@mtj.dyndns.org>
Date:	Wed, 5 Jun 2013 14:11:57 -0700
From:	Tejun Heo <tj@...nel.org>
To:	Ben Greear <greearb@...delatech.com>
Cc:	Rusty Russell <rusty@...tcorp.com.au>,
	Joe Lawrence <joe.lawrence@...atus.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	stable@...r.kernel.org,
	"Luis R. Rodriguez" <mcgrof@....qualcomm.com>,
	Jouni Malinen <jouni@....qualcomm.com>,
	Vasanthakumar Thiagarajan <vthiagar@....qualcomm.com>,
	Senthil Balasubramanian <senthilb@....qualcomm.com>,
	linux-wireless@...r.kernel.org, ath9k-devel@...ts.ath9k.org,
	Thomas Gleixner <tglx@...utronix.de>,
	Ingo Molnar <mingo@...hat.com>
Subject: Re: stop_machine lockup issue in 3.9.y.

(cc'ing wireless crowd, tglx and Ingo.  The original thread is at
 http://thread.gmane.org/gmane.linux.kernel/1500158/focus=55005 )

Hello, Ben.

On Wed, Jun 05, 2013 at 01:58:31PM -0700, Ben Greear wrote:
> Hmm, wonder if I found it.  I previously saw times where it appears
> jiffies does not increment.  __do_softirq has a break-out based on
> jiffies timeout.  Maybe that is failing to get us out of __do_softirq
> in my lockup case because for whatever reason the system cannot update
> jiffies in this case?
> 
> I added this (probably whitespace damaged) hack and now I have not been
> able to reproduce the problem.

Ah, nice catch. :)

> diff --git a/kernel/softirq.c b/kernel/softirq.c
> index 14d7758..621ea3b 100644
> --- a/kernel/softirq.c
> +++ b/kernel/softirq.c
> @@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
>         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
>         int cpu;
>         unsigned long old_flags = current->flags;
> +       unsigned long loops = 0;
> 
>         /*
>          * Mask out PF_MEMALLOC s current task context is borrowed for the
> @@ -241,6 +242,7 @@ restart:
>                         unsigned int vec_nr = h - softirq_vec;
>                         int prev_count = preempt_count();
> 
> +                       loops++;
>                         kstat_incr_softirqs_this_cpu(vec_nr);
> 
>                         trace_softirq_entry(vec_nr);
> @@ -265,7 +267,7 @@ restart:
> 
>         pending = local_softirq_pending();
>         if (pending) {
> -               if (time_before(jiffies, end) && !need_resched())
> +               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
>                         goto restart;

So, a softirq, most likely kicked off from ath9k, is rescheduling
itself to the extent that it ends up locking out the CPU completely.
This is usually harmless because the processing would break out in 2ms,
but as jiffies is stopped in this case with all other CPUs trapped in
stop_machine, the loop never breaks and the machine hangs.  While
adding the counter limit probably isn't a bad idea, a softirq
requeueing itself indefinitely sounds pretty buggy.
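
To make the failure mode concrete, here is a minimal user-space sketch
of the restart loop.  It is not kernel code: every sim_* name, the
fake tick counter and the hang guard are made up for illustration, and
the only things taken from the patch above are the jiffies-based
deadline and the 500-loop cap.  It assumes the handler always
re-raises itself, which is what seems to be happening here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; none of these are kernel symbols. */
#define SIM_MAX_SOFTIRQ_TIME 2UL        /* time budget, in fake ticks */
#define SIM_MAX_LOOPS        500UL      /* loop cap from the patch above */
#define SIM_HANG_GUARD       1000000UL  /* simulation only: report a hang here */

/* Wraparound-safe "a is before b", the same idea as time_before(). */
static bool sim_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

struct sim_result {
	unsigned long restarts;  /* passes through the restart loop */
	bool hung;               /* true if only the hang guard stopped us */
};

/*
 * tick_step:      how far the fake jiffies advances per pass (0 = stalled,
 *                 as when every other CPU sits in stop_machine)
 * use_loop_limit: whether the loop cap is applied
 */
static struct sim_result sim_do_softirq(unsigned long tick_step,
					bool use_loop_limit)
{
	unsigned long fake_jiffies = 0;
	unsigned long end = fake_jiffies + SIM_MAX_SOFTIRQ_TIME;
	unsigned long loops = 0;
	struct sim_result res = { 0, false };

	/* The cap must trip before the simulation-only guard does. */
	assert(SIM_MAX_LOOPS < SIM_HANG_GUARD);

	for (;;) {
		/* The handler runs and re-raises itself: work is always pending. */
		loops++;
		fake_jiffies += tick_step;

		if (!sim_time_before(fake_jiffies, end))
			break;          /* time budget used up, normal exit */
		if (use_loop_limit && loops >= SIM_MAX_LOOPS)
			break;          /* loop cap reached */
		if (loops >= SIM_HANG_GUARD) {
			res.hung = true;  /* the real loop would spin forever */
			break;
		}
	}

	res.restarts = loops;
	return res;
}
```

Calling sim_do_softirq(1, false) leaves via the deadline, while
sim_do_softirq(0, false) only stops at the simulation's hang guard.
With the cap enabled, sim_do_softirq(0, true) gets out even though the
fake jiffies never moves.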

ath9k people, do you guys have any idea what's going on?  Why would
a softirq repeat itself indefinitely?

Ingo, Thomas, we're seeing stop_machine hang because:

* All other CPUs have entered the IRQ-disabled stage.  Jiffies is not
  being updated.

* The last CPU gets caught up executing softirqs indefinitely.  As
  jiffies doesn't get updated, it never breaks out of softirq
  handling.  This is a deadlock.  This CPU won't break out of softirq
  handling unless jiffies is updated, and the other CPUs can't do
  anything until this CPU enters the same stop_machine stage.

Ben found out that breaking out of softirq handling after a certain
number of repetitions makes the issue go away, which isn't a proper
fix but is something we might want anyway.  What do you guys think?

Thanks.

-- 
tejun