linux-kernel - Re: stop_machine lockup issue in 3.9.y.

Message-ID: <51AFA677.9010605@candelatech.com>
Date:	Wed, 05 Jun 2013 13:58:31 -0700
From:	Ben Greear <greearb@...delatech.com>
To:	Tejun Heo <tj@...nel.org>
CC:	Rusty Russell <rusty@...tcorp.com.au>,
	Joe Lawrence <joe.lawrence@...atus.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
	stable@...r.kernel.org
Subject: Re: stop_machine lockup issue in 3.9.y.

On 06/05/2013 12:31 PM, Ben Greear wrote:
> This is no longer really about the module unlink, so changing
> subject.
>
> On 06/05/2013 12:11 PM, Ben Greear wrote:
>> On 06/05/2013 11:48 AM, Tejun Heo wrote:
>>> Hello, Ben.
>>>
>>> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>>>> One pattern I notice repeating for at least most of the hangs is that all but one
>>>> CPU thread has irqs disabled and is in state 2.  But, there will be one thread
>>>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>>>> instead of hard-lockup.  In 'sysrq l' it always shows some IRQ processing,
>>>> but typically that of the sysrq itself.  I added printk that would always
>>>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>>>> thread (cpu 2 below) never shows that message.
>>>
>>> It sounds like one of the cpus get live-locked by IRQs.  I can't tell
>>> why the situation is made worse by other CPUs being tied up.  Do you
>>> ever see CPUs being live locked by IRQs during normal operation?

Hmm, I wonder if I found it.  Earlier I saw times when jiffies appeared
not to increment.  __do_softirq has a break-out based on a jiffies
timeout.  Maybe that break-out is failing to get us out of __do_softirq
in my lockup case because, for whatever reason, the system cannot update
jiffies?

I added this (probably whitespace damaged) hack and now I have not been
able to reproduce the problem.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 14d7758..621ea3b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
         unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
         int cpu;
         unsigned long old_flags = current->flags;
+       unsigned long loops = 0;

         /*
          * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -241,6 +242,7 @@ restart:
                         unsigned int vec_nr = h - softirq_vec;
                         int prev_count = preempt_count();

+                       loops++;
                         kstat_incr_softirqs_this_cpu(vec_nr);

                         trace_softirq_entry(vec_nr);
@@ -265,7 +267,7 @@ restart:

         pending = local_softirq_pending();
         if (pending) {
-               if (time_before(jiffies, end) && !need_resched())
+               if (time_before(jiffies, end) && !need_resched() && (loops < 500))
                         goto restart;

                 wakeup_softirqd();
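
To illustrate why the extra bound might matter, here is a minimal userspace sketch of the restart decision. It is not kernel code: fake_jiffies, ticks_advance, fake_time_before(), run_softirq_model() and the FAKE_* constants are hypothetical stand-ins invented for this illustration, softirq work is assumed to be always pending, and need_resched() is left out. It only shows that a time-based break-out never fires if the tick counter stalls, while a loop cap still does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; these are not real kernel symbols. */
static unsigned long fake_jiffies;   /* simulated tick counter */
static bool ticks_advance;           /* false models a jiffies that never updates */

#define FAKE_MAX_SOFTIRQ_TIME 2UL    /* deadline, in simulated ticks (assumed value) */
#define FAKE_MAX_LOOPS        500UL  /* same cap as the hack above */

/* Wrap-safe "a is before b", the same idea as the kernel's time_before(). */
static bool fake_time_before(unsigned long a, unsigned long b)
{
    return (long)(a - b) < 0;
}

/*
 * Model of the restart loop with work always pending. Each pass stands
 * for one handler run. Returns how many handlers ran before the loop
 * gave up; a return value equal to give_up means the model never broke
 * out on its own and the simulation guard stopped it.
 */
static unsigned long run_softirq_model(bool use_loop_cap, unsigned long give_up)
{
    unsigned long end = fake_jiffies + FAKE_MAX_SOFTIRQ_TIME;
    unsigned long loops = 0;

    assert(give_up > 0);

    for (;;) {
        loops++;                              /* one handler ran */

        /* In this model a tick arrives every 100 handlers, if ticks work. */
        if (ticks_advance && loops % 100 == 0)
            fake_jiffies++;

        if (loops >= give_up)
            return loops;                     /* simulation guard only */

        /* Same shape as the patched condition, minus need_resched(). */
        if (fake_time_before(fake_jiffies, end) &&
            (!use_loop_cap || loops < FAKE_MAX_LOOPS))
            continue;                         /* "goto restart" */

        return loops;                         /* kernel would wake ksoftirqd here */
    }
}
```

With ticks_advance set to true, the model leaves after 200 handlers because the deadline passes. With ticks_advance set to false and no cap, it runs until the simulation guard stops it; with the cap enabled it leaves after 500 handlers, which is the behaviour the hack relies on.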

Thanks,
Ben

-- 
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc  http://www.candelatech.com
