Message-ID: <51AFA677.9010605@candelatech.com>
Date: Wed, 05 Jun 2013 13:58:31 -0700
From: Ben Greear <greearb@...delatech.com>
To: Tejun Heo <tj@...nel.org>
CC: Rusty Russell <rusty@...tcorp.com.au>,
Joe Lawrence <joe.lawrence@...atus.com>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
stable@...r.kernel.org
Subject: Re: stop_machine lockup issue in 3.9.y.
On 06/05/2013 12:31 PM, Ben Greear wrote:
> This is no longer really about the module unlink, so changing
> subject.
>
> On 06/05/2013 12:11 PM, Ben Greear wrote:
>> On 06/05/2013 11:48 AM, Tejun Heo wrote:
>>> Hello, Ben.
>>>
>>> On Wed, Jun 05, 2013 at 09:59:00AM -0700, Ben Greear wrote:
>>>> One pattern I notice repeating for at least most of the hangs is that all but one
>>>> CPU thread has irqs disabled and is in state 2. But, there will be one thread
>>>> in state 1 that still has IRQs enabled and it is reported to be in soft-lockup
>>>> instead of hard-lockup. In 'sysrq l' it always shows some IRQ processing,
>>>> but typically that of the sysrq itself. I added printk that would always
>>>> print if the thread notices that smdata->state != curstate, and the soft-lockup
>>>> thread (cpu 2 below) never shows that message.
>>>
>>> It sounds like one of the cpus get live-locked by IRQs. I can't tell
>>> why the situation is made worse by other CPUs being tied up. Do you
>>> ever see CPUs being live locked by IRQs during normal operation?

Hmm, I wonder if I found it. I previously saw cases where jiffies
appears not to increment. __do_softirq has a break-out based on a
jiffies timeout. Maybe that break-out is failing to get us out of
__do_softirq in my lockup case because, for whatever reason, the
system cannot update jiffies?

I added this (probably whitespace-damaged) hack, and since then I have
not been able to reproduce the problem.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 14d7758..621ea3b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -212,6 +212,7 @@ asmlinkage void __do_softirq(void)
 	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
 	int cpu;
 	unsigned long old_flags = current->flags;
+	unsigned long loops = 0;
 	/*
 	 * Mask out PF_MEMALLOC s current task context is borrowed for the
@@ -241,6 +242,7 @@ restart:
 			unsigned int vec_nr = h - softirq_vec;
 			int prev_count = preempt_count();
+			loops++;
 			kstat_incr_softirqs_this_cpu(vec_nr);
 			trace_softirq_entry(vec_nr);
@@ -265,7 +267,7 @@ restart:
 	pending = local_softirq_pending();
 	if (pending) {
-		if (time_before(jiffies, end) && !need_resched())
+		if (time_before(jiffies, end) && !need_resched() && (loops < 500))
 			goto restart;
 		wakeup_softirqd();
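
For anyone following along, here is a minimal user-space sketch of the
idea, not kernel code. It assumes a simulated clock standing in for
jiffies; the names sim_cpu, sim_do_softirq, time_before_ul and MAX_LOOPS
are made up for illustration. Only the cap of 500 and the
time_before-style comparison are taken from the hack above. With a stuck
clock the time budget never expires, so only the loop cap ends the
processing early.

```c
#include <stdbool.h>

#define MAX_LOOPS 500UL	/* same cap as the hack above */

/* Wrap-safe "a is earlier than b", in the style of time_before(). */
static bool time_before_ul(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Hypothetical stand-in for per-CPU state; not a kernel structure. */
struct sim_cpu {
	unsigned long jiffies;	/* simulated clock */
	unsigned long tick;	/* clock advance per handler; 0 models a stuck clock */
	unsigned long pending;	/* units of work still queued */
};

/*
 * Handle pending work until it runs out, the time budget expires, or
 * (when cap_loops is set) MAX_LOOPS handlers have run.
 * Returns the number of handlers that ran.
 */
static unsigned long sim_do_softirq(struct sim_cpu *cpu,
				    unsigned long timeout, bool cap_loops)
{
	unsigned long end = cpu->jiffies + timeout;
	unsigned long loops = 0;

	while (cpu->pending) {
		cpu->pending--;			/* "run" one handler */
		loops++;
		cpu->jiffies += cpu->tick;	/* the clock may or may not move */

		if (!time_before_ul(cpu->jiffies, end))
			break;			/* time budget used up */
		if (cap_loops && loops >= MAX_LOOPS)
			break;			/* loop cap reached */
	}
	return loops;
}
```

With tick = 0 and 100000 units pending, the uncapped version runs all
100000 handlers before returning, while the capped version stops after
500 and leaves the rest for later (in the kernel that would be
ksoftirqd's job).
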
Thanks,
Ben
--
Ben Greear <greearb@...delatech.com>
Candela Technologies Inc http://www.candelatech.com