Message-Id: <20080423.052928.142823681.davem@davemloft.net>
Date: Wed, 23 Apr 2008 05:29:28 -0700 (PDT)
From: David Miller <davem@...emloft.net>
To: mingo@...e.hu
Cc: linux-kernel@...r.kernel.org, tglx@...utronix.de,
a.p.zijlstra@...llo.nl
Subject: Re: [patch] softlockup: fix false positives on nohz if CPU is 100%
idle for more than 60 seconds
From: David Miller <davem@...emloft.net>
Date: Wed, 23 Apr 2008 03:55:44 -0700 (PDT)
> It may take some time, as each test run to verify the existence
> of the problem takes several minutes.
Ok, Ingo, none of your patches fix even the initial buggy
changeset, for reference:

commit 27ec4407790d075c325e1f4da0a19c56953cce23
Author: Ingo Molnar <mingo@...e.hu>
Date:   Thu Feb 28 21:00:21 2008 +0100

    sched: make cpu_clock() globally synchronous

    Alexey Zaytsev reported (and bisected) that the introduction of
    cpu_clock() in printk made the timestamps jump back and forth.

    Make cpu_clock() more reliable while still keeping it fast when it's
    called frequently.

    Signed-off-by: Ingo Molnar <mingo@...e.hu>

I checked out a tree to the changeset before this one, just
to double check, and there are no problems.

I add that changeset and I get softlockup warnings like crazy
in my logs.

I added your "move touch_softlockup_watchdog() earlier in
tick_nohz_update_jiffies()" patch:
--------------------
Index: linux/kernel/time/tick-sched.c
===================================================================
--- linux.orig/kernel/time/tick-sched.c
+++ linux/kernel/time/tick-sched.c
@@ -133,8 +133,6 @@ void tick_nohz_update_jiffies(void)
 	if (!ts->tick_stopped)
 		return;
 
-	touch_softlockup_watchdog();
-
 	cpu_clear(cpu, nohz_cpu_mask);
 	now = ktime_get();
 	ts->idle_waketime = now;
@@ -142,6 +140,8 @@ void tick_nohz_update_jiffies(void)
 	local_irq_save(flags);
 	tick_do_update_jiffies64(now);
 	local_irq_restore(flags);
+
+	touch_softlockup_watchdog();
 }
 
 void tick_nohz_stop_idle(int cpu)
--------------------
and still I get mountains of softlockup messages, see first
attachment, below.

I then added your patch, just to make sure, which adds the
missing prev_cpu_time assignment, specifically:
--------------------
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1001,6 +1001,8 @@ unsigned long long notrace cpu_clock(int
 	if (unlikely(delta_time > time_sync_thresh))
 		time = __sync_cpu_clock(time, cpu);
 
+	per_cpu(prev_cpu_time, cpu) = time;
+
 	return time;
 }
 EXPORT_SYMBOL_GPL(cpu_clock);
--------------------
Same problem, see second attachment, below.

But, to be honest, this is starting to become an exercise in futility.
None of your patches fix anything. Something is buggy about how your
new cpu_clock() stuff works. I'm trying to figure out when you're
going to finally at least go: "I can't figure out the problem, let's
revert until I have a better idea."

FWIW, I have a perfect globally synchronized TICK source on this
system.

And even with this fix there are so many other regressions that cause
similar spurious softlockup reports and even full on cpu hangs, all
seemingly added by the sched tree.

In my opinion this sched tree merge the other day is one of THE WORST
merges in recent memory. Linus's tree is currently a sizzling pile of
poo, I can't get any of my own merge work done, and I'm stuck here
hunting down regressions you've added because of it. :-/

We can't even get past one of the regressions added by that tree, and
it's been two days of my working on this non-stop.
View attachment "bug1.log" of type "Text/Plain" (75020 bytes)
View attachment "bug2.log" of type "Text/Plain" (87214 bytes)