Message-ID: <20120315161422.GC19855@tiehlicka.suse.cz>
Date: Thu, 15 Mar 2012 17:14:22 +0100
From: Michal Hocko <mhocko@...e.cz>
To: Don Zickus <dzickus@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
LKML <linux-kernel@...r.kernel.org>, Ingo Molnar <mingo@...e.hu>,
Peter Zijlstra <a.p.zijlstra@...llo.nl>,
Mandeep Singh Baines <msb@...omium.org>
Subject: Re: [PATCH] watchdog: Make sure the watchdog thread gets CPU on
loaded system
On Thu 15-03-12 11:54:13, Don Zickus wrote:
> On Thu, Mar 15, 2012 at 09:02:32AM +0100, Michal Hocko wrote:
> > On Wed 14-03-12 16:19:06, Andrew Morton wrote:
> > > On Wed, 14 Mar 2012 16:38:45 -0400
> > > Don Zickus <dzickus@...hat.com> wrote:
> > >
> > > > From: Michal Hocko <mhocko@...e.cz>
> > >
> > > This changelog is awful.
>
> My apologies too, Andrew for not being more diligent.
>
> Some nitpicks below (hopefully it isn't too picky :-( )
Thanks! Updated
---
From a8da58750ba78d737136a4df24af805cb936ee00 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@...e.cz>
Date: Tue, 13 Mar 2012 10:34:44 +0100
Subject: [PATCH] watchdog: make sure the watchdog thread gets CPU on loaded
system
If the system is heavily loaded while hotplugging a CPU, we might end up
with a bogus hardlockup detection. This has been seen during the LTP
pounder test executed in parallel with the hotplug test.
The hard lockup detector consists of two parts:
	- watchdog_overflow_callback (executed as a perf counter callback
	  from NMI), which checks whether the per-cpu hrtimer_interrupts
	  counter has changed since the last time it ran and panics if not
	- the watchdog kernel thread, which starts watchdog_hrtimer, which
	  periodically updates hrtimer_interrupts.
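
For illustration only, the interplay of the two parts can be modelled in
userspace C. This is a sketch, not the code in kernel/watchdog.c: struct
cpu_state, hrtimer_tick() and nmi_check() are hypothetical names standing
in for the per-cpu variables and the two callbacks, and the check returns
a flag where the real callback would panic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-cpu detector state. */
struct cpu_state {
	unsigned long hrtimer_interrupts;       /* bumped by the hrtimer */
	unsigned long hrtimer_interrupts_saved; /* value seen by the last check */
};

/* Models the hrtimer side: shows the CPU still services interrupts. */
void hrtimer_tick(struct cpu_state *cpu)
{
	assert(cpu);
	cpu->hrtimer_interrupts++;
}

/* Models the NMI side: true means "hard lockup suspected". */
bool nmi_check(struct cpu_state *cpu)
{
	assert(cpu);
	/* No hrtimer interrupt since the previous check: report a lockup. */
	if (cpu->hrtimer_interrupts == cpu->hrtimer_interrupts_saved)
		return true;
	/* Progress was made: remember the new value for the next check. */
	cpu->hrtimer_interrupts_saved = cpu->hrtimer_interrupts;
	return false;
}
```

In this model a freshly zeroed cpu_state makes nmi_check() report a lockup
until hrtimer_tick() has run at least once.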
The main problem is that watchdog_enable (called when a CPU is brought up)
registers a perf event but the hrtimer is started later when the watchdog
thread gets a chance to run.
The watchdog thread currently starts with normal priority and boosts
itself as soon as it gets to a CPU. This might, however, already be too
late, as demonstrated by the LTP pounder test executed in parallel with
the LTP hotplug test. There are zillions of userspace processes sitting in
the runqueue while the number of online CPUs drops to 1. CPUs are
onlined back in the second stage, which is where the issue triggers.
When we online a CPU and create the watchdog kernel thread, it will take
some time until it gets to a CPU. The perf counter callback, on the other
hand, is executed in a timely fashion, so we explode the first time it
finds out that hrtimer_interrupts wasn't incremented.
Let's fix this by boosting the watchdog thread priority before we wake it up
rather than when it's already running.
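
Why the ordering matters can be shown with a toy model in userspace C.
Everything in it is an assumption made for illustration: struct
hotplug_model, its fields, start_delay() and bogus_hardlockup() are
hypothetical, the scheduler is reduced to a plain round robin over
normal-priority tasks, and the tick values do not correspond to real
kernel periods.

```c
#include <stdbool.h>

/* Toy model only: all fields are illustrative, not kernel values. */
struct hotplug_model {
	unsigned int nr_runnable;    /* normal-prio tasks queued on the CPU */
	unsigned int slice;          /* ticks each queued task runs */
	unsigned int hrtimer_period; /* ticks from hrtimer start to 1st irq */
	unsigned int nmi_period;     /* ticks from perf event to 1st check */
};

/* Ticks until the watchdog thread first runs and starts the hrtimer. */
unsigned int start_delay(const struct hotplug_model *m,
			 bool boosted_before_wakeup)
{
	/* A FIFO task preempts all normal-priority tasks right away. */
	if (boosted_before_wakeup)
		return 0;
	/* Otherwise it waits behind everything already queued. */
	return m->nr_runnable * m->slice;
}

/* True if the first check would find hrtimer_interrupts unchanged. */
bool bogus_hardlockup(const struct hotplug_model *m,
		      bool boosted_before_wakeup)
{
	unsigned int first_irq = start_delay(m, boosted_before_wakeup) +
				 m->hrtimer_period;

	return first_irq > m->nmi_period;
}
```

With, say, 1000 queued tasks, a slice of 1, an hrtimer period of 4 and an
NMI period of 10, the model reports a bogus lockup only when the thread is
not boosted before wakeup. It does not cover competing FIFO tasks.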
This still doesn't handle the case where we have the same number of
high-priority FIFO tasks, but that doesn't seem to be common. The current
implementation doesn't handle that case either, so this is at least no
worse.
Unfortunately, we cannot start the perf counter from the watchdog thread
because we could miss a real lockup, and we cannot start the hrtimer
from watchdog_enable because there is no way (at least I don't know of
any) to start a hrtimer from a different CPU.
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic