[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130410151041.GA6670@roeck-us.net>
Date: Wed, 10 Apr 2013 08:10:41 -0700
From: Guenter Roeck <linux@...ck-us.net>
To: Don Zickus <dzickus@...hat.com>
Cc: Dave Young <dyoung@...hat.com>, linux-watchdog@...r.kernel.org,
kexec@...ts.infradead.org, wim@...ana.be,
LKML <linux-kernel@...r.kernel.org>, vgoyal@...hat.com
Subject: Re: [RFC PATCH] watchdog: Add hook for kicking in kdump path
On Wed, Apr 10, 2013 at 10:20:55AM -0400, Don Zickus wrote:
> On Wed, Apr 10, 2013 at 06:51:23AM -0700, Guenter Roeck wrote:
> > On Wed, Apr 10, 2013 at 09:40:39AM -0400, Don Zickus wrote:
> > > On Tue, Apr 09, 2013 at 09:07:58AM -0700, Guenter Roeck wrote:
> > > > > > Just look for the use of mod_timer in the watchdog directory.
> > > > >
> > > > > So looking at the mod_timer logic in various drivers, it seems regardless
> > > > > if the /dev/watchdog device is opened or not, if it is running, it will
> > > > > automagically kick the watchdog.
> > > > >
> > > > yes
> > > >
> > > > > This seems that we can avoid pulling in userspace pieces for this. Just
> > > > > load the driver and the hardware starts getting kicked.
> > > > >
> > > > Only if it is already running. Also, you don't want to rely on it, because you
> > > > lose protection against user space issues.
> > >
> > > IOW if something goes wrong with a runaway userspace app, the kernel
> > > blindly continues to kick the watchdog, which masks the problem, right?
> > >
> > That would be wrong if any of the drivers does that. The kernel should stop
> > kicking after the software timeout expires.
> >
> > For example, if the HW needs to be kicked every second, and the high level
> > timeout is set to one minute, the driver should keep kicking the hardware
> > watchdog for one minute and then stop doing it if /dev/watchdog was opened
> > and userspace is silent.
>
> Ah ok.
>
> >
> > > >
> > > > A second use is if the hw watchdog needs to be pinged more often than user
> > > > space can provide. Some of the HW watchdogs need a ping in one-second intervals
> > > > or even faster.
> > > >
> > > > > Is that true? And if so, do all drivers detect if the hardware is already
> > > > > running during their init? Or is it based on the first device open?
> > > > >
> > > > It is usually done in the probe function.
> > >
> > > Ok. Thanks for the understanding of how the softdog stuff works.
> > >
> > > However, we still have the problem that if the machine panics and we want
> > > to jump into the kdump kernel, we need to 'kick' the watchdog one more
> > > time. This provides us a sane sync point for determining how long we have
> > > to load the watchdog driver in the second kernel before the hardware
> > > reboots us. Otherwise the reboots are pretty random and nothing is
> > > guaranteed.
> > >
> > > Hence the need for some sort of patch resembling the one I posted.
> > >
> > > Soooooooo, any thoughts about that patch and what changes I should make?
> > > :-)
> > >
> > The FIXME is a problem, and I think the name and scope would have to be
> > more generic (watchdog_kick ?). Also, it doesn't solve the problem
> > of having multiple open watchdogs (my system has three, for example),
> > and it doesn't check if the watchdog is running.
>
> Ok. I didn't know the watchdog subsystem well enough, so I just took
> stabs in the dark about how things should work. I appreciate the
> feedback.
>
> I could make the name more generic. I wasn't sure if the watchdog
> community would frown on that. The FIXME is a problem, I am not sure how
I don't know what others think, but I would prefer a more generic name.
That its current use is for kdump is besides the point; there could be
other valid uses.
> to handle the 'fail' scenario (can't get the mutex with trylock). And I
I would just return without doing anything. After all, if the mutex is held,
it is possible if not likely that the code crashed _in_ the watchdog code,
so it might be better to just let it crash and burn in that case.
> have no idea how to even find out if multiple watchdogs are open on the
> system. Is there a list I could walk? And with regard to 'watchdog is
/* the dev_t structure to store the dynamically allocated watchdog devices */
static dev_t watchdog_devt;
One way to look up the allocated watchdogs might be to loop through all kobj
instances for the major device using kobj_lookup. Don't know if there is a
better way.
> running', I thought 'watchdog_active' would do that. But again, I could
> be misreading the code.
>
You are right. Missed that part, sorry.
Guenter
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists