Message-ID: <87ehe77lt2.fsf@xmission.com>
Date: Thu, 18 Apr 2013 11:09:29 -0700
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Don Zickus <dzickus@...hat.com>
Cc: linux-watchdog@...r.kernel.org, kexec@...ts.infradead.org,
wim@...ana.be, LKML <linux-kernel@...r.kernel.org>,
vgoyal@...hat.com, dyoung@...hat.com,
Guenter Roeck <linux@...ck-us.net>
Subject: Re: [PATCH v3] watchdog: Add hook for kicking in kdump path

Don Zickus <dzickus@...hat.com> writes:

> On Thu, Apr 18, 2013 at 09:35:05AM -0700, Eric W. Biederman wrote:
>> Don Zickus <dzickus@...hat.com> writes:
>>
>> > A common problem with kdump is that during the boot up of the
>> > second kernel, the hardware watchdog times out and reboots the
>> > machine before a vmcore can be captured.
>> >
>> > Instead of telling customers to disable their hardware watchdog
>> > timers, I hacked up a hook to put in the kdump path that provides
>> > one last kick before jumping into the second kernel.
>>
>> Having thought about this a little more, this patch is actively wrong.
>>
>> The problem is you can easily be petting the watchdog in violation of
>> whatever policy is normally in place. Which means that this extra
>> petting can result in a system that is unavailable for an unacceptably
>> long period of time.
>
> Not really, just an extra period which isn't that much. This would only
> be noticeable if kdump is set up and enabled and then _hung_, otherwise it
> just quickly reboots and no one notices. :-)

For the folks who care, the definition of acceptable unavailability would
look like: watchdog timeout + max boot time + margin of error. So it
is possible for an extra watchdog pet to eat up or exceed your margin
of error.
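
To make the arithmetic concrete, here is a minimal sketch of that budget
in C. The struct, the function names, and the worst-case model are
assumptions made for illustration, not anything from the patch: it
assumes each extra pet restarts the full countdown, that the crash lands
just before the watchdog would have fired, and that the capture kernel
then hangs.

```c
#include <assert.h>
#include <stdbool.h>

/* All durations are in seconds. Names are illustrative only. */
struct outage_budget {
	unsigned int watchdog_timeout;	/* countdown before the watchdog resets the box */
	unsigned int max_boot_time;	/* longest expected boot back into service */
	unsigned int margin;		/* slack allowed on top of the two above */
};

/* Acceptable unavailability: watchdog timeout + max boot time + margin. */
static unsigned int acceptable_outage(const struct outage_budget *b)
{
	assert(b);
	return b->watchdog_timeout + b->max_boot_time + b->margin;
}

/*
 * Worst-case outage under the assumed model: one full timeout can elapse
 * before the crash-path pet, and every extra pet restarts the countdown,
 * so a hang after the pet costs one more full timeout per pet.
 */
static unsigned int worst_case_outage(const struct outage_budget *b,
				      unsigned int extra_pets)
{
	assert(b);
	return b->watchdog_timeout * (1 + extra_pets) + b->max_boot_time;
}

/* True if the worst case still fits inside the acceptable outage. */
static bool within_budget(const struct outage_budget *b,
			  unsigned int extra_pets)
{
	return worst_case_outage(b, extra_pets) <= acceptable_outage(b);
}
```

Under this model a single extra pet stays within budget only when the
margin of error is at least as large as the watchdog timeout. With a
60 second timeout, a 120 second boot, and a 30 second margin, the budget
is 210 seconds, while one extra pet followed by a hang costs 240 seconds.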

You are more likely to cause a "how in the world did that happen?"
moment than something more extreme, but even invalidating people's
mental model can be a problem sometimes.

>> I expect most watchdog policies are not that strict, but this patch
>> would preclude using those that are.
>
> I would assume most of those users would not enable kdump and would not be
> affected.

I have seen it be the case that the goal is to record what went wrong
if there is time, but to get back into service in a timely manner
regardless.

>> And as is being discussed in another subthread, it does look like
>> changing the timeout and the interval should be enough all on its own.
>
> Probably and I will pursue that. Thanks for the suggestion.

Eric