[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <29127567.mTXaJOMILH@vostro.rjw.lan>
Date: Sun, 12 May 2013 02:39:44 +0200
From: "Rafael J. Wysocki" <rjw@...k.pl>
To: Colin Cross <ccross@...roid.com>
Cc: Zoran Markovic <zoran.markovic@...aro.org>,
lkml <linux-kernel@...r.kernel.org>,
Linux PM list <linux-pm@...r.kernel.org>,
Benoit Goby <benoit@...roid.com>,
Android Kernel Team <kernel-team@...roid.com>,
Todd Poynor <toddpoynor@...gle.com>,
San Mehat <san@...gle.com>,
John Stultz <john.stultz@...aro.org>,
Pavel Machek <pavel@....cz>, Len Brown <len.brown@...el.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume.
On Friday, May 10, 2013 11:13:27 PM Colin Cross wrote:
> On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic
> <zoran.markovic@...aro.org> wrote:
> > From: Benoit Goby <benoit@...roid.com>
> >
> > Below is a patch from android kernel that detects a driver suspend/resume
> > lockup and captures dump in the kernel log. Please review and provide
> > comments.
>
> This paragraph should go below the --- line so it doesn't end up in
> the final commit message.
>
> > Rather than hard-lock the kernel, dump the suspend/resume thread stack and
> > BUG() when a driver takes too long to suspend/resume.
And how exactly is that different from hanging the kernel?
> > The timeout is set to 12 seconds to be longer than the usbhid 10
> > second timeout.
Which is kind of arbitrary.
> > Exclude from the watchdog the time spent waiting for children that
> > are resumed asynchronously and time every device, whether or not they
> > resumed synchronously.
What about changing the code to use wait_for_completion_timeout() instead
of wait_for_completion() and actually trying to recover if one of them times
out? [It could try to unregister the device in question and if *that* hangs
indefinitely, *then* commit a panic() or something similar, but if it succeeds,
abort the ongoing suspend or complete the resume without that device.]
Of course, that would involve changing things not to depend on async, but
might be better than slapping a timer on top.
> > This patch is targeted for mobile devices where a suspend/resume lockup
> > could cause a system reboot and catch user's attention. Information
> > about failing device can later be retrieved from captured log in
> > subsequent boot session.
>
> I would take out the phrase "catch user's attention", the intention of
> this patch is actually the opposite - get the system back to working
> normally as fast as possible, while still putting enough information
> to debug the problem into the log.
>
> > The hardware watchdog timer is likely suspended during this time and
> > couldn't be relied upon. The soft-lockup detector would eventually tell
> > that tasks are not scheduled, but would provide little context as to why.
> > The patch hence uses system timer and assumes it is still active while the
> > devices are suspended/resumed.
I must say I'm not particularly impressed by this patch series. I understand
the motivation, but I'm wondering if that's the best we can do.
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists