Message-ID: <20071202211925.GA26414@one.firstfloor.org>
Date: Sun, 2 Dec 2007 22:19:25 +0100
From: Andi Kleen <andi@...stfloor.org>
To: Ingo Molnar <mingo@...e.hu>
Cc: Andi Kleen <andi@...stfloor.org>,
Arjan van de Ven <arjan@...radead.org>,
linux-kernel@...r.kernel.org,
Andrew Morton <akpm@...ux-foundation.org>,
Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [feature] automatically detect hung TASK_UNINTERRUPTIBLE tasks
On Sun, Dec 02, 2007 at 10:10:27PM +0100, Ingo Molnar wrote:
> what if you considered - just for a minute - the possibility of this
> debug tool being the thing that actually animates developers to fix such
> long delay bugs that have bothered users for almost a decade meanwhile?
Throwing frequent debugging messages for non-buggy cases will
just lead people to ignore softlockup reports in general.
I don't think runtime instrumentation is the way to introduce
TASK_KILLABLE in general. The only way to get there is for people to
go through the source and identify the places where it makes sense.
>
> Until now users had little direct recourse to get such problems fixed.
> (we had sysrq-t, but that included no real metric of how long a task was
Actually task delay accounting can measure this now. iirc someone
had a latencytop based on it already.
> blocked, so there was no direct link in the typical case and users had
> no real reliable tool to express their frustration about unreasonable
> delays.)
>
> Now this changes: they get a "smoking gun" backtrace reported by the
> kernel, and blamed on exactly the place that caused that unreasonable
> delay. And it's not like the kernel breaks - at most 10 such messages
> are reported per bootup.
>
> We increase the delay timeout to say 300 seconds, and if the system is
> under extremely high IO load then 120+ might be a reasonable delay, so
> it's all tunable and runtime disable-able anyway. So if you _know_ that
> you will see and tolerate such long delays, you can tweak it - but i can
This means the user has to see their kernel log fill up with such
messages at least once, do a round trip to some mailing list to
be told that it is expected and not a kernel bug, and then tweak
some obscure parameters. That doesn't seem like a particularly
fruitful procedure to me.
> tell you with 100% certainty that 99.9% of the typical Linux users do
> not characterize such long delays as "correct behavior".
It's about robustness, not the typical case.
Throwing backtraces whenever something slightly unusual happens
does not make for a robust system.
-Andi