linux-kernel - Re: [feature] automatically detect hung TASK

Message-ID: <20071202211925.GA26414@one.firstfloor.org>
Date:	Sun, 2 Dec 2007 22:19:25 +0100
From:	Andi Kleen <andi@...stfloor.org>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	Andi Kleen <andi@...stfloor.org>,
	Arjan van de Ven <arjan@...radead.org>,
	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>,
	Thomas Gleixner <tglx@...utronix.de>
Subject: Re: [feature] automatically detect hung TASK_UNINTERRUPTIBLE tasks

On Sun, Dec 02, 2007 at 10:10:27PM +0100, Ingo Molnar wrote:
> what if you considered - just for a minute - the possibility of this 
> debug tool being the thing that actually animates developers to fix such 
> long delay bugs that have bothered users for almost a decade meanwhile?

Throwing frequent debugging messages for non-buggy cases will
just lead to people generally ignoring softlockups.

I don't think runtime instrumentation is the way to introduce
TASK_KILLABLE in general. The only way to get there is for people to go
through the source and identify places where it makes sense.

> 
> Until now users had little direct recourse to get such problems fixed. 
> (we had sysrq-t, but that included no real metric of how long a task was 

Actually task delay accounting can measure this now.  iirc someone
had a latencytop based on it already.
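
As a rough illustration of the kind of number delay accounting exposes,
here is a minimal sketch that pulls the task state and the aggregated
block I/O delay out of one /proc/<pid>/stat line. It assumes the field
layout documented in proc(5) (field 3 is the state, field 42 is
delayacct_blkio_ticks) and a kernel built with delay accounting enabled.
The function name parse_proc_stat is hypothetical, and this is not the
taskstats interface a latencytop-style tool would necessarily use.

```python
def parse_proc_stat(line):
    """Return (pid, comm, state, blkio_ticks) from one /proc/<pid>/stat line.

    Assumes the proc(5) layout: field 3 is the task state and field 42 is
    delayacct_blkio_ticks (block I/O delay in clock ticks).
    """
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split at the LAST closing parenthesis.
    head, _, tail = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = tail.split()
    state = fields[0]              # field 3; 'D' means uninterruptible sleep
    blkio_ticks = int(fields[39])  # field 42 sits at index 42 - 3
    return int(pid_text), comm, state, blkio_ticks
```

Sampling this value twice for a task in state D and taking the difference
gives an estimate of how long it was blocked on block I/O in between,
without printing anything to the kernel log.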

> blocked, so there was no direct link in the typical case and users had 
> no real reliable tool to express their frustration about unreasonable 
> delays.)
> 
> Now this changes: they get a "smoking gun" backtrace reported by the 
> kernel, and blamed on exactly the place that caused that unreasonable 
> delay. And it's not like the kernel breaks - at most 10 such messages 
> are reported per bootup.
> 
> We increase the delay timeout to say 300 seconds, and if the system is 
> under extremely high IO load then 120+ might be a reasonable delay, so 
> it's all tunable and runtime disable-able anyway. So if you _know_ that 
> you will see and tolerate such long delays, you can tweak it - but i can 

This means the user has to see their kernel log fill with such
messages at least once - do a round trip to some mailing list to 
explain that it is expected and not a kernel bug - then tweak
some obscure parameters. Doesn't seem like a particularly fruitful
procedure to me.

> tell you with 100% certainty that 99.9% of the typical Linux users do 
> not characterize such long delays as "correct behavior".

It's about robustness, not the typical case.
Throwing backtraces when something slightly unusual happens does not
make for a robust system.

-Andi