linux-kernel - Re: softlockup: automatically detect hung TASK

Message-Id: <20080206171230.72a058ae.akpm@linux-foundation.org>
Date:	Wed, 6 Feb 2008 17:12:30 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Ingo Molnar <mingo@...e.hu>
Cc:	a.p.zijlstra@...llo.nl, linux-kernel@...r.kernel.org,
	ego@...ibm.com
Subject: Re: softlockup: automatically detect hung TASK_UNINTERRUPTIBLE
 tasks

On Thu, 7 Feb 2008 01:51:10 +0100
Ingo Molnar <mingo@...e.hu> wrote:

> 
> * Andrew Morton <akpm@...ux-foundation.org> wrote:
> 
> > Nope.
> > 
> > But I tested it on mainline, and mainline exhibits the 
> > never-powers-off symptom, whereas 
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 demonstrates the 
> > powers-off-after-20-seconds symptom.
> > 
> > So we _may_ be dealing with two bugs here, and your patch might have 
> > fixed the first, but that success is obscured by the second.  I guess 
> > I need to prepare a tree which has 
> > ed50d6cbc394cd0966469d3e249353c9dd1d38b9 at its tip.  (Wonders how to 
> > do that).
> 
> the way i do it in bisection is to do:
> 
>   mkdir patches
>   git-log -1 -p ed50d6cbc394cd0966469d3 > patches/fix.patch
>   echo fix.patch > patches/series
> 
> and then before testing a bisection point, i do a 'quilt push'. Before 
> telling git-bisect about the quality of that bisection point (good/bad) 
> i pop it off via 'quilt pop'.
> 
> this way the 'required fix' can be kept during the bisection, to find 
> the secondary bug.
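
A minimal sketch of the push/test/pop cycle described above, wrapped as two
shell helpers. The function names and the QUILT/GIT override variables are
hypothetical and not part of the thread; they only exist so the sequence can
be dry-run. It assumes the patches/ directory and series file were created as
quoted, and that a git-bisect session is already in progress. With a current
git, the dashed `git-log` used above is spelled `git log`.

```shell
#!/bin/sh
# Sketch only: QUILT and GIT are hypothetical overrides for dry runs.
QUILT=${QUILT:-quilt}
GIT=${GIT:-git}

# Before building and booting a bisection point: apply the required fix
# from patches/series on top of the commit git-bisect checked out.
bisect_prepare() {
    $QUILT push
}

# After testing: pop the fix so the tree is clean again, then tell
# git-bisect whether this point was good or bad.
bisect_report() {
    case "$1" in
        good|bad) ;;
        *) echo "usage: bisect_report good|bad" >&2; return 2 ;;
    esac
    $QUILT pop && $GIT bisect "$1"
}
```

Typical use would be `bisect_prepare`, then build, boot and try the poweroff,
then `bisect_report good` or `bisect_report bad`. Popping before reporting
matters because git-bisect checks out the next commit and expects an
unmodified tree.
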
> 
> > btw, mainline (plus this patch, not that it changed anything) prints
> > 
> > <stopping disk stuff>
> > Disabling non-boot CPUs
> > CPU 1 is now offline
> > 
> > and that's it.   This machine has eight cpus.  Might be a hint?
> 
> what should be the proper message?

Seems that it should be a stream of eight

CPU n is now offline
CPU n down

> my suspects, besides there being something wrong in the hung-tasks code 
> of the softlockup watchdog, would be the cpu-hotplug commits, or some 
> arch/x86 commit. (although we didn't really have anything specifically 
> touching the reboot path)
> 
> does a stupid patch like the one below tell you more about what the 
> other CPUs are doing during this hang? [32-bit only patch]
> 
> 	Ingo
> 
> ---
>  arch/i386/kernel/nmi.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> Index: linux/arch/i386/kernel/nmi.c
> ===================================================================
> --- linux.orig/arch/x86/kernel/nmi_64.c
> +++ linux/arch/x86/kernel/nmi_64.c
> @@ -331,6 +331,14 @@ __kprobes int nmi_watchdog_tick(struct p
>  	int touched = 0;
>  	int cpu = smp_processor_id();
>  	int rc=0;
> +	static int count[NR_CPUS];
> +
> +	if (!count[cpu]) {
> +		count[cpu] = nmi_hz;
> +		printk("CPU#%d, tick\n", cpu);
> +		show_regs(regs);
> +	}
> +	count[cpu]--;
>  
>  	/* check for other users first */
>  	if (notify_die(DIE_NMI, "nmi", regs, reason, 2, SIGINT)

I reworked that on top of ed50d6cbc394cd0966469d3e249353c9dd1d38b9: no
change.

However, this time I watched the VGA console (nothing is coming over
netconsole at this stage) and saw this:


CPU 1 is now offline
<10 second pause>
CPU 1 is down
CPU 2 is now offline
CPU 2 is down
CPU 3 is now offline
CPU 3 is down
CPU 4 is now offline
<10 second pause>

followed by a quick spew of the remaining CPUs going offline and down, then
poweroff.
