Message-Id: <20061209230746.7e33b40f.akpm@osdl.org>
Date:	Sat, 9 Dec 2006 23:07:46 -0800
From:	Andrew Morton <akpm@...l.org>
To:	seven <horia.muntean@...il.com>
Cc:	linux-kernel@...r.kernel.org
Subject: Re: Temporary random kernel hang

> On Fri, 8 Dec 2006 02:38:45 -0800 (PST) seven <horia.muntean@...il.com> wrote:
> 
> Hello,
> 
> I'm having some trouble with a multithreaded Java network server running
> on SLES10.  At random times the kernel takes ~80% of the CPU, leaving idle
> at 0%, for about 30 seconds.  After this period the system returns to its
> normal operating state.
> 
> Below is a vmstat -a 3 recording that shows the problem:
> 
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
>  1  0      0 773068 529184 693048    0    0     0     0  272  201  0  0 100  0  0
>  0  0      0 773068 529184 693064    0    0     0    25  317  334  1  0 99  1  0
>  0  0      0 772944 529216 693248    0    0     0    24  477 1017  3  0 96  0  0
>  0  0      0 772820 529256 693316    0    0     0     0  525 1376  4  1 95  0  0
>  0  0      0 772448 529344 693636    0    0     0   107 1098 3306 11  2 86  0  0
>  0  0      0 772324 529404 693456    0    0     0     0  723 2247  7  2 91  0  0
>  0  0      0 772076 529496 693656    0    0     0   132  770 2488  7  2 91  1  0
>  0  0      0 772200 529528 693608    0    0     0    91  528 1168  4  1 94  1  0
>  0  0      0 772200 529532 693728    0    0     0     0  334  387  1  0 99  0  0
>  0  0      0 772076 529568 693680    0    0     0    24  564 1250  4  1 95  0  0
>  0  0      0 771828 529636 693784    0    0     0     0  787 2144  7  2 91  0  0
>  0  0      0 771580 529744 694232    0    0     0   111  995 3081 11  2 86  1  0
> 107  0      0 771316 529792 694904    0    0     0   153  829 1650 12 37 51  0  0
> 113  0      0 771316 529792 694912    0    0     0     0  323  169 15 85  0  0  0
> 116  0      0 771216 529792 694728    0    0     0    25  292  190 14 86  0  0  0
> 122  0      0 771340 529792 694728    0    0     0    21  311  191 15 85  0  0  0
> 138  0      0 771464 529792 694728    0    0     0     0  365  196 14 86  0  0  0
> 146  0      0 771464 529792 694728    0    0     0     0  331  189 16 84  0  0  0
> 150  0      0 771472 529792 694728    0    0     0     0  336  183 15 85  0  0  0
> 146  0      0 771472 529792 694728    0    0     0     4  310  201 14 86  0  0  0
> 145  0      0 771472 529792 694728    0    0     0     0  285  163 15 85  0  0  0
> procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
>  r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa st
> 146  0      0 771472 529792 694728    0    0     0     0  277  159 14 86  0  0  0
> 145  0      0 771472 529792 694728    0    0     0    32  275  133 15 85  0  0  0
>  0  0      0 771208 529892 694176    0    0     0     0 1012 3408 12  4 84  0  0
>  0  0      0 770712 529972 694488    0    0     0   149  774 2869  8  2 90  0  0
>  0  0      0 770712 529972 694488    0    0     0     0  271  195  0  0 100  0  0
>  0  0      0 770728 529972 694488    0    0     0    35  269  167  0  0 100  1  0
>  0  0      0 770728 529972 694488    0    0     0     7  269  189  0  0 100  0  0
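> 
> For matching an episode against the application's own logs, the same
> vmstat run can be timestamped line by line (a minimal sketch; the log
> file name is arbitrary):
> 
> 	vmstat -a 3 | while read line; do
> 		echo "$(date '+%H:%M:%S') $line"
> 	done > /tmp/vmstat-stamped.log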
> 
> The application's memory usage is stable (no leaks), and a deadlock is out
> of the question, since a deadlock would freeze the system permanently, not
> just temporarily.  There are around 200-250 TCP/IP clients connected to
> the application and around 550 threads (blocking stream sockets are used,
> so every client is handled by one reading thread and one writing thread).
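> 
> The thread count itself is easy to confirm from /proc (a quick sketch,
> assuming a single java process):
> 
> 	grep Threads /proc/$(pidof java)/status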
> 
> The same application works fine on SLES 9.3.
> 
> Hanging environment:
> -----------------------------------------------------------------------------
> mustang:~ # uname -a
> Linux mustang 2.6.16.21-0.25-smp #1 SMP Tue Sep 19 07:26:15 UTC 2006 x86_64
> x86_64 x86_64 GNU/Linux
> mustang:~ # java -version
> java version "1.6.0-rc"
> Java(TM) SE Runtime Environment (build 1.6.0-rc-b104)
> Java HotSpot(TM) Server VM (build 1.6.0-rc-b104, mixed mode)
> mustang:~ # cat /etc/SuSE-release
> SUSE Linux Enterprise Server 10 (x86_64)
> VERSION = 10
> -----------------------------------------------------------------------------
> 
> Working environment:
> -----------------------------------------------------------------------------
> apollo:~ # uname -a
> Linux apollo 2.6.5-7.252-smp #1 SMP Tue Feb 14 11:11:04 UTC 2006 x86_64
> x86_64 x86_64 GNU/Linux
> apollo:~ # java -version
> java version "1.6.0-rc"
> Java(TM) SE Runtime Environment (build 1.6.0-rc-b95)
> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0-rc-b95, mixed mode)
> apollo:~ # cat /etc/SuSE-release
> SUSE LINUX Enterprise Server 9 (x86_64)
> VERSION = 9
> PATCHLEVEL = 3
> -----------------------------------------------------------------------------
> 
> Can you give me some pointers about where to start debugging this issue?
> 

A kernel profile will tell us where the kernel is burning CPU.  Something
like this (run as root):

#!/bin/sh
# Take a 5-second kernel profile, print it, and repeat until interrupted.
while true
do
	# Tear down any previous oprofile session and discard old samples.
	opcontrol --stop
	opcontrol --shutdown
	rm -rf /var/lib/oprofile
	# Point oprofile at the running kernel's image and start sampling.
	opcontrol --vmlinux=/boot/vmlinux-$(uname -r)
	opcontrol --start-daemon
	opcontrol --start
	# Timestamp this sampling window so it can be matched to a hang.
	date
	sleep 5
	opcontrol --stop
	opcontrol --shutdown
	# Show the 50 hottest kernel symbols for this window.
	opreport -l /boot/vmlinux-$(uname -r) | head -50
done | tee /tmp/oprofile-output
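
Assuming the script is saved as, say, /root/profile-loop.sh (the name is
arbitrary), it can be left running in the background until the hang
strikes; everything of interest ends up in /tmp/oprofile-output anyway:

	chmod +x /root/profile-loop.sh
	nohup /root/profile-loop.sh >/dev/null 2>&1 &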

Then locate the output record which correlates with one of these episodes.
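
Each report in the log is preceded by the date stamp the loop prints, so
the block covering an episode can be pulled out with something like this
(the timestamp below is only an example):

	grep -A 55 '23:07' /tmp/oprofile-output

The kernel symbols at the top of that report should show where the time
is going.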
