Message-Id: <200708031903.10063.jimis@gmx.net>
Date: Fri, 3 Aug 2007 18:03:09 +0200
From: Dimitrios Apostolou <jimis@....net>
To: linux-kernel@...r.kernel.org
Subject: high system cpu load during intense disk i/o
Hello list,
I have a P3 system with 256MB RAM and 3 IDE disks attached: two identical
ones as hda and hdc (primary and secondary master), and the disc with
the OS partitions as primary slave hdb. For more info please refer to
the attached dmesg.txt. I also attach several oprofile outputs that describe
the various circumstances referenced later; the script I used to collect
them is the attached script.sh.
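For completeness, the script boils down to something like the following (a
simplified sketch assuming the legacy opcontrol interface; the attached
script.sh is authoritative and may differ in its details):

  opcontrol --init            # load the oprofile kernel module
  opcontrol --no-vmlinux      # don't resolve kernel symbols against vmlinux
  opcontrol --reset           # clear any old samples
  opcontrol --start
  sleep 5                     # nominal 5-second sampling window
  opcontrol --dump
  opcontrol --shutdown
  opreport -l > output.txt    # symbol-level report (file name is a placeholder)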
The problem appeared when I started two processes doing heavy I/O
on hda and hdc: "badblocks -v -w /dev/hda" and "badblocks -v -w
/dev/hdc". At the beginning (two_discs.txt) everything was fine and
vmstat reported more than 90% iowait CPU load. However, after a while,
when some cron jobs kicked in (two_discs_bad.txt), the picture changed
completely: the CPU load was now about 60% system, and the rest was user
CPU load, presumably going to the simple cron jobs.
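In other words, the reproduction amounts to something like this (careful,
badblocks -w is a destructive write test; the vmstat interval is arbitrary):

  badblocks -v -w /dev/hda &
  badblocks -v -w /dev/hdc &
  vmstat 1      # watch iowait collapse into system time once the cron jobs start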
Even though under normal circumstances (for example when running
badblocks on only one disc (one_disc.txt)) the cron jobs finish almost
instantaneously, this time they simply never ended, and every 10
minutes or so more and more jobs were added to the process table.
One day later, vmstat was still reporting about 60/40 system/user CPU load,
all those processes were still running (hundreds of them), and the load
average was huge!
Another day later, the OOM killer kicked in and killed various processes,
but it never touched any badblocks process. Indeed, manually suspending
one badblocks process remedies the situation: within a few seconds the
process table is cleared of cron jobs, CPU usage goes back to 2-3% user
and ~90% iowait, and the system is responsive again. This happens no
matter which badblocks process I suspend, the one on hda or the one on hdc.
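Concretely, suspending one of them from another shell is enough, e.g.
(the <pid> is whichever of the two pgrep reports):

  pgrep -l badblocks     # shows the two badblocks PIDs
  kill -STOP <pid>       # suspend either one; the system recovers within seconds
  kill -CONT <pid>       # resume it later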
Any ideas about what could be wrong? I should note that the kernel is my
distro's default. As the problem seems to be scheduler-specific I didn't
bother to compile a vanilla kernel, since the applied patches seem
irrelevant:
http://archlinux.org/packages/4197/
http://cvs.archlinux.org/cgi-bin/viewcvs.cgi/kernels/kernel26/?cvsroot=Current&only_with_tag=CURRENT
Thanks in advance,
Dimitris
P.S.1: Please CC me directly as I'm not subscribed
P.S.2: Keep in mind that the problematic oprofile outputs probably cover
a much longer period than 5 seconds, since due to the high load the script
took a long time to complete.
P.S.3: I couldn't find anywhere in the kernel documentation that setting
nmi_watchdog=0 is necessary for oprofile to work correctly. However,
Documentation/nmi_watchdog.txt mentions that oprofile should disable the
nmi_watchdog automatically, which doesn't happen with the latest kernel.
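For reference, I disable it with the boot parameter; on kernels that expose
the sysctl it can presumably also be toggled at runtime:

  # append to the kernel command line in the bootloader:
  #   nmi_watchdog=0
  # or, where /proc/sys/kernel/nmi_watchdog exists:
  echo 0 > /proc/sys/kernel/nmi_watchdog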
View attachment "dmesg.txt" of type "text/plain" (10767 bytes)
Download attachment "script.sh" of type "application/x-shellscript" (243 bytes)
View attachment "two_discs.txt" of type "text/plain" (15250 bytes)
View attachment "two_discs_bad.txt" of type "text/plain" (25561 bytes)
View attachment "one_disc.txt" of type "text/plain" (12013 bytes)
View attachment "idle.txt" of type "text/plain" (9967 bytes)