Message-ID: <14e201cd5ee9$c1b57ce0$452076a0$@com>
Date:	Tue, 10 Jul 2012 15:17:12 -0700
From:	"Alec Matusis" <alecm@...tango.com>
To:	<linux-kernel@...r.kernel.org>
Subject: bogus utime and stime in /proc/<PID>/stat - possibly related to fs/proc/array.c change

I run a number of Linux servers and have noticed an interesting bug,
possibly related to a recent change in fs/proc/array.c.

After upgrading the Ubuntu kernel from 2.6.24-26 to 2.6.32-40 (and higher),
I noticed that about once per month the user process causing the main load
on a given machine suddenly disappears from "top", although it continues to
run normally (perhaps with a slight performance decrease). After this, the
load average of the system remains the same, but top shows no running
processes that account for the load. This has happened on a variety of new
IBM System X machines, all running different tasks (httpd 2.2, mysqld 5.1,
Twisted Python TCP servers).

I looked at a problematic process, and discovered that ps -o pcpu showed
crazily large numbers:

# ps -o pcpu,pid,cmd -p1587
%CPU   PID CMD
317713124 1587 /nail/encap/mysql-5.1.60/libexec/mysqld

Then I looked at: 

# cat /proc/1587/stat
 1587 (mysqld) S 1212 1088 1088 0 -1 4202752 14307313 0 162 0 85773299069
4611685932654088833 0 0 20 0 52 0 3549 27255418880 5483524
18446744073709551615 4194304 11111617 140733749236976 140733749235984
8858659 0 552967 4102 26345 18446744073709551615 0 0 17 5 0 0 0 0

I noticed that the 14th and 15th entries (utime and stime), 85773299069 and
4611685932654088833, had become abnormally large and were stuck: they no
longer advanced. When the server is in its normal state (i.e. the
load-causing process shows up in top, and ps -o pcpu shows a reasonable
%CPU), these numbers are far smaller (about 5 orders of magnitude for utime
and 13 for stime), e.g. 416786 and 602262, and they advance by about 10 per
second.
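
For anyone who wants to check their own machines, here is a minimal Python
sketch of how fields 14 and 15 can be pulled out of a /proc/<PID>/stat line
and sanity-checked. The function names, the 100 ticks-per-second default and
the uptime-based threshold are assumptions made for illustration; they are
not taken from procps or from the kernel.

```python
# Sample line from this message, re-joined onto a single line.
SAMPLE = (
    "1587 (mysqld) S 1212 1088 1088 0 -1 4202752 14307313 0 162 0 "
    "85773299069 4611685932654088833 0 0 20 0 52 0 3549 27255418880 "
    "5483524 18446744073709551615 4194304 11111617 140733749236976 "
    "140733749235984 8858659 0 552967 4102 26345 18446744073709551615 "
    "0 0 17 5 0 0 0 0"
)


def parse_cpu_times(stat_line):
    """Return (utime, stime) in clock ticks from a /proc/<PID>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces, so cut at the last ')' before splitting on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field N lives at rest[N - 3].
    return int(rest[14 - 3]), int(rest[15 - 3])


def looks_bogus(ticks, uptime_seconds, ncpus, hz=100):
    """True if `ticks` exceeds what `ncpus` CPUs could accumulate since boot.

    hz=100 is an assumed tick rate; the real value can be read on the
    machine itself with `getconf CLK_TCK`.
    """
    return ticks > uptime_seconds * ncpus * hz


if __name__ == "__main__":
    utime, stime = parse_cpu_times(SAMPLE)
    print("utime ticks:", utime)
    print("stime ticks:", stime)
    # Deliberately generous bound: one year of uptime on 64 CPUs.
    print("stime bogus:", looks_bogus(stime, 365 * 86400, 64))
```

At the assumed 100 ticks per second, the stime value above would amount to
more than a billion years of CPU time, so any uptime-based bound rejects it.
As a side observation, utime and stime in this sample sum to exactly
2**62 - 2, which might point to an arithmetic wraparound rather than random
corruption; that is only a guess from the numbers.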

I do not understand what causes this problem, except that I know that
machines running 2.6.24-26 or earlier do not show this behavior, and that
fs/proc/array.c has changed since then.

I wrote this up in detail at
http://serverfault.com/questions/406489/load-causing-processes-disappearing-from-top-ps-o-pcpu-shows-bogus-numbers

If you have any comments on this, they would be highly appreciated.

Thank you.


Alec Matusis


