Message-ID: <4DE3D5A8.8060707@simplicitymedialtd.co.uk>
Date: Mon, 30 May 2011 18:36:40 +0100
From: "Cal Leeming [Simplicity Media Ltd]"
<cal.leeming@...plicitymedialtd.co.uk>
To: linux-kernel@...r.kernel.org,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
"kosaki.motohiro@...fujitsu.com" <kosaki.motohiro@...fujitsu.com>,
"minchan.kim@...il.com" <minchan.kim@...il.com>,
"rientjes@...gle.com" <rientjes@...gle.com>,
kamezawa.hiroyu@...fujitsu.com
Subject: Re: cgroup OOM killer loop causes system to lockup (possible fix
included)
Here is an strace of the SSH process (which is somehow passing the
-17 oom_adj on to all forked user instances):
(broken server - with bnx2 module loaded)
[pid 2200] [ 7f13a09c9cb0] open("/proc/self/oom_adj",
O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
[pid 2120] [ 7f13a09c9f00] write(7, "\0\0\2\240\n\n\n\nPort
22\n\n\n\nProtocol 2\n\nH"..., 680 <unfinished ...>
[pid 2200] [ 7f13a09c9cb0] <... open resumed> ) = 9
[pid 2120] [ 7f13a09c9f00] <... write resumed> ) = 680
[pid 2120] [ 7f13a09c9e40] close(7 <unfinished ...>
[pid 2200] [ 7f13a09c9844] fstat(9, <unfinished ...>
[pid 2120] [ 7f13a09c9e40] <... close resumed> ) = 0
[pid 2200] [ 7f13a09c9844] <... fstat resumed>
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 2120] [ 7f13a09c9e40] close(8 <unfinished ...>
[pid 2200] [ 7f13a09d2a2a] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 2120] [ 7f13a09c9e40] <... close resumed> ) = 0
[pid 2200] [ 7f13a09d2a2a] <... mmap resumed> ) = 0x7f13a25a6000
[pid 2120] [ 7f13a09c9e40] close(4 <unfinished ...>
[pid 2200] [ 7f13a09c9f00] write(9, "-17\n", 4 <unfinished ...>
(working server - with bnx2 module unloaded)
[pid 1323] [ 7fae577fbe40] close(7) = 0
[pid 1631] [ 7fae577fbcb0] open("/proc/self/oom_adj",
O_WRONLY|O_CREAT|O_TRUNC, 0666 <unfinished ...>
[pid 1323] [ 7fae577fbf00] write(8, "\0\0\2\217\0", 5 <unfinished ...>
[pid 1631] [ 7fae577fbcb0] <... open resumed> ) = 10
[pid 1323] [ 7fae577fbf00] <... write resumed> ) = 5
[pid 1323] [ 7fae577fbf00] write(8, "\0\0\2\206\n\n\n\nPort
22\n\n\n\nProtocol 2\n\nH"..., 654 <unfinished ...>
[pid 1631] [ 7fae577fb844] fstat(10, <unfinished ...>
[pid 1323] [ 7fae577fbf00] <... write resumed> ) = 654
[pid 1631] [ 7fae577fb844] <... fstat resumed>
{st_mode=S_IFREG|0644, st_size=0, ...}) = 0
[pid 1323] [ 7fae577fbe40] close(8) = 0
[pid 1631] [ 7fae57804a2a] mmap(NULL, 4096, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 1323] [ 7fae577fbe40] close(9 <unfinished ...>
[pid 1631] [ 7fae57804a2a] <... mmap resumed> ) = 0x7fae593d9000
[pid 1323] [ 7fae577fbe40] <... close resumed> ) = 0
[pid 1323] [ 7fae577fbe40] close(5 <unfinished ...>
[pid 1631] [ 7fae577fbf00] write(10, "0\n", 2 <unfinished ...>
The two servers are *EXACT* duplicates of each other, completely fresh
Debian installs, with exactly the same packages installed.
As you can see, the working server writes "0" into oom_adj, while the
broken one writes "-17".
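For anyone wanting to reproduce the inheritance side of this: the oom_adj
value is per-process and is copied to children across fork()/exec(), which
is why every forked SSH session ends up with whatever sshd itself was given.
A minimal sketch, assuming a Linux /proc; it uses oom_score_adj (the modern
replacement for oom_adj) and a positive value, since lowering the score
needs CAP_SYS_RESOURCE:

```shell
#!/bin/sh
# Demonstrate that the OOM-score adjustment is inherited by children.
# Writing a *positive* value needs no privilege, unlike -17 / -1000.
echo 5 > /proc/self/oom_score_adj
# A freshly forked+exec'd child sees the parent's value:
sh -c 'cat /proc/self/oom_score_adj'
```

On an unaffected system the child prints back the 5 written by the parent,
which is exactly the mechanism by which a -17 set on sshd propagates to
every user session.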
On 30/05/2011 15:27, Cal Leeming [Simplicity Media Ltd] wrote:
> I FOUND THE PROBLEM!!!
>
> Explicit details can be found on the Debian kernel mailing list, but
> to cut short, it's caused by the firmware-bnx2 kernel module:
>
> The broken server uses 'firmware-bnx2'.. so I purged the bnx2 package,
> removed the bnx*.ko files from /lib/modules, ran update-initramfs, and
> then rebooted (i then confirmed it was removed by checking ifconfig
> and lsmod).
>
> And guess what.. IT WORKED.
>
> So, this problem seems to be caused by the firmware-bnx2 module being
> loaded.. somehow, that module is causing a -17 oom_adj to be set for
> everything..
>
> WTF?!?! Surely a bug?? Could someone please forward this to the
> appropriate person for the bnx2 kernel module, as I wouldn't even know
> where to begin :S
>
> Cal
>
> On 30/05/2011 11:52, Cal Leeming [Simplicity Media Ltd] wrote:
>> -resent due to incorrect formatting, sorry if this dups!
>>
>> @Kame
>> Thanks for the reply!
>> Both kernels used the same env/dist, but with slightly different
>> packages.
>> After many frustrating hours, I have narrowed this down to a dodgy
>> Debian package which appears to continue affecting the system, even
>> after purging. I have yet to pinpoint the exact package (I'm doing
>> several reinstall tests, along with tripwire analysis after each
>> reboot).
>>
>> @Hiroyuki
>> Thank you for sending this to the right people!
>>
>> @linux-mm
>> On a side note, would someone mind taking a few minutes to give a
>> brief explanation of how the default oom_adj is set, and under what
>> conditions it is given -17 by default? Is this defined by the
>> application? I looked through the kernel source, and noticed some of
>> the code defaults oom_adj to OOM_DISABLE (which is defined in the
>> headers as -17).
>>
>> Assuming the Debian problem is resolved, this might be another call
>> for the oom-killer to be modified so that if it encounters this
>> unrecoverable loop, it ignores the -17 rule (with some exceptions,
>> such as kernel processes and other critical things). If this is going
>> to be a relatively simple task, I wouldn't mind spending a few hours
>> patching it.
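Until the root cause is fixed, a userspace sweep is one possible stopgap.
This is a sketch only, not something from this thread: on the affected
2.6.32 kernels the file is /proc/<pid>/oom_adj with -17 as the disable
value (current kernels keep oom_adj as a deprecated alias that maps onto
oom_score_adj):

```shell
#!/bin/sh
# List every process that is currently exempt from the OOM killer
# (oom_adj == -17); pass "reset" as $1 to also write 0 back (needs root).
for f in /proc/[0-9]*/oom_adj; do
    [ -r "$f" ] || continue
    [ "$(cat "$f" 2>/dev/null)" = "-17" ] || continue
    pid=${f#/proc/}; pid=${pid%/oom_adj}
    # second field of /proc/<pid>/stat is the (comm) name
    printf '%s\t%s\n' "$pid" "$(awk '{print $2}' "/proc/$pid/stat" 2>/dev/null)"
    if [ "$1" = "reset" ]; then echo 0 > "$f"; fi
done
```

Run without arguments it only reports; a cron job in "reset" mode would at
least stop the -17 values from accumulating while the real bug is chased.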
>>
>> Cal
>>
>> On Mon, May 30, 2011 at 3:23 AM, KAMEZAWA Hiroyuki
>> <kamezawa.hiroyu@...fujitsu.com> wrote:
>>> Thank you. memory cgroup and OOM troubles are handled in linux-mm.
>>>
>>> On Sun, 29 May 2011 23:24:07 +0100
>>> "Cal Leeming [Simplicity Media
>>> Ltd]"<cal.leeming@...plicitymedialtd.co.uk> wrote:
>>>
>>>> Some further logs:
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.369927] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.369939]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.399285] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.399296]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.428690] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.428702]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.487696] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.487708]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.517023] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.517035]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.546379] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:38 vicky kernel: [ 2283.546391]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.310789] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.310804]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.369918] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.369930]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.399284] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.399296]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.433634] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.433648]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.463947] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.463959]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.493439] redis-server
>>>> invoked oom-killer: gfp_mask=0xd0, order=0, oom_adj=-17
>>>> ./log/syslog:May 30 07:44:43 vicky kernel: [ 2288.493451]
>>>> [<ffffffff810b12b7>] ? oom_kill_process+0x82/0x283
>>>>
>>>>
>>> hmm, in short, applications have a -17 oom_adj by default with 2.6.32.41 ?
>>> AFAIK, no kernel has such crazy settings as default..
>>>
>>> Do your 2 kernels use the same environment/distribution ?
>>>
>>> Thanks,
>>> -Kame
>>>
>>>> On 29/05/2011 22:50, Cal Leeming [Simplicity Media Ltd] wrote:
>>>>> First of all, my apologies if I have submitted this problem to the
>>>>> wrong place, spent 20 minutes trying to figure out where it needs to
>>>>> be sent, and was still none the wiser.
>>>>>
>>>>> The problem is related to applying memory limitations within a
>>>>> cgroup.
>>>>> If the OOM killer kicks in, it repeatedly tries to kill a process
>>>>> which has an oom_adj of -17. Since such processes are exempt from
>>>>> being killed, this becomes an infinite loop, which in turn locks up
>>>>> the system.
>>>>>
>>>>> May 30 03:13:08 vicky kernel: [ 1578.117055] Memory cgroup out of
>>>>> memory: kill process 6016 (java) score 0 or a child
>>>>> May 30 03:13:08 vicky kernel: [ 1578.117154] Memory cgroup out of
>>>>> memory: kill process 6016 (java) score 0 or a child
>>>>> May 30 03:13:08 vicky kernel: [ 1578.117248] Memory cgroup out of
>>>>> memory: kill process 6016 (java) score 0 or a child
>>>>> May 30 03:13:08 vicky kernel: [ 1578.117343] Memory cgroup out of
>>>>> memory: kill process 6016 (java) score 0 or a child
>>>>> May 30 03:13:08 vicky kernel: [ 1578.117441] Memory cgroup out of
>>>>> memory: kill process 6016 (java) score 0 or a child
>>>>>
>>>>>
>>>>> root@...ky [/home/foxx]> uname -a
>>>>> Linux vicky 2.6.32.41-grsec #3 SMP Mon May 30 02:34:43 BST 2011
>>>>> x86_64
>>>>> GNU/Linux
>>>>> (this happens on both the grsec-patched and non-patched 2.6.32.41
>>>>> kernels)
>>>>>
>>>>> When this is encountered, the memory usage across the whole server is
>>>>> still within limits (not even hitting swap).
>>>>>
>>>>> The memory configuration for the cgroup/lxc is:
>>>>> lxc.cgroup.memory.limit_in_bytes = 3000M
>>>>> lxc.cgroup.memory.memsw.limit_in_bytes = 3128M
>>>>>
>>>>> Now, what is even more strange is that when running under the
>>>>> 2.6.32.28 kernel (both patched and unpatched), this problem doesn't
>>>>> happen. However, there is a slight difference between the two
>>>>> kernels: 2.6.32.28 gives a default of 0 in /proc/X/oom_adj,
>>>>> whereas 2.6.32.41 gives a default of -17. I suspect this is the
>>>>> root cause of why it shows up in the later kernel, but not the
>>>>> earlier one.
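The quickest check of what default a given kernel hands out is to spawn a
throwaway process and read its value back (a sketch, assuming a Linux
/proc; on an unaffected system this prints 0):

```shell
#!/bin/sh
# Spawn a background process and read the oom_adj it was started with.
sleep 2 &
cat /proc/$!/oom_adj
wait
```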
>>>>>
>>>>> To test this theory, I started up the lxc on both servers, and then
>>>>> ran a one liner which showed me all the processes with an oom_adj
>>>>> of -17:
>>>>>
>>>>> (the below is the older/working kernel)
>>>>> root@...rtney.internal [/mnt/encstore/lxc]> uname -a
>>>>> Linux courtney.internal 2.6.32.28-grsec #3 SMP Fri Feb 18 16:09:07
>>>>> GMT
>>>>> 2011 x86_64 GNU/Linux
>>>>> root@...rtney.internal [/mnt/encstore/lxc]> for x in `find /proc
>>>>> -iname 'oom_adj' | xargs grep "\-17" | awk -F '/' '{print $3}'` ; do
>>>>> ps -p $x --no-headers ; done
>>>>> grep: /proc/1411/task/1411/oom_adj: No such file or directory
>>>>> grep: /proc/1411/oom_adj: No such file or directory
>>>>> 804 ? 00:00:00 udevd
>>>>> 804 ? 00:00:00 udevd
>>>>> 25536 ? 00:00:00 sshd
>>>>> 25536 ? 00:00:00 sshd
>>>>> 31861 ? 00:00:00 sshd
>>>>> 31861 ? 00:00:00 sshd
>>>>> 32173 ? 00:00:00 udevd
>>>>> 32173 ? 00:00:00 udevd
>>>>> 32174 ? 00:00:00 udevd
>>>>> 32174 ? 00:00:00 udevd
>>>>>
>>>>> (the below is the newer/broken kernel)
>>>>> root@...ky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> uname -a
>>>>> Linux vicky 2.6.32.41-grsec #3 SMP Mon May 30 02:34:43 BST 2011
>>>>> x86_64
>>>>> GNU/Linux
>>>>> root@...ky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> for x in
>>>>> `find /proc -iname 'oom_adj' | xargs grep "\-17" | awk -F '/'
>>>>> '{print
>>>>> $3}'` ; do ps -p $x --no-headers ; done
>>>>> grep: /proc/3118/task/3118/oom_adj: No such file or directory
>>>>> grep: /proc/3118/oom_adj: No such file or directory
>>>>> 895 ? 00:00:00 udevd
>>>>> 895 ? 00:00:00 udevd
>>>>> 1091 ? 00:00:00 udevd
>>>>> 1091 ? 00:00:00 udevd
>>>>> 1092 ? 00:00:00 udevd
>>>>> 1092 ? 00:00:00 udevd
>>>>> 2596 ? 00:00:00 sshd
>>>>> 2596 ? 00:00:00 sshd
>>>>> 2608 ? 00:00:00 sshd
>>>>> 2608 ? 00:00:00 sshd
>>>>> 2613 ? 00:00:00 sshd
>>>>> 2613 ? 00:00:00 sshd
>>>>> 2614 pts/0 00:00:00 bash
>>>>> 2614 pts/0 00:00:00 bash
>>>>> 2620 pts/0 00:00:00 sudo
>>>>> 2620 pts/0 00:00:00 sudo
>>>>> 2621 pts/0 00:00:00 su
>>>>> 2621 pts/0 00:00:00 su
>>>>> 2622 pts/0 00:00:00 bash
>>>>> 2622 pts/0 00:00:00 bash
>>>>> 2685 ? 00:00:00 lxc-start
>>>>> 2685 ? 00:00:00 lxc-start
>>>>> 2699 ? 00:00:00 init
>>>>> 2699 ? 00:00:00 init
>>>>> 2939 ? 00:00:00 rc
>>>>> 2939 ? 00:00:00 rc
>>>>> 2942 ? 00:00:00 startpar
>>>>> 2942 ? 00:00:00 startpar
>>>>> 2964 ? 00:00:00 rsyslogd
>>>>> 2964 ? 00:00:00 rsyslogd
>>>>> 2964 ? 00:00:00 rsyslogd
>>>>> 2964 ? 00:00:00 rsyslogd
>>>>> 2980 ? 00:00:00 startpar
>>>>> 2980 ? 00:00:00 startpar
>>>>> 2981 ? 00:00:00 ctlscript.sh
>>>>> 2981 ? 00:00:00 ctlscript.sh
>>>>> 3016 ? 00:00:00 cron
>>>>> 3016 ? 00:00:00 cron
>>>>> 3025 ? 00:00:00 mysqld_safe
>>>>> 3025 ? 00:00:00 mysqld_safe
>>>>> 3032 ? 00:00:00 sshd
>>>>> 3032 ? 00:00:00 sshd
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3097 ? 00:00:00 mysqld.bin
>>>>> 3113 ? 00:00:00 ctl.sh
>>>>> 3113 ? 00:00:00 ctl.sh
>>>>> 3115 ? 00:00:00 sleep
>>>>> 3115 ? 00:00:00 sleep
>>>>> 3116 ? 00:00:00 .memcached.bin
>>>>> 3116 ? 00:00:00 .memcached.bin
>>>>>
>>>>>
>>>>> As you can see, it is clear that the newer kernel is setting -17 by
>>>>> default, which in turn is causing the OOM killer loop.
>>>>>
>>>>> So I began to try and find what may have caused this problem by
>>>>> comparing the two sources...
>>>>>
>>>>> I checked the code for all references to 'oom_adj' and
>>>>> 'oom_adjust' in
>>>>> both code sets, but found no obvious differences:
>>>>> grep -R -e oom_adjust -e oom_adj . | sort | grep -R -e oom_adjust -e
>>>>> oom_adj
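A sketch of a more direct comparison: diffing the sorted grep output of the
two trees removes the need for a second grep pass entirely (the tree paths
are placeholders for the two source checkouts):

```shell
#!/bin/bash
# Diff every oom_adj/oom_adjust reference between two kernel trees.
# Usage: ./oomdiff.sh <tree-a> <tree-b>  -- tree paths are placeholders.
a=$1 b=$2
diff <(grep -R -e oom_adjust -e oom_adj "$a" | sed "s|^$a||" | sort) \
     <(grep -R -e oom_adjust -e oom_adj "$b" | sed "s|^$b||" | sort)
```

An empty diff confirms the references are identical; any difference points
straight at the file that changed.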
>>>>>
>>>>> Then I checked for references to "-17" in all .c and .h files, and
>>>>> found a couple of matches, but only one obvious one:
>>>>> grep -R "\-17" . | grep -e ".c:" -e ".h:" -e "\-17" | wc -l
>>>>> ./include/linux/oom.h:#define OOM_DISABLE (-17)
>>>>>
>>>>> But again, a search for OOM_DISABLE came up with nothing obvious...
>>>>>
>>>>> In a last-ditch attempt, I did a search for all references to 'oom'
>>>>> (case-insensitive) in both code bases, then compared the two:
>>>>> root@...abelle [~/lol/linux-2.6.32.28]> grep -i -R "oom" . |
>>>>> sort -n > /tmp/annabelle.oom_adj
>>>>> root@...ky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> grep -i -R
>>>>> "oom" . | sort -n > /tmp/vicky.oom_adj
>>>>>
>>>>> and this brought back (yet again) nothing obvious..
>>>>>
>>>>>
>>>>> root@...ky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> md5sum
>>>>> ./include/linux/oom.h
>>>>> 2a32622f6cd38299fc2801d10a9a3ea8 ./include/linux/oom.h
>>>>>
>>>>> root@...abelle [~/lol/linux-2.6.32.28]> md5sum
>>>>> ./include/linux/oom.h
>>>>> 2a32622f6cd38299fc2801d10a9a3ea8 ./include/linux/oom.h
>>>>>
>>>>> root@...ky [/mnt/encstore/ssd/kernel/linux-2.6.32.41]> md5sum
>>>>> ./mm/oom_kill.c
>>>>> 1ef2c2bec19868d13ec66ec22033f10a ./mm/oom_kill.c
>>>>>
>>>>> root@...abelle [~/lol/linux-2.6.32.28]> md5sum ./mm/oom_kill.c
>>>>> 1ef2c2bec19868d13ec66ec22033f10a ./mm/oom_kill.c
>>>>>
>>>>>
>>>>>
>>>>> Could anyone please shed some light on why the default oom_adj is
>>>>> now set to -17 (and where it is actually set)? From what I can tell,
>>>>> the fix for this issue will either be:
>>>>>
>>>>> 1. Allow OOM killer to override the decision of ignoring
>>>>> oom_adj ==
>>>>> -17 if an unrecoverable loop is encountered.
>>>>> 2. Change the default back to 0.
>>>>>
>>>>> Again, my apologies if this bug report is slightly unorthodox, or
>>>>> doesn't follow usual procedure etc. I can assure you I have tried my
>>>>> absolute best to give all the necessary information though.
>>>>>
>>>>> Cal
>>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe
>>>> linux-kernel" in
>>>> the body of a message to majordomo@...r.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>> Please read the FAQ at http://www.tux.org/lkml/
>>>>
>