linux-kernel - Re: [patch 0/7] improve memcg oom killer robustness v2

Message-Id: <20130904114523.A9F0173C@pobox.sk>
Date:	Wed, 04 Sep 2013 11:45:23 +0200
From:	"azurIt" <azurit@...ox.sk>
To:	Johannes Weiner <hannes@...xchg.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Michal Hocko <mhocko@...e.cz>,
	David Rientjes <rientjes@...gle.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	<linux-mm@...ck.org>, <cgroups@...r.kernel.org>, <x86@...nel.org>,
	<linux-arch@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2

>Hello azur,
>
>On Mon, Sep 02, 2013 at 12:38:02PM +0200, azurIt wrote:
>> >>Hi azur,
>> >>
>> >>here is the x86-only rollup of the series for 3.2.
>> >>
>> >>Thanks!
>> >>Johannes
>> >>---
>> >
>> >
>> >Johannes,
>> >
>> >unfortunately, one problem arises: I (again) have a cgroup which cannot be deleted :( It belongs to a user who had very high memory usage and was reaching his limit very often. Do you need any info which I can gather now?
>
>Did the OOM killer go off in this group?
>
>Was there a warning in the syslog ("Fixing unhandled memcg OOM
>context")?
>
>If it happens again, could you check if there are tasks left in the
>cgroup?  And provide /proc/<pid>/stack of the hung task trying to
>delete the cgroup?
>
>> Now I can definitely confirm that the problem is NOT fixed :( It happened again, but I don't have any data because I had already disabled all debug output.
>
>Which debug output?
>
>Do you still have access to the syslog?
>
>It's possible that, as your system does not deadlock on the OOMing
>cgroup anymore, you hit a separate bug...
>
>Thanks!



My script has just detected (and killed) another frozen cgroup. I must say that I'm not 100% sure the cgroup was really frozen, but its memory usage was at 99% or more for at least 30 seconds (more precisely, it was at 99% or more on both of the two occasions when the script checked it). Here are the stacks of the processes inside it before they were killed:
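
For reference, the check described above could be sketched roughly as follows. This is not the actual script, only a minimal illustration. It assumes a cgroup v1 memory controller directory (exposing memory.usage_in_bytes, memory.limit_in_bytes and tasks); the function names, the threshold constant and the interval are hypothetical.

```python
import os
import time

THRESHOLD = 0.99  # assumed: fraction of the limit treated as "stuck at the limit"
INTERVAL = 30     # assumed: seconds between the two checks


def usage_ratio(cgroup_dir):
    """Return memory usage as a fraction of the limit (cgroup v1 memcg files)."""
    with open(os.path.join(cgroup_dir, "memory.usage_in_bytes")) as f:
        usage = int(f.read())
    with open(os.path.join(cgroup_dir, "memory.limit_in_bytes")) as f:
        limit = int(f.read())
    return usage / limit if limit > 0 else 0.0


def looks_stuck(samples, threshold=THRESHOLD):
    """True if there are at least two samples and all are at or above the threshold."""
    return len(samples) >= 2 and all(s >= threshold for s in samples)


def dump_stacks(cgroup_dir, proc_root="/proc"):
    """Return {id: stack text} for every entry listed in the cgroup's tasks file.

    The v1 "tasks" file lists thread IDs; a task that has already exited,
    or whose stack cannot be read, gets an empty string.
    """
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        ids = [line.strip() for line in f if line.strip()]
    stacks = {}
    for task_id in ids:
        try:
            with open(os.path.join(proc_root, task_id, "stack")) as f:
                stacks[task_id] = f.read()
        except OSError:
            stacks[task_id] = ""
    return stacks


def check(cgroup_dir, interval=INTERVAL):
    """Sample usage twice; return the stacks if both samples hit the threshold."""
    first = usage_ratio(cgroup_dir)
    time.sleep(interval)
    second = usage_ratio(cgroup_dir)
    if looks_stuck([first, second]):
        return dump_stacks(cgroup_dir)
    return None
```

Note that two samples at or above 99% do not prove the group stayed there for the whole interval, which is the uncertainty mentioned above.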



pid: 26490
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26503
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26517
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26518
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26519
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26520
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26521
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26522
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26523
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26524
stack:
[<ffffffff81052671>] sys_sched_yield+0x41/0x70
[<ffffffff81148d91>] free_more_memory+0x21/0x60
[<ffffffff8114941d>] __getblk+0x14d/0x2c0
[<ffffffff8119888b>] ext3_getblk+0xeb/0x240
[<ffffffff811989f9>] ext3_bread+0x19/0x90
[<ffffffff8119cea3>] ext3_dx_find_entry+0x83/0x1e0
[<ffffffff8119d2e4>] ext3_find_entry+0x2e4/0x480
[<ffffffff8119dbcd>] ext3_lookup+0x4d/0x120
[<ffffffff811228f5>] d_alloc_and_lookup+0x45/0x90
[<ffffffff81125578>] __lookup_hash+0xa8/0xf0
[<ffffffff81127852>] do_last+0x312/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26526
stack:
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26531
stack:
[<ffffffff81127842>] do_last+0x302/0xa60
[<ffffffff81128077>] path_openat+0xd7/0x470
[<ffffffff81128529>] do_filp_open+0x49/0xa0
[<ffffffff81114a16>] do_sys_open+0x106/0x240
[<ffffffff81114b90>] sys_open+0x20/0x30
[<ffffffff815cbce6>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26533
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26536
stack:
[<ffffffff81080a45>] refrigerator+0x95/0x160
[<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
[<ffffffff8100188b>] do_signal+0x6b/0x750
[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
[<ffffffff815cb662>] retint_signal+0x3d/0x7b
[<ffffffffffffffff>] 0xffffffffffffffff


pid: 26539
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff
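

A dump in the format above is easier to read when the tasks are grouped by their topmost kernel frame. The following is a minimal sketch of such a grouping; the function names are hypothetical, and the sample reuses three of the stacks listed above.

```python
import re
from collections import Counter

# Matches lines such as "[<ffffffff81127842>] do_last+0x302/0xa60".
FRAME_RE = re.compile(r"^\[<[0-9a-f]+>\]\s+(\S+)$")

SAMPLE = """\
pid: 26519
stack:
[<ffffffff815cb618>] retint_careful+0xd/0x1a
[<ffffffffffffffff>] 0xffffffffffffffff

pid: 26526
stack:
[<ffffffffffffffff>] 0xffffffffffffffff

pid: 26536
stack:
[<ffffffff81080a45>] refrigerator+0x95/0x160
[<ffffffff8106ac2b>] get_signal_to_deliver+0x1cb/0x540
[<ffffffffffffffff>] 0xffffffffffffffff
"""


def parse_dump(text):
    """Return {pid: [frame, ...]} from "pid:" / "stack:" formatted text."""
    stacks = {}
    pid = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("pid:"):
            pid = int(line.split(":", 1)[1])
            stacks[pid] = []
            continue
        match = FRAME_RE.match(line)
        if match and pid is not None:
            stacks[pid].append(match.group(1))
    return stacks


def top_frame(frames):
    """Return the symbol of the first real frame, without offset and size."""
    for frame in frames:
        if frame != "0xffffffffffffffff":  # end-of-stack marker, not a symbol
            return frame.split("+", 1)[0]
    return "(no kernel frames)"


def summarize(text):
    """Count tasks per topmost kernel symbol."""
    return Counter(top_frame(frames) for frames in parse_dump(text).values())


if __name__ == "__main__":
    for symbol, count in summarize(SAMPLE).most_common():
        print(count, symbol)
```

Run over the full dump, most tasks would fall into the do_last bucket, with the remainder in retint_careful, sys_sched_yield, refrigerator, or with no kernel frames at all.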