linux-kernel - Re: [PATCH -mm] memcg: do not trigger OOM from add_to_page_cache

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20121126174622.GE2799@cmpxchg.org>
Date:	Mon, 26 Nov 2012 12:46:22 -0500
From:	Johannes Weiner <hannes@...xchg.org>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	azurIt <azurit@...ox.sk>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, cgroups mailinglist <cgroups@...r.kernel.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [PATCH -mm] memcg: do not trigger OOM from
 add_to_page_cache_locked

On Mon, Nov 26, 2012 at 02:18:37PM +0100, Michal Hocko wrote:
> [CCing also Johannes - the thread started here:
> https://lkml.org/lkml/2012/11/21/497]
> 
> On Mon 26-11-12 01:38:55, azurIt wrote:
> > >This is hackish but it should help you in this case. Kamezawa, what do
> > >you think about that? Should we generalize this and prepare something
> > >like mem_cgroup_cache_charge_locked which would add __GFP_NORETRY
> > >automatically and use the function whenever we are in a locked context?
> > >To be honest I do not like this very much but nothing more sensible
> > >(without touching non-memcg paths) comes to my mind.
> > 
> > 
> > I installed kernel with this patch, will report back if problem occurs
> > again OR in few weeks if everything will be ok. Thank you!
> 
> Now that I am looking at the patch closer it will not work because it
> depends on other patch which is not merged yet and even that one would
> help on its own because __GFP_NORETRY doesn't break the charge loop.
> Sorry I have missed that...
> 
> The patch bellow should help though. (it is based on top of the current
> -mm tree but I will send a backport to 3.2 in the reply as well)
> ---
> >From 7796f942d62081ad45726efd90b5292b80e7c690 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@...e.cz>
> Date: Mon, 26 Nov 2012 11:47:57 +0100
> Subject: [PATCH] memcg: do not trigger OOM from add_to_page_cache_locked
> 
> memcg oom killer might deadlock if the process which falls down to
> mem_cgroup_handle_oom holds a lock which prevents other task to
> terminate because it is blocked on the very same lock.
> This can happen when a write system call needs to allocate a page but
> the allocation hits the memcg hard limit and there is nothing to reclaim
> (e.g. there is no swap or swap limit is hit as well and all cache pages
> have been reclaimed already) and the process selected by memcg OOM
> killer is blocked on i_mutex on the same inode (e.g. truncate it).
> 
> Process A
> [<ffffffff811109b8>] do_truncate+0x58/0xa0		# takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> Process B
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff

So process B manages to lock the hierarchy, calls
mem_cgroup_out_of_memory() and retries the charge infinitely, waiting
for task A to die.  All while it holds the i_mutex, preventing task A
from dying, right?

I think global oom already handles this in a much better way: invoke
the OOM killer, sleep for a second, then return to userspace to
relinquish all kernel resources and locks.  The only reason why we
can't simply change from an endless retry loop is because we don't
want to return VM_FAULT_OOM and invoke the global OOM killer.  But
maybe we can return a new VM_FAULT_OOM_HANDLED for memcg OOM and just
restart the pagefault.  Return -ENOMEM to the buffered IO syscall
respectively.  This way, the memcg OOM killer is invoked as it should
but nobody gets stuck anywhere livelocking with the exiting task.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/