[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131128021809.GI3556@cmpxchg.org>
Date: Wed, 27 Nov 2013 21:18:09 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: David Rientjes <rientjes@...gle.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, stable@...nel.org,
Michal Hocko <mhocko@...e.cz>, azurit@...ox.sk,
mm-commits@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-mm@...ck.org
Subject: Re: [merged]
mm-memcg-handle-non-error-oom-situations-more-gracefully.patch removed from
-mm tree
On Wed, Nov 27, 2013 at 04:56:04PM -0800, David Rientjes wrote:
> On Wed, 27 Nov 2013, Johannes Weiner wrote:
>
> > > The memcg oom killer has incurred a serious regression since the 3.12-rc6
> > > kernel where this was merged as 4942642080ea ("mm: memcg: handle non-error
> > > OOM situations more gracefully"). It cc'd stable@...nel.org although it
> > > doesn't appear to have been picked up yet, and I'm hoping that we can
> > > avoid having it merged in a stable kernel until we get this fixed.
> > >
> > > This patch, specifically the above, allows memcgs to bypass their limits
> > > by charging the root memcg in oom conditions.
> > >
> > > If I create a memcg, cg1, with memory.limit_in_bytes == 128MB and start a
> > > memory allocator in it to induce oom, the memcg limit is trivially broken:
> > >
> > > membench invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > > membench cpuset=/ mems_allowed=0-3
> > > CPU: 9 PID: 11388 Comm: membench Not tainted 3.13-rc1
> > > ffffc90015ec6000 ffff880671c3dc18 ffffffff8154a1e3 0000000000000007
> > > ffff880674c215d0 ffff880671c3dc98 ffffffff81548b45 ffff880671c3dc58
> > > ffffffff81151de7 0000000000000001 0000000000000292 ffff880800000000
> > > Call Trace:
> > > [<ffffffff8154a1e3>] dump_stack+0x46/0x58
> > > [<ffffffff81548b45>] dump_header+0x7a/0x1bb
> > > [<ffffffff81151de7>] ? find_lock_task_mm+0x27/0x70
> > > [<ffffffff812e8b76>] ? ___ratelimit+0x96/0x110
> > > [<ffffffff811521c4>] oom_kill_process+0x1c4/0x330
> > > [<ffffffff81099ee5>] ? has_ns_capability_noaudit+0x15/0x20
> > > [<ffffffff811a916a>] mem_cgroup_oom_synchronize+0x50a/0x570
> > > [<ffffffff811a8660>] ? __mem_cgroup_try_charge_swapin+0x70/0x70
> > > [<ffffffff81152968>] pagefault_out_of_memory+0x18/0x90
> > > [<ffffffff81547b86>] mm_fault_error+0xb7/0x15a
> > > [<ffffffff81553efb>] __do_page_fault+0x42b/0x500
> > > [<ffffffff810c667d>] ? set_next_entity+0xad/0xd0
> > > [<ffffffff810c670b>] ? pick_next_task_fair+0x6b/0x170
> > > [<ffffffff8154d08e>] ? __schedule+0x38e/0x780
> > > [<ffffffff81553fde>] do_page_fault+0xe/0x10
> > > [<ffffffff815509e2>] page_fault+0x22/0x30
> > > Task in /cg1 killed as a result of limit of /cg1
> > > memory: usage 131072kB, limit 131072kB, failcnt 1053
> > > memory+swap: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
> > > Memory cgroup stats for /cg1: cache:84KB rss:130988KB rss_huge:116736KB mapped_file:72KB writeback:0KB inactive_anon:0KB active_anon:130976KB inactive_file:4KB active_file:0KB unevictable:0KB
> > > [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> > > [ 7761] 0 7761 1106 483 5 0 0 bash
> > > [11388] 99 11388 270773 33031 83 0 0 membench
> > > Memory cgroup out of memory: Kill process 11388 (membench) score 1010 or sacrifice child
> > > Killed process 11388 (membench) total-vm:1083092kB, anon-rss:130824kB, file-rss:1300kB
> > >
> > > The score of 1010 shown for pid 11388 (membench) should never happen in
> > > the oom killer, the maximum value should always be 1000 in any oom
> > > context. This indicates that the process has allocated more memory than
> > > is available to the memcg. The rss value, 33031 pages, shows that it has
> > > allocated >129MB of memory in a memcg limited to 128MB.
> > >
> > > The entire premise of memcg is to prevent processes attached to it to not
> > > be able to allocate more memory than allowed and this trivially breaks
> > > that premise in oom conditions.
> >
> > We already allow a task to allocate beyond the limit if it's selected
> > by the OOM killer, so that it can exit faster.
> >
> > My patch added that a task can bypass the limit when it decided to
> > trigger the OOM killer, so that it can get to the OOM kill faster.
> >
>
> The task that is bypassing the memcg charge to the root memcg may not be
> the process that is chosen by the oom killer, and it's possible the amount
> of memory freed by killing the victim is less than the amount of memory
> bypassed.
That's true, though unlikely.
> > So I don't think my patch has broken "the entire premise of memcgs".
> > At the same time, it also does not really rely on that bypass, we
> > should be able to remove it.
> >
> > This patch series was not supposed to go into the last merge window, I
> > already told stable to hold off on these until further notice.
> >
>
> Were you targeting these to 3.13 instead? If so, it would have already
> appeared in 3.13-rc1 anyway. Is it still a work in progress?
I don't know how to answer this question.
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 13b9d0f..5f9e467 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2675,7 +2675,7 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
> > goto bypass;
> >
> > if (unlikely(task_in_memcg_oom(current)))
> > - goto bypass;
> > + goto nomem;
> >
> > /*
> > * We always charge the cgroup the mm_struct belongs to.
>
> Is there any benefit to doing this over just schedule_timeout_killable()
> since we need to wait for mem_cgroup_oom_synchronize() to be able to make
> forward progress at this point?
This does not make any sense, current is the process that will execute
mem_cgroup_oom_synchronize(). current entered OOM, it will kill once
the fault handler returns. This check is there to quickly end any
subsequent attempts of the fault handler to allocate memory after a
failed allocation and to get quickly to the OOM killer.
> Should we be checking mem_cgroup_margin() here to ensure
> task_in_memcg_oom() is still accurate and we haven't raced by freeing
> memory?
We would have invoked the OOM killer long before this point prior to
my patches. There is a line we draw and from that point on we start
killing things. I tried to explain multiple times now that there is
no race-free OOM killing and I'm tired of it. Convince me otherwise
or stop repeating this non-sense.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists