Message-ID: <alpine.DEB.2.10.1509281512330.13657@chino.kir.corp.google.com>
Date: Mon, 28 Sep 2015 15:24:06 -0700 (PDT)
From: David Rientjes <rientjes@...gle.com>
To: Michal Hocko <mhocko@...nel.org>
cc: Oleg Nesterov <oleg@...hat.com>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Kyle Walker <kwalker@...hat.com>,
Christoph Lameter <cl@...ux.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Vladimir Davydov <vdavydov@...allels.com>,
linux-mm <linux-mm@...ck.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Stanislav Kozina <skozina@...hat.com>,
Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
Subject: Re: can't oom-kill zap the victim's memory?
On Fri, 25 Sep 2015, Michal Hocko wrote:
> > > I am still not sure how you want to implement that kernel thread but I
> > > am quite skeptical it would be very much useful because all the current
> > > allocations which end up in the OOM killer path cannot simply back off
> > > and drop the locks with the current allocator semantic. So they will
> > > be sitting on top of unknown pile of locks whether you do an additional
> > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > only argument that I can see is the stack usage but I haven't seen stack
> > > overflows in the OOM path AFAIR.
> > >
> >
> > Which locks are you specifically interested in?
>
> Any locks they were holding before they entered the page allocator (e.g.
> i_mutex is the easiest one to trigger from the userspace but mmap_sem
> might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> mmap_sem held for write). Those would be locked until the page allocator
> returns, which with the current semantic might be _never_.
>
I agree that i_mutex seems to be one of the most common offenders.
However, I'm not sure I understand why holding it while trying to allocate
infinitely for an order-0 allocation is problematic wrt the proposed
kthread. The kthread itself need only take mmap_sem for read. If all
threads sharing the mm with a victim have been SIGKILL'd, they should get
TIF_MEMDIE set when reclaim fails and be able to allocate so that they can
drop mmap_sem. We must ensure that any holder of mmap_sem cannot quickly
deplete memory reserves without properly checking for
fatal_signal_pending().
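
To make the last point concrete, here is a minimal userspace sketch, not
kernel code. The toy_* names, the two structs and the single-floor reserve
model are all assumptions made up for illustration. It shows a SIGKILL'd
task being allowed below the reserve floor once reclaim fails, and a
multi-page loop that stops as soon as a fatal signal is pending instead of
draining the reserves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model; none of these names are kernel APIs. */
struct toy_task {
	bool fatal_signal_pending; /* stands in for fatal_signal_pending(current) */
	bool memdie;               /* stands in for TIF_MEMDIE */
};

struct toy_zone {
	size_t free_pages;    /* pages currently free */
	size_t reserve_pages; /* floor that ordinary allocations may not cross */
};

/* Try to hand out one order-0 page; returns true on success. */
bool toy_alloc_page(struct toy_zone *z, struct toy_task *t)
{
	assert(z != NULL && t != NULL);

	/* Ordinary path: only succeeds while free memory is above the floor. */
	if (z->free_pages > z->reserve_pages) {
		z->free_pages--;
		return true;
	}

	/* "Reclaim failed": a SIGKILL'd task is marked as a victim. */
	if (t->fatal_signal_pending)
		t->memdie = true;

	/* Only a victim may dip below the floor. */
	if (t->memdie && z->free_pages > 0) {
		z->free_pages--;
		return true;
	}
	return false;
}

/*
 * A multi-page loop of the kind that might run with a lock held.
 * With check_signal set, it bails out before each allocation once a
 * fatal signal is pending, so it cannot drain the reserves.
 * Returns the number of pages obtained.
 */
size_t toy_populate(struct toy_zone *z, struct toy_task *t,
		    size_t nr, bool check_signal)
{
	size_t done = 0;

	while (done < nr) {
		if (check_signal && t->fatal_signal_pending)
			break;
		if (!toy_alloc_page(z, t))
			break;
		done++;
	}
	return done;
}
```

In this model a single allocation by a killed task succeeds out of the
reserves, which is what would let it return and drop its lock, while the
same task running toy_populate() with check_signal set takes nothing
further.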
> > We have already discussed
> > the usefulness of killing all threads on the system sharing the same ->mm,
> > meaning all threads that are either holding or want to hold mm->mmap_sem
> > will be able to allocate into memory reserves. Any allocator holding
> > down_write(&mm->mmap_sem) should be able to allocate and drop its lock.
> > (Are you concerned about MAP_POPULATE?)
>
> I am not sure I understand. We would have to fail the request in order
> the context which requested the memory could drop the lock. Are we
> talking about the same thing here?
>
Not fail the request: they should be able to allocate from memory reserves
when TIF_MEMDIE gets set. This would require that threads in all gfp
contexts are able to get TIF_MEMDIE set without an explicit call to
out_of_memory() for !__GFP_FS.
> > Heh, it's actually imperative to avoid livelocking based on mm->mmap_sem,
> > it's the reason the code exists. Any optimizations to that is certainly
> > welcome, but we definitely need to send SIGKILL to all threads sharing the
> > mm to make forward progress, otherwise we are going back to pre-2008
> > livelocks.
>
> Yes but mm is not shared between processes most of the time. CLONE_VM
> without CLONE_THREAD is more a corner case yet we have to crawl all the
> task_structs for _each_ OOM killer invocation. Yes this is an extreme
> slow path but still might take quite some unnecessarily time.
>
It must solve the issue you describe, killing other processes that share
the ->mm, otherwise we have mm->mmap_sem livelock. We are not concerned
about iterating over all task_structs in the oom killer as a pain point:
users for whom it is one should already be using oom_kill_allocating_task,
which is why it was introduced.
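
For anyone wanting to check whether a machine has that knob set, a small
userspace sketch follows. The sysctl lives at
/proc/sys/vm/oom_kill_allocating_task; the helper names read_int_file() and
oom_kill_allocating_task_enabled() are made up for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single integer from a sysctl-style file.
 * Returns 0 on success and stores the value in *out; returns -1 on error.
 */
int read_int_file(const char *path, int *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (f == NULL)
		return -1;
	ok = (fscanf(f, "%d", out) == 1);
	fclose(f);
	return ok ? 0 : -1;
}

/* Returns 1 if enabled, 0 if disabled, -1 if the file could not be read. */
int oom_kill_allocating_task_enabled(void)
{
	int v;

	if (read_int_file("/proc/sys/vm/oom_kill_allocating_task", &v) != 0)
		return -1;
	return v != 0;
}
```

A non-zero value means the oom killer kills the task that triggered the
out-of-memory condition instead of scanning the tasklist for a victim.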