Date:	Tue, 29 Sep 2015 16:57:25 +0900
From:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:	rientjes@...gle.com, mhocko@...nel.org
Cc:	oleg@...hat.com, torvalds@...ux-foundation.org, kwalker@...hat.com,
	cl@...ux.com, akpm@...ux-foundation.org, hannes@...xchg.org,
	vdavydov@...allels.com, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, skozina@...hat.com
Subject: Re: can't oom-kill zap the victim's memory?

David Rientjes wrote:
> On Fri, 25 Sep 2015, Michal Hocko wrote:
> > > > I am still not sure how you want to implement that kernel thread but I
> > > > am quite skeptical it would be very much useful because all the current
> > > > allocations which end up in the OOM killer path cannot simply back off
> > > > and drop the locks with the current allocator semantic.  So they will
> > > > be sitting on top of unknown pile of locks whether you do an additional
> > > > reclaim (unmap the anon memory) in the direct OOM context or looping
> > > > in the allocator and waiting for kthread/workqueue to do its work. The
> > > > only argument that I can see is the stack usage but I haven't seen stack
> > > > overflows in the OOM path AFAIR.
> > > > 
> > > 
> > > Which locks are you specifically interested in?
> > 
> > Any locks they were holding before they entered the page allocator (e.g.
> > i_mutex is the easiest one to trigger from the userspace but mmap_sem
> > might be involved as well because we are doing kmalloc(GFP_KERNEL) with
> > mmap_sem held for write). Those would be locked until the page allocator
> > returns, which with the current semantic might be _never_.
> > 
> 
> I agree that i_mutex seems to be one of the most common offenders.  
> However, I'm not sure I understand why holding it while trying to allocate 
> infinitely for an order-0 allocation is problematic wrt the proposed 
> kthread.  The kthread itself need only take mmap_sem for read.  If all 
> threads sharing the mm with a victim have been SIGKILL'd, they should get 
> TIF_MEMDIE set when reclaim fails and be able to allocate so that they can 
> drop mmap_sem.  We must ensure that any holder of mmap_sem cannot quickly 
> deplete memory reserves without properly checking for 
> fatal_signal_pending().
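
For concreteness, here is a minimal sketch of the kind of unmapping thread
being discussed, assuming it only takes mmap_sem for read as described above
(the function name and the VMA filtering are mine, not an actual patch):

#include <linux/mm.h>
#include <linux/rwsem.h>

static void oom_zap_victim_mm(struct mm_struct *mm)
{
        struct vm_area_struct *vma;

        /* Give up if we cannot get mmap_sem for read (e.g. a writer holds it). */
        if (!down_read_trylock(&mm->mmap_sem))
                return;
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                /* Only private anonymous mappings are safe to discard here. */
                if (vma->vm_file || (vma->vm_flags & VM_SHARED))
                        continue;
                zap_page_range(vma, vma->vm_start,
                               vma->vm_end - vma->vm_start, NULL);
        }
        up_read(&mm->mmap_sem);
}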

Is the story really that simple? I think there are factors which can stall
memory allocation done with mmap_sem held for writing.

  down_write(&mm->mmap_sem);
  kmalloc(GFP_KERNEL);
  up_write(&mm->mmap_sem);

can block on locks taken inside __alloc_pages_slowpath().

Say there are three userspace tasks named P1, P2T1, and P2T2, and one
kernel thread named KT1. Only P2T1 and P2T2 share the same mm.
KT1 is a kernel thread doing fs writeback (maybe kswapd?).
I think the sequence shown below is possible.

(1) P1 enters kernel mode via the write() syscall.

(2) P1 allocates memory for a buffered write.

(3) P2T1 enters kernel mode and calls kmalloc().

(4) P2T1 reaches __alloc_pages_may_oom() because there is no
    reclaimable memory. (The memory allocated by P1 is not reclaimable
    at this point.)

(5) P1 dirties the memory allocated for the buffered write.

(6) P2T2 enters kernel mode and calls kmalloc() with
    mmap_sem held for writing.

(7) KT1 finds the dirtied memory.

(8) KT1 takes an unkillable fs writeback lock.

(9) P2T2 is blocked on the unkillable fs writeback lock held by KT1.

(10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
     TIF_MEMDIE on both P2T1 and P2T2.

(11) P2T2 now has TIF_MEMDIE but is still blocked on the unkillable fs
     writeback lock held by KT1.

(12) KT1 tries to allocate memory for fs writeback. But since P2T1 and
     P2T2 cannot release memory (the memory unmapping code cannot take
     mmap_sem for reading), KT1 waits forever... OOM livelock completed!
     (See the sketch below.)

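In short, any attempt to unmap P2T1/P2T2's memory needs mmap_sem for read,
but P2T2 sleeps forever inside kmalloc() while holding it for write.
As a minimal sketch:

  /* P2T2, steps (6) and (9): */
  down_write(&mm->mmap_sem);
  kmalloc(GFP_KERNEL);        /* sleeps forever behind KT1's writeback lock */

  /* memory unmapping code (e.g. the proposed kernel thread), step (12): */
  down_read(&mm->mmap_sem);   /* sleeps forever behind the writer above */
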
I think the sequence shown below is also possible. Say there are three
userspace tasks named P1, P2, and P3, and one kernel thread named KT1.

(1) P1 enters kernel mode via the write() syscall.

(2) P1 allocates memory for a buffered write.

(3) P2 enters kernel mode and takes mmap_sem for writing.

(4) P3 enters kernel mode and calls kmalloc().

(5) P3 reaches __alloc_pages_may_oom() because there is no
    reclaimable memory. (The memory allocated by P1 is not reclaimable
    at this point.)

(6) P1 dirties the memory allocated for the buffered write.

(7) KT1 finds the dirtied memory.

(8) KT1 takes an unkillable fs writeback lock.

(9) P2 calls kmalloc() and is blocked on the unkillable fs writeback lock
    held by KT1.

(10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
     TIF_MEMDIE on P2.

(11) P2 now has TIF_MEMDIE but is still blocked on the unkillable fs
     writeback lock held by KT1.

(12) KT1 tries to allocate memory for fs writeback. But since P2 cannot
     release memory (the memory unmapping code cannot take mmap_sem for
     reading), KT1 waits forever... OOM livelock completed! (See the
     sketch below.)

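Again as a minimal sketch, the circular wait in this second sequence (the
lock name is illustrative only):

  /* KT1, steps (8) and (12): */
  mutex_lock(&fs_writeback_lock);   /* an unkillable lock for fs writeback */
  alloc_page(GFP_NOFS);             /* waits forever for memory to be freed */

  /* P2, steps (3) and (9): */
  down_write(&mm->mmap_sem);
  kmalloc(GFP_KERNEL);              /* ends up waiting on fs_writeback_lock */

  /* Nobody can free P2's memory, because unmapping it needs mmap_sem for
     read and the blocked writer P2 will never release it. */
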
So, allowing all OOM victim threads to use memory reserves does not guarantee
that a thread which holds mmap_sem for writing will make forward progress.
