linux-kernel - Re: can't oom-kill zap the victim's memory?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.10.1509291547560.3375@chino.kir.corp.google.com>
Date:	Tue, 29 Sep 2015 15:56:30 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>
cc:	mhocko@...nel.org, oleg@...hat.com, torvalds@...ux-foundation.org,
	kwalker@...hat.com, cl@...ux.com, akpm@...ux-foundation.org,
	hannes@...xchg.org, vdavydov@...allels.com, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org, skozina@...hat.com
Subject: Re: can't oom-kill zap the victim's memory?

On Tue, 29 Sep 2015, Tetsuo Handa wrote:

> Is the story such simple? I think there are factors which disturb memory
> allocation with mmap_sem held for writing.
> 
>   down_write(&mm->mmap_sem);
>   kmalloc(GFP_KERNEL);
>   up_write(&mm->mmap_sem);
> 
> can involve locks inside __alloc_pages_slowpath().
> 
> Say, there are three userspace tasks named P1, P2T1, P2T2 and
> one kernel thread named KT1. Only P2T1 and P2T2 shares the same mm.
> KT1 is a kernel thread for fs writeback (maybe kswapd?).
> I think sequence shown below is possible.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2T1 enters into kernel mode and calls kmalloc().
> 
> (4) P2T1 arrives at __alloc_pages_may_oom() because there was no
>     reclaimable memory. (Memory allocated by P1 is not reclaimable
>     as of this moment.)
> 
> (5) P1 dirties memory allocated for buffered write.
> 
> (6) P2T2 enters into kernel mode and calls kmalloc() with
>     mmap_sem held for writing.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2T2 is blocked at unkillable lock for fs writeback held by KT1.
> 
> (10) P2T1 calls out_of_memory() and the OOM killer chooses P2T1 and sets
>      TIF_MEMDIE on both P2T1 and P2T2.
> 
> (11) P2T2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>      held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2T1 and
>      P2T2 cannot release memory because memory unmapping code cannot hold
>      mmap_sem for reading, KT1 waits forever.... OOM livelock completed!
> 
> I think sequence shown below is also possible. Say, there are three
> userspace tasks named P1, P2, P3 and one kernel thread named KT1.
> 
> (1) P1 enters into kernel mode via write() syscall.
> 
> (2) P1 allocates memory for buffered write.
> 
> (3) P2 enters into kernel mode and holds mmap_sem for writing.
> 
> (4) P3 enters into kernel mode and calls kmalloc().
> 
> (5) P3 arrives at __alloc_pages_may_oom() because there was no
>     reclaimable memory. (Memory allocated by P1 is not reclaimable
>     as of this moment.)
> 
> (6) P1 dirties memory allocated for buffered write.
> 
> (7) KT1 finds dirtied memory.
> 
> (8) KT1 holds fs's unkillable lock for fs writeback.
> 
> (9) P2 calls kmalloc() and is blocked at unkillable lock for fs writeback
>     held by KT1.
> 
> (10) P3 calls out_of_memory() and the OOM killer chooses P2 and sets
>      TIF_MEMDIE on P2.
> 
> (11) P2 got TIF_MEMDIE but is blocked at unkillable lock for fs writeback
>      held by KT1.
> 
> (12) KT1 is trying to allocate memory for fs writeback. But since P2 cannot
>      release memory because memory unmapping code cannot hold mmap_sem for
>      reading, KT1 waits forever.... OOM livelock completed!
> 
> So, allowing all OOM victim threads to use memory reserves does not guarantee
> that a thread which held mmap_sem for writing to make forward progress.
> 

Thank you for writing this all out, it definitely helps to understand the 
concerns.

This, in my understanding, is the same scenario that requires not only oom 
victims to be able to access memory reserves, but also any thread after an 
oom victim has failed to make a timely exit.

I point out mm->mmap_sem as a special case because we have had fixes in 
the past, such as the special fatal_signal_pending() handling in 
__get_user_pages(), that try to ensure forward progress since we know that 
we need exclusive mm->mmap_sem for the victim to make an exit.

I think both of your illustrations show why it is not helpful to kill 
additional processes after a time period has elapsed and a victim has 
failed to exit.  In both of your scenarios, it would require that KT1 be 
killed to allow forward progress and we know that's not possible.

Perhaps this is an argument that we need to provide access to memory 
reserves for threads even for !__GFP_WAIT and !__GFP_FS in such scenarios, 
but I would wait to make that extension until we see it in practice.

Killing all mm->mmap_sem threads certainly isn't meant to solve all oom 
killer livelocks, as you show.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/