Message-Id: <201602151958.HCJ48972.FFOFOLMHSQVJtO@I-love.SAKURA.ne.jp>
Date: Mon, 15 Feb 2016 19:58:50 +0900
From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To: hannes@...xchg.org, mhocko@...nel.org
Cc: akpm@...ux-foundation.org, mgorman@...e.de, rientjes@...gle.com,
torvalds@...ux-foundation.org, oleg@...hat.com, hughd@...gle.com,
andrea@...nel.org, riel@...hat.com, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, mhocko@...e.com
Subject: Re: [PATCH 3/2] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space
Andrew Morton wrote:
>
> The patch titled
> Subject: mm/oom_kill.c: don't ignore oom score on exiting tasks
> has been removed from the -mm tree. Its filename was
> mm-oom_killc-dont-skip-pf_exiting-tasks-when-searching-for-a-victim.patch
>
> This patch was dropped because an updated version will be merged
>
> ------------------------------------------------------
> From: Johannes Weiner <hannes@...xchg.org>
> Subject: mm/oom_kill.c: don't ignore oom score on exiting tasks
>
> When the OOM killer scans tasks and encounters a PF_EXITING one, it
> force-selects that task regardless of the score. But the task might hang
> after it has set PF_EXITING; in that case the OOM killer should be able
> to move on to the next task.
>
> Frankly, I don't even know why we check for exiting tasks in the OOM
> killer. We've tried direct reclaim at least 15 times by the time we
> decide the system is OOM; there was plenty of time to exit and free
> memory, and a task might exit voluntarily right after we issue a kill.
> This is testing pure noise.
>
I can't find an updated version of this patch in linux-next. Why don't you
submit it? I think the patch description should be updated, because this patch
solves yet another silent OOM livelock bug.
Say there is a process with two threads, Thread1 and Thread2. Since the OOM
killer sets TIF_MEMDIE only on the first thread it finds with a non-NULL mm,
it is possible that Thread2 invokes the OOM killer and Thread1 gets
TIF_MEMDIE (without SIGKILL being sent to the processes using Thread1's mm):
----------
  Thread1                               Thread2
                                        Calls mmap()
  Calls _exit(0)
                                        Arrives at vm_mmap_pgoff()
  Arrives at do_exit()
  Gets PF_EXITING via exit_signals()
                                        Calls down_write(&mm->mmap_sem)
                                        Calls do_mmap_pgoff()
  Calls down_read(&mm->mmap_sem)
  from exit_mm()
                                        Does a GFP_KERNEL allocation
                                        Calls out_of_memory()
                                        oom_scan_process_thread(Thread1)
                                        returns OOM_SCAN_ABORT
  Thread1's down_read(&mm->mmap_sem) is waiting for Thread2 to call
  up_write(&mm->mmap_sem), but Thread2 is waiting for Thread1 to set
  Thread1->mm = NULL ... silent OOM livelock!
----------
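For reference, the blocking point is the down_read() in exit_mm(). Below is
a trimmed sketch of exit_mm() as of 4.4-ish kernels (several details elided;
see kernel/exit.c for the real code):

        static void exit_mm(struct task_struct *tsk)
        {
                struct mm_struct *mm = tsk->mm;

                mm_release(tsk, mm);
                if (!mm)
                        return;
                /*
                 * Serialize with any possible pending coredump: we must
                 * hold mmap_sem around checking core_state and clearing
                 * tsk->mm. This down_read() is where Thread1 sleeps
                 * forever, because Thread2 holds mmap_sem for write while
                 * looping inside the allocator waiting for Thread1 to
                 * free its memory.
                 */
                down_read(&mm->mmap_sem);
                if (mm->core_state) {
                        /* coredump handshake elided; not reached above */
                }
                task_lock(tsk);
                tsk->mm = NULL;         /* what Thread2 is waiting for */
                up_read(&mm->mmap_sem);
                task_unlock(tsk);
                mmput(mm);
        }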
The OOM reaper tries to avoid this livelock by using down_read_trylock()
instead of down_read(), but the core_state check in exit_mm() cannot avoid
this livelock unless we use non-blocking allocations (i.e. GFP_ATOMIC or
GFP_NOWAIT) for every allocation done between down_write(&mm->mmap_sem) and
up_write(&mm->mmap_sem).
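For comparison, here is a minimal sketch of the reaper side, based on the
oom_reaper patches under discussion in this thread (names and details may
differ from what finally gets merged):

        static bool __oom_reap_task(struct task_struct *tsk)
        {
                struct mm_struct *mm = tsk->mm;

                /*
                 * Unlike exit_mm(), the reaper never sleeps on mmap_sem.
                 * If the allocating thread holds it for write, report
                 * failure and let the oom_reaper kernel thread retry.
                 */
                if (!down_read_trylock(&mm->mmap_sem))
                        return false;

                /* unmapping of the victim's anonymous memory elided */

                up_read(&mm->mmap_sem);
                return true;
        }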
I think that the same problem exists for any task_will_free_mem()-based
optimization, such as
        if (current->mm &&
            (fatal_signal_pending(current) || task_will_free_mem(current))) {
                mark_oom_victim(current);
                return true;
        }
in out_of_memory() and
        task_lock(p);
        if (p->mm && task_will_free_mem(p)) {
                mark_oom_victim(p);
                task_unlock(p);
                put_task_struct(p);
                return;
        }
        task_unlock(p);
in oom_kill_process() and
        if (fatal_signal_pending(current) || task_will_free_mem(current)) {
                mark_oom_victim(current);
                goto unlock;
        }
in mem_cgroup_out_of_memory().
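Note that task_will_free_mem() here is nothing but a PF_EXITING test; from
include/linux/oom.h (modulo minor version differences):

        static inline bool task_will_free_mem(struct task_struct *task)
        {
                return (task->flags & PF_EXITING) &&
                        !(task->signal->flags & SIGNAL_GROUP_COREDUMP);
        }

So each of these shortcuts fires for any PF_EXITING thread, including one
that can still block on mmap_sem before reaching tsk->mm = NULL.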
Well, what are the possible callers of task_will_free_mem(current) between
getting PF_EXITING and doing current->mm = NULL? tty_audit_exit() seems to be
one example: it does a GFP_KERNEL allocation from tty_audit_log(), so
TIF_MEMDIE can be set on a thread at tty_audit_log() called from
tty_audit_exit(), and that thread can later be blocked at down_read() in
exit_mm().
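The window is visible in the do_exit() ordering; again a trimmed sketch of
kernel/exit.c (details elided):

        void do_exit(long code)
        {
                struct task_struct *tsk = current;
                int group_dead;

                exit_signals(tsk);      /* sets PF_EXITING */
                group_dead = atomic_dec_and_test(&tsk->signal->live);
                /* ... */
                if (group_dead)
                        tty_audit_exit();       /* -> tty_audit_log() ->
                                                   audit_log_start(NULL,
                                                   GFP_KERNEL, AUDIT_TTY) */
                /* ... */
                exit_mm(tsk);   /* can block at down_read(&mm->mmap_sem)
                                   before setting tsk->mm = NULL */
                /* ... */
        }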
Is task_will_free_mem(current) possible in the mem_cgroup_out_of_memory() case?