linux-kernel - Re: [PATCH] mm, oom: allow oom reaper to race with exit

Message-ID: <2d1fae74-04e1-213a-9139-4881a787525e@caviumnetworks.com>
Date:   Thu, 27 Jul 2017 13:59:09 +0530
From:   Manish Jaggi <mjaggi@...iumnetworks.com>
To:     linux-kernel@...r.kernel.org, Michal Hocko <mhocko@...e.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "Nair, Jayachandran" <Jayachandran.Nair@...ium.com>,
        Andrea Arcangeli <aarcange@...hat.com>,
        David Rientjes <rientjes@...gle.com>,
        Oleg Nesterov <oleg@...hat.com>,
        "Lomovtsev, Vadim" <Vadim.Lomovtsev@...ium.com>
Subject: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap

Hi Michal,

On Mon, Jul 24, 2017 at 09:23:32AM +0200, Michal Hocko wrote:
>From: Michal Hocko <mhocko@...e.com>
>
>David has noticed that the oom killer might kill additional tasks while
>the exiting oom victim hasn't terminated yet because the oom_reaper marks
>the current victim MMF_OOM_SKIP too early when mm->mm_users dropped down
>to 0. The race is as follows
>
>oom_reap_task				do_exit
>					  exit_mm
>  __oom_reap_task_mm
>					    mmput
>					      __mmput
>    mmget_not_zero # fails
>    						exit_mmap # frees memory
>  set_bit(MMF_OOM_SKIP)
>
>The victim is still visible to the OOM killer until it is unhashed.
>
>Currently we try to reduce the risk of this race by taking oom_lock
>and having out_of_memory sleep while holding the lock to give the
>victim some time to exit. This is a quite suboptimal approach because
>there is no guarantee the victim (especially a large one) will manage
>to unmap its address space and free enough memory to the particular oom
>domain which needs memory (e.g. a specific NUMA node).
>
>Fix this problem by allowing __oom_reap_task_mm and __mmput path to
>race. __oom_reap_task_mm is basically MADV_DONTNEED and that is allowed
>to run in parallel with other unmappers (hence the mmap_sem for read).
>
>The only tricky part is to exclude page tables tear down and all
>operations which modify the address space in the __mmput path. exit_mmap
>doesn't expect any other users so it doesn't use any locking. Nothing
>really forbids us to use mmap_sem for write, though. In fact we are
>already relying on this lock earlier in the __mmput path to synchronize
>with ksm and khugepaged.
>
>Take the exclusive mmap_sem when calling free_pgtables and destroying
>vmas to sync with __oom_reap_task_mm, which takes the lock for read. All
>other operations can safely race with the parallel unmap.
>
>Changes
>- bail on null mm->mmap early as per David Rientjes

With 4.11.6 I was getting random kernel panics (Out of memory - No process left to kill)
when running the LTP oom01/oom02 tests on our arm64 hardware with ~256G of memory and a high core count.
The issue I saw was as follows:
	- Either test (oom01/oom02) selected a pid as the victim and waited for that pid to be killed.
	- That pid was marked as killed, but somewhere there is a race and the process didn't get killed.
	- The oom01/oom02 test then started killing further processes until the kernel panicked.

IIUC this issue is quite similar to the one in your patch description, but with your patch applied I still see it.
If it is not related to this patch, could you please look at the log and suggest what could be preventing
the victim from being killed?

Log (https://pastebin.com/hg5iXRj2)

When a subtest of oom02 starts, it prints the expected victim, in this case 4578:

oom02       0  TINFO  :  start OOM testing for mlocked pages.
oom02       0  TINFO  :  expected victim is 4578.

When an oom02 thread invoked the oom-killer, it did select 4578 for killing:


[  364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
[...] snip
[  365.036127] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.044691] [ 1905]     0  1905     3236     1714      10       4        0             0 systemd-journal
[...] snip
[  365.222325] [ 4491]     0  4491    27965     1022       8       3        0             0 bash
[  365.230849] [ 4513]     0  4513      670      365       5       3        0             0 oom02
[  365.239459] [ 4578]     0  4578 37776030 32890957   64257     138        0             0 oom02
[  365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
[  365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
[  365.266829] out_of_memory: Current (4583) has a pending SIGKILL
[  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
[  365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB

==> At this point the test should have completed with a TPASS or TFAIL, but it didn't, and it went on to invoke the oom-killer again.

[  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
[  365.283368] oom02 cpuset=/ mems_allowed=0-1

Later it panics:
[  365.576298] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[  365.576338] [ 2421]     0  2421     3241      878       9       3        0         -1000 systemd-udevd
[  365.576342] [ 3125]     0  3125     3834      719       9       4        0         -1000 auditd
[  365.576347] [ 3309]     0  3309     3332      616      10       3        0         -1000 sshd
[  365.576356] [ 4580]     0  4578 37776030 32890417   64258     138        0             0 oom02
[  365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
[  365.576361]

-Thanks
Manish Jaggi