Message-ID: <46e1e3ee-af9a-4e67-8b4b-5cf21478ad21@I-love.SAKURA.ne.jp>
Date:   Fri, 28 Jul 2017 21:59:50 +0900
From:   Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:     Michal Hocko <mhocko@...nel.org>,
        Manish Jaggi <mjaggi@...iumnetworks.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: Possible race condition in oom-killer

(Oops. Forgot to add CC.)

On 2017/07/28 21:32, Michal Hocko wrote:
> [CC linux-mm]
>
> On Fri 28-07-17 17:22:25, Manish Jaggi wrote:
>> was: Re: [PATCH] mm, oom: allow oom reaper to race with exit_mmap
>>
>> Hi Michal,
>> On 7/27/2017 2:54 PM, Michal Hocko wrote:
>>> On Thu 27-07-17 13:59:09, Manish Jaggi wrote:
>>> [...]
>>>> With 4.11.6 I was getting random kernel panics (Out of memory - No process left to kill),
>>>>  when running LTP oom01 /oom02 ltp tests on our arm64 hardware with ~256G memory and high core count.
>>>> The issue experienced was as follows:
>>>> 	either test (oom01/oom02) selected a pid as victim and waited for that pid to be killed;
>>>> 	that pid was marked as killed, but somewhere there is a race and the process didn't actually get killed;
>>>> 	so the oom01/oom02 test kept killing further processes until the kernel panicked.
>>>> IIUC this issue is quite similar to your patch description. But applying your patch I still see the issue.
>>>> If it is not related to this patch, can you please suggest by looking at the log, what could be preventing
>>>> the killing of victim.
>>>>
>>>> Log (https://pastebin.com/hg5iXRj2)
>>>>
>>>> As a subtest of oom02 starts, it prints out the victim - In this case 4578
>>>>
>>>> oom02       0  TINFO  :  start OOM testing for mlocked pages.
>>>> oom02       0  TINFO  :  expected victim is 4578.
>>>>
>>>> When oom02 thread invokes oom-killer, it did select 4578  for killing...
>>> I will definitely have a look. Can you report it in a separate email
>>> thread please? Are you able to reproduce with the current Linus or
>>> linux-next trees?
>> Yes this issue is visible with linux-next.
>
> Could you provide the full kernel log from this run please? I do not
> expect there to be much difference but just to be sure that the code I
> am looking at matches logs.

4578 is consuming memory as mlocked pages. But the OOM reaper cannot reclaim
mlocked pages (i.e. can_madv_dontneed_vma() returns false due to VM_LOCKED), can it?

oom02       0  TINFO  :  start OOM testing for mlocked pages.
oom02       0  TINFO  :  expected victim is 4578.
[  365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB

As a result, MMF_OOM_SKIP is set without much memory having been reclaimed.
Thus it is natural that subsequent OOM victims are selected immediately, because
almost all memory is still in use. Since 4578 is multi-threaded (isn't it?),
it will take time to reach the final __mmput(), because mm->mm_users is large.
With that many threads, it is possible that every OOM-killable process gets
killed before the final __mmput() of 4578 (which releases the mlocked pages) is called.

>
> [...]
>>>> [  365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1,  order=0, oom_score_adj=0
>>> Yes because
>>> [  365.283499] Node 1 Normal free:19500kB min:33804kB low:165916kB high:298028kB active_anon:13312kB inactive_anon:172kB active_file:0kB inactive_file:1044kB unevictable:131560064kB writepending:0kB present:134213632kB managed:132113248kB mlocked:131560064kB slab_reclaimable:5748kB slab_unreclaimable:17808kB kernel_stack:2720kB pagetables:254636kB bounce:0kB free_pcp:10476kB local_pcp:144kB free_cma:0kB
>>>
>>> Although we have killed and reaped oom02 process Node1 is still below
>>> min watermark and that is why we have hit the oom killer again. It
>>> is not immediately clear to me why; that would require a deeper
>>> inspection.
>> I have a doubt here.
>> My understanding of the oom test: the oom() function basically forks itself and
>> starts n threads; each thread has a loop which allocates and touches memory,
>> thus triggering the oom-killer, which will kill the process. The parent process
>> is in a wait() and will print pass/fail.
>>
>> So IIUC when 4578 is reaped, all the child threads should be terminated,
>> which happens in the pass case (line 152).
>> But even after it is killed and reaped, the oom killer is invoked again,
>> which doesn't seem right.
>
> As I've said, the OOM killer hits because the memory from Node 1 didn't
> get freed for some reason or got immediately populated again.

Because the mlocked pages belong to a multi-threaded process, it will take
time until those mlocked pages are reclaimed.

>
>> Could it be that the process, including its threads, is just marked hidden
>> from oom, so the oom-killer continues?
>
> The whole process should be killed and the OOM reaper should only mark
> the victim oom invisible _after_ the address space has been reaped (and
> memory freed). You said the patch from
> http://lkml.kernel.org/r/20170724072332.31903-1-mhocko@kernel.org didn't
> help so it shouldn't be a race with the last __mmput.
>
> Thanks!
>
