[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171102131421.5clordblejmoon76@dhcp22.suse.cz>
Date: Thu, 2 Nov 2017 14:14:21 +0100
From: Michal Hocko <mhocko@...nel.org>
To: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, David Rientjes <rientjes@...gle.com>,
Manish Jaggi <mjaggi@...iumnetworks.com>,
Oleg Nesterov <oleg@...hat.com>,
Vladimir Davydov <vdavydov.dev@...il.com>
Subject: Re: [PATCH v2 2/2] mm,oom: Use ALLOC_OOM for OOM victim's last
second allocation.
On Thu 02-11-17 20:16:48, Tetsuo Handa wrote:
> Manish Jaggi noticed that running LTP oom01/oom02 ltp tests with high core
> count causes random kernel panics when an OOM victim which consumed memory
> in a way the OOM reaper does not help was selected by the OOM killer [1].
>
> ----------
> oom02 0 TINFO : start OOM testing for mlocked pages.
> oom02 0 TINFO : expected victim is 4578.
> oom02 0 TINFO : thread (ffff8b0e71f0), allocating 3221225472 bytes.
> oom02 0 TINFO : thread (ffff8b8e71f0), allocating 3221225472 bytes.
> (...snipped...)
> oom02 0 TINFO : thread (ffff8a0e71f0), allocating 3221225472 bytes.
> [ 364.737486] oom02:4583 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.036127] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 365.044691] [ 1905] 0 1905 3236 1714 10 4 0 0 systemd-journal
> [ 365.054172] [ 1908] 0 1908 20247 590 8 4 0 0 lvmetad
> [ 365.062959] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd
> [ 365.072266] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd
> [ 365.080963] [ 3145] 0 3145 1086 630 6 4 0 0 systemd-logind
> [ 365.090353] [ 3146] 0 3146 1208 596 7 3 0 0 irqbalance
> [ 365.099413] [ 3147] 81 3147 1118 625 5 4 0 -900 dbus-daemon
> [ 365.108548] [ 3149] 998 3149 116294 4180 26 5 0 0 polkitd
> [ 365.117333] [ 3164] 997 3164 19992 785 9 3 0 0 chronyd
> [ 365.126118] [ 3180] 0 3180 55605 7880 29 3 0 0 firewalld
> [ 365.135075] [ 3187] 0 3187 87842 3033 26 3 0 0 NetworkManager
> [ 365.144465] [ 3290] 0 3290 43037 1224 16 5 0 0 rsyslogd
> [ 365.153335] [ 3295] 0 3295 108279 6617 30 3 0 0 tuned
> [ 365.161944] [ 3308] 0 3308 27846 676 11 3 0 0 crond
> [ 365.170554] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd
> [ 365.179076] [ 3371] 0 3371 27307 364 6 3 0 0 agetty
> [ 365.187790] [ 3375] 0 3375 29397 1125 11 3 0 0 login
> [ 365.196402] [ 4178] 0 4178 4797 1119 14 4 0 0 master
> [ 365.205101] [ 4209] 89 4209 4823 1396 12 4 0 0 pickup
> [ 365.213798] [ 4211] 89 4211 4842 1485 12 3 0 0 qmgr
> [ 365.222325] [ 4491] 0 4491 27965 1022 8 3 0 0 bash
> [ 365.230849] [ 4513] 0 4513 670 365 5 3 0 0 oom02
> [ 365.239459] [ 4578] 0 4578 37776030 32890957 64257 138 0 0 oom02
> [ 365.248067] Out of memory: Kill process 4578 (oom02) score 952 or sacrifice child
> [ 365.255581] Killed process 4578 (oom02) total-vm:151104120kB, anon-rss:131562528kB, file-rss:1300kB, shmem-rss:0kB
> [ 365.266829] out_of_memory: Current (4583) has a pending SIGKILL
> [ 365.267347] oom_reaper: reaped process 4578 (oom02), now anon-rss:131559616kB, file-rss:0kB, shmem-rss:0kB
> [ 365.282658] oom_reaper: reaped process 4583 (oom02), now anon-rss:131561664kB, file-rss:0kB, shmem-rss:0kB
> [ 365.283361] oom02:4586 invoked oom-killer: gfp_mask=0x16040c0(GFP_KERNEL|__GFP_COMP|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.576164] oom02:4585 invoked oom-killer: gfp_mask=0x16080c0(GFP_KERNEL|__GFP_ZERO|__GFP_NOTRACK), nodemask=1, order=0, oom_score_adj=0
> (...snipped...)
> [ 365.576298] [ pid ] uid tgid total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
> [ 365.576338] [ 2421] 0 2421 3241 878 9 3 0 -1000 systemd-udevd
> [ 365.576342] [ 3125] 0 3125 3834 719 9 4 0 -1000 auditd
> [ 365.576347] [ 3309] 0 3309 3332 616 10 3 0 -1000 sshd
> [ 365.576356] [ 4580] 0 4578 37776030 32890417 64258 138 0 0 oom02
> [ 365.576361] Kernel panic - not syncing: Out of memory and no killable processes...
> ----------
>
> Since commit 696453e66630ad45 ("mm, oom: task_will_free_mem should skip
> oom_reaped tasks") changed task_will_free_mem(current) in out_of_memory()
> to return false as soon as MMF_OOM_SKIP is set, many threads sharing the
> victim's mm were not able to try allocation from memory reserves after the
> OOM reaper gave up reclaiming memory.
I think it would be really helpful to outline the race scenario as
suggested in http://lkml.kernel.org/r/20171101135855.bqg2kuj6ao2cicqi@dhcp22.suse.cz
> Until Linux 4.7, we were using
>
> if (current->mm &&
> (fatal_signal_pending(current) || task_will_free_mem(current)))
>
> as a condition to try allocation from memory reserves with the risk of OOM
> lockup, but reports like [1] were impossible. Linux 4.8+ are regressed
> compared to Linux 4.7 due to the risk of needlessly selecting more OOM
> victims.
I still believe that the wording makes the problem bigger than it
actually is and that might confuse people reading it.
> Although commit cd04ae1e2dc8e365 ("mm, oom: do not rely on TIF_MEMDIE for
> memory reserves access") mitigated this regression by not requiring each
> OOM victim thread to call task_will_free_mem(current), some of OOM victim
> threads which are between post __gfp_pfmemalloc_flags(gfp_mask) and pre
> mutex_trylock(&oom_lock) (a race window which that commit cannot close)
> can call out_of_memory() without ever trying ALLOC_OOM allocation.
and this just confuses even more
> Therefore, this patch allows OOM victims to use ALLOC_OOM watermark
> for last second allocation attempt.
Can we have something as simple as
To reduce the above time window this patch allows OOM victims to ue
ALLOC_OOM watermark for the last allocation attempt before invoking the
oom killer.
Other than that the patch looks good to me.
> [1] http://lkml.kernel.org/r/e6c83a26-1d59-4afd-55cf-04e58bdde188@caviumnetworks.com
>
> Fixes: 696453e66630ad45 ("mm, oom: task_will_free_mem should skip oom_reaped tasks")
> Reported-by: Manish Jaggi <mjaggi@...iumnetworks.com>
> Signed-off-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
> Cc: Michal Hocko <mhocko@...e.com>
> Cc: Oleg Nesterov <oleg@...hat.com>
> Cc: Vladimir Davydov <vdavydov.dev@...il.com>
> Cc: David Rientjes <rientjes@...gle.com>
> ---
> mm/page_alloc.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 1607326..45e763e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4111,13 +4111,19 @@ struct page *alloc_pages_before_oomkill(const struct oom_control *oc)
> * !__GFP_NORETRY allocation which will never fail due to oom_lock
> * already held. And since this allocation attempt does not sleep,
> * there is no reason we must use high watermark here.
> + * But anyway, make sure that OOM victims can try ALLOC_OOM watermark
> + * in case they haven't tried ALLOC_OOM watermark.
> */
> int alloc_flags = ALLOC_CPUSET | ALLOC_WMARK_HIGH;
> gfp_t gfp_mask = oc->gfp_mask | __GFP_HARDWALL;
> + int reserve_flags;
>
> if (!oc->ac)
> return NULL;
> gfp_mask &= ~__GFP_DIRECT_RECLAIM;
> + reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
> + if (reserve_flags)
> + alloc_flags = reserve_flags;
> return get_page_from_freelist(gfp_mask, oc->order, alloc_flags, oc->ac);
> }
>
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists