[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171026114100.tfb3xemvumg2a7su@dhcp22.suse.cz>
Date: Thu, 26 Oct 2017 13:41:00 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc: akpm@...ux-foundation.org, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>,
Dave Hansen <dave.hansen@...el.com>,
Johannes Weiner <hannes@...xchg.org>,
Mel Gorman <mgorman@...e.de>, Petr Mladek <pmladek@...e.com>,
Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
Vlastimil Babka <vbabka@...e.cz>,
"yuwang.yuwang" <yuwang.yuwang@...baba-inc.com>
Subject: Re: [PATCH] mm: don't warn about allocations which stall for too long
On Thu 26-10-17 20:28:59, Tetsuo Handa wrote:
> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
> long") was a great step for reducing possibility of silent hang up problem
> caused by memory allocation stalls. But this commit reverts it, for it is
> possible to trigger OOM lockup and/or soft lockups when many threads
> concurrently called warn_alloc() (in order to warn about memory allocation
> stalls) due to current implementation of printk(), and it is difficult to
> obtain useful information due to limitation of synchronous warning
> approach.
>
> Current printk() implementation flushes all pending logs using the context
> of a thread which called console_unlock(). printk() should be able to flush
> all pending logs eventually unless somebody continues appending to printk()
> buffer.
>
> Since warn_alloc() started appending to printk() buffer while waiting for
> oom_kill_process() to make forward progress when oom_kill_process() is
> processing pending logs, it became possible for warn_alloc() to force
> oom_kill_process() loop inside printk(). As a result, warn_alloc()
> significantly increased possibility of preventing oom_kill_process() from
> making forward progress.
>
> ---------- Pseudo code start ----------
> Before warn_alloc() was introduced:
>
> retry:
> if (mutex_trylock(&oom_lock)) {
> while (atomic_read(&printk_pending_logs) > 0) {
> atomic_dec(&printk_pending_logs);
> print_one_log();
> }
> // Send SIGKILL here.
> mutex_unlock(&oom_lock)
> }
> goto retry;
>
> After warn_alloc() was introduced:
>
> retry:
> if (mutex_trylock(&oom_lock)) {
> while (atomic_read(&printk_pending_logs) > 0) {
> atomic_dec(&printk_pending_logs);
> print_one_log();
> }
> // Send SIGKILL here.
> mutex_unlock(&oom_lock)
> } else if (waited_for_10seconds()) {
> atomic_inc(&printk_pending_logs);
> }
> goto retry;
> ---------- Pseudo code end ----------
>
> Although waited_for_10seconds() becomes true once per 10 seconds, unbounded
> number of threads can call waited_for_10seconds() at the same time. Also,
> since threads doing waited_for_10seconds() keep doing almost busy loop, the
> thread doing print_one_log() can use little CPU resource. Therefore, this
> situation can be simplified like
>
> ---------- Pseudo code start ----------
> retry:
> if (mutex_trylock(&oom_lock)) {
> while (atomic_read(&printk_pending_logs) > 0) {
> atomic_dec(&printk_pending_logs);
> print_one_log();
> }
> // Send SIGKILL here.
> mutex_unlock(&oom_lock)
> } else {
> atomic_inc(&printk_pending_logs);
> }
> goto retry;
> ---------- Pseudo code end ----------
>
> when printk() is called faster than print_one_log() can process a log.
>
> One of possible mitigation would be to introduce a new lock in order to
> make sure that no other series of printk() (either oom_kill_process() or
> warn_alloc()) can append to printk() buffer when one series of printk()
> (either oom_kill_process() or warn_alloc()) is already in progress. Such
> serialization will also help obtaining kernel messages in readable form.
>
> ---------- Pseudo code start ----------
> retry:
> if (mutex_trylock(&oom_lock)) {
> mutex_lock(&oom_printk_lock);
> while (atomic_read(&printk_pending_logs) > 0) {
> atomic_dec(&printk_pending_logs);
> print_one_log();
> }
> // Send SIGKILL here.
> mutex_unlock(&oom_printk_lock);
> mutex_unlock(&oom_lock)
> } else {
> if (mutex_trylock(&oom_printk_lock)) {
> atomic_inc(&printk_pending_logs);
> mutex_unlock(&oom_printk_lock);
> }
> }
> goto retry;
> ---------- Pseudo code end ----------
>
> But this commit does not go that direction, for we don't want to introduce
> a new lock dependency, and we unlikely be able to obtain useful information
> even if we serialized oom_kill_process() and warn_alloc().
>
> Synchronous approach is prone to unexpected results (e.g. too late [1], too
> frequent [2], overlooked [3]). As far as I know, warn_alloc() never helped
> with providing information other than "something is going wrong".
> I want to consider asynchronous approach which can obtain information
> during stalls with possibly relevant threads (e.g. the owner of oom_lock
> and kswapd-like threads) and serve as a trigger for actions (e.g. turn
> on/off tracepoints, ask libvirt daemon to take a memory dump of stalling
> KVM guest for diagnostic purpose).
>
> This commit temporarily looses ability to report e.g. OOM lockup due to
> unable to invoke the OOM killer due to !__GFP_FS allocation request.
> But asynchronous approach will be able to detect such situation and emit
> warning. Thus, let's remove warn_alloc().
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
> [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
> [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
> Reported-by: Cong Wang <xiyou.wangcong@...il.com>
> Reported-by: yuwang.yuwang <yuwang.yuwang@...baba-inc.com>
> Reported-by: Johannes Weiner <hannes@...xchg.org>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Dave Hansen <dave.hansen@...el.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@...il.com>
> Cc: Petr Mladek <pmladek@...e.com>
The changelog is a bit excessive but it points out the real problem is
that printk is simply not ready for the sync warning which is good to
document and I hope that this will get addressed in a foreseeable future.
For the mean time it is simply better to remove the warning rather than
risk more issues.
Acked-by: Michal Hocko <mhocko@...e.com>
> ---
> mm/page_alloc.c | 10 ----------
> 1 file changed, 10 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 97687b3..a4edfba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3856,8 +3856,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> enum compact_result compact_result;
> int compaction_retries;
> int no_progress_loops;
> - unsigned long alloc_start = jiffies;
> - unsigned int stall_timeout = 10 * HZ;
> unsigned int cpuset_mems_cookie;
> int reserve_flags;
>
> @@ -3989,14 +3987,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> if (!can_direct_reclaim)
> goto nopage;
>
> - /* Make sure we know about allocations which stall for too long */
> - if (time_after(jiffies, alloc_start + stall_timeout)) {
> - warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> - "page allocation stalls for %ums, order:%u",
> - jiffies_to_msecs(jiffies-alloc_start), order);
> - stall_timeout += 10 * HZ;
> - }
> -
> /* Avoid recursion of direct reclaim */
> if (current->flags & PF_MEMALLOC)
> goto nopage;
> --
> 1.8.3.1
--
Michal Hocko
SUSE Labs
Powered by blists - more mailing lists