linux-kernel - Re: [PATCH] mm: don't warn about allocations which stall for too long

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20171026114100.tfb3xemvumg2a7su@dhcp22.suse.cz>
Date:   Thu, 26 Oct 2017 13:41:00 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Cc:     akpm@...ux-foundation.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Cong Wang <xiyou.wangcong@...il.com>,
        Dave Hansen <dave.hansen@...el.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Mel Gorman <mgorman@...e.de>, Petr Mladek <pmladek@...e.com>,
        Sergey Senozhatsky <sergey.senozhatsky@...il.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        "yuwang.yuwang" <yuwang.yuwang@...baba-inc.com>
Subject: Re: [PATCH] mm: don't warn about allocations which stall for too long

On Thu 26-10-17 20:28:59, Tetsuo Handa wrote:
> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
> long") was a great step for reducing possibility of silent hang up problem
> caused by memory allocation stalls. But this commit reverts it, for it is
> possible to trigger OOM lockup and/or soft lockups when many threads
> concurrently called warn_alloc() (in order to warn about memory allocation
> stalls) due to current implementation of printk(), and it is difficult to
> obtain useful information due to limitation of synchronous warning
> approach.
> 
> Current printk() implementation flushes all pending logs using the context
> of a thread which called console_unlock(). printk() should be able to flush
> all pending logs eventually unless somebody continues appending to printk()
> buffer.
> 
> Since warn_alloc() started appending to printk() buffer while waiting for
> oom_kill_process() to make forward progress when oom_kill_process() is
> processing pending logs, it became possible for warn_alloc() to force
> oom_kill_process() loop inside printk(). As a result, warn_alloc()
> significantly increased possibility of preventing oom_kill_process() from
> making forward progress.
> 
> ---------- Pseudo code start ----------
> Before warn_alloc() was introduced:
> 
>   retry:
>     if (mutex_trylock(&oom_lock)) {
>       while (atomic_read(&printk_pending_logs) > 0) {
>         atomic_dec(&printk_pending_logs);
>         print_one_log();
>       }
>       // Send SIGKILL here.
>       mutex_unlock(&oom_lock)
>     }
>     goto retry;
> 
> After warn_alloc() was introduced:
> 
>   retry:
>     if (mutex_trylock(&oom_lock)) {
>       while (atomic_read(&printk_pending_logs) > 0) {
>         atomic_dec(&printk_pending_logs);
>         print_one_log();
>       }
>       // Send SIGKILL here.
>       mutex_unlock(&oom_lock)
>     } else if (waited_for_10seconds()) {
>       atomic_inc(&printk_pending_logs);
>     }
>     goto retry;
> ---------- Pseudo code end ----------
> 
> Although waited_for_10seconds() becomes true once per 10 seconds, unbounded
> number of threads can call waited_for_10seconds() at the same time. Also,
> since threads doing waited_for_10seconds() keep doing almost busy loop, the
> thread doing print_one_log() can use little CPU resource. Therefore, this
> situation can be simplified like
> 
> ---------- Pseudo code start ----------
>   retry:
>     if (mutex_trylock(&oom_lock)) {
>       while (atomic_read(&printk_pending_logs) > 0) {
>         atomic_dec(&printk_pending_logs);
>         print_one_log();
>       }
>       // Send SIGKILL here.
>       mutex_unlock(&oom_lock)
>     } else {
>       atomic_inc(&printk_pending_logs);
>     }
>     goto retry;
> ---------- Pseudo code end ----------
> 
> when printk() is called faster than print_one_log() can process a log.
> 
> One of possible mitigation would be to introduce a new lock in order to
> make sure that no other series of printk() (either oom_kill_process() or
> warn_alloc()) can append to printk() buffer when one series of printk()
> (either oom_kill_process() or warn_alloc()) is already in progress. Such
> serialization will also help obtaining kernel messages in readable form.
> 
> ---------- Pseudo code start ----------
>   retry:
>     if (mutex_trylock(&oom_lock)) {
>       mutex_lock(&oom_printk_lock);
>       while (atomic_read(&printk_pending_logs) > 0) {
>         atomic_dec(&printk_pending_logs);
>         print_one_log();
>       }
>       // Send SIGKILL here.
>       mutex_unlock(&oom_printk_lock);
>       mutex_unlock(&oom_lock)
>     } else {
>       if (mutex_trylock(&oom_printk_lock)) {
>         atomic_inc(&printk_pending_logs);
>         mutex_unlock(&oom_printk_lock);
>       }
>     }
>     goto retry;
> ---------- Pseudo code end ----------
> 
> But this commit does not go that direction, for we don't want to introduce
> a new lock dependency, and we unlikely be able to obtain useful information
> even if we serialized oom_kill_process() and warn_alloc().
> 
> Synchronous approach is prone to unexpected results (e.g. too late [1], too
> frequent [2], overlooked [3]). As far as I know, warn_alloc() never helped
> with providing information other than "something is going wrong".
> I want to consider asynchronous approach which can obtain information
> during stalls with possibly relevant threads (e.g. the owner of oom_lock
> and kswapd-like threads) and serve as a trigger for actions (e.g. turn
> on/off tracepoints, ask libvirt daemon to take a memory dump of stalling
> KVM guest for diagnostic purpose).
> 
> This commit temporarily looses ability to report e.g. OOM lockup due to
> unable to invoke the OOM killer due to !__GFP_FS allocation request.
> But asynchronous approach will be able to detect such situation and emit
> warning. Thus, let's remove warn_alloc().
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
> [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@mail.gmail.com
> [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever"))
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
> Reported-by: Cong Wang <xiyou.wangcong@...il.com>
> Reported-by: yuwang.yuwang <yuwang.yuwang@...baba-inc.com>
> Reported-by: Johannes Weiner <hannes@...xchg.org>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Vlastimil Babka <vbabka@...e.cz>
> Cc: Mel Gorman <mgorman@...e.de>
> Cc: Dave Hansen <dave.hansen@...el.com>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@...il.com>
> Cc: Petr Mladek <pmladek@...e.com>

The changelog is a bit excessive but it points out the real problem is
that printk is simply not ready for the sync warning which is good to
document and I hope that this will get addressed in a foreseeable future.
For the mean time it is simply better to remove the warning rather than
risk more issues.

Acked-by: Michal Hocko <mhocko@...e.com>

> ---
>  mm/page_alloc.c | 10 ----------
>  1 file changed, 10 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 97687b3..a4edfba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3856,8 +3856,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	enum compact_result compact_result;
>  	int compaction_retries;
>  	int no_progress_loops;
> -	unsigned long alloc_start = jiffies;
> -	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
>  	int reserve_flags;
>  
> @@ -3989,14 +3987,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (!can_direct_reclaim)
>  		goto nopage;
>  
> -	/* Make sure we know about allocations which stall for too long */
> -	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> -			"page allocation stalls for %ums, order:%u",
> -			jiffies_to_msecs(jiffies-alloc_start), order);
> -		stall_timeout += 10 * HZ;
> -	}
> -
>  	/* Avoid recursion of direct reclaim */
>  	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
> -- 
> 1.8.3.1

-- 
Michal Hocko
SUSE Labs