Message-ID: <20200429090437.GX28637@dhcp22.suse.cz>
Date: Wed, 29 Apr 2020 11:04:37 +0200
From: Michal Hocko <mhocko@...nel.org>
To: Vlastimil Babka <vbabka@...e.cz>
Cc: David Rientjes <rientjes@...gle.com>,
Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org,
Mel Gorman <mgorman@...hsingularity.net>
Subject: Re: [patch] mm, oom: stop reclaiming if GFP_ATOMIC will start
failing soon
On Wed 29-04-20 09:51:39, Vlastimil Babka wrote:
> On 4/28/20 11:48 PM, David Rientjes wrote:
> > On Tue, 28 Apr 2020, Vlastimil Babka wrote:
> >
> > Yes, order-0 reclaim capture is interesting since the issue being reported
> > here is userspace going out to lunch because it loops for an unbounded
> > amount of time trying to get above a watermark where it's allowed to
> > allocate and other consumers are depleting that resource.
> >
> > We actually prefer to oom kill earlier rather than being put in a
> > perpetual state of aggressive reclaim that affects all allocators, and the
> > unbounded nature of those allocations leads to very poor results for
> > everybody.
>
> Sure. My vague impression is that your (and similar cloud companies') kind of
> workloads are designed to maximize machine utilization, and overshooting and
> killing something as a result is no big deal. Then you are perhaps more likely
> to hit this state, and on the other hand, even an occasional premature oom
> kill is not a big deal?
>
> My concern is workloads not designed in such a way, where a premature oom kill
> due to temporarily higher reclaim activity together with a burst of incoming
> network packets will result in e.g. killing an important database. There, the
> tradeoff looks different.
Completely agreed! The in-kernel OOM killer is meant to deal with situations
where memory is desperately depleted without any sign of forward
progress. If there is reclaimable memory, then we are not there yet.
If a workload can benefit from early oom killing based on response time,
then we have facilities to achieve that (e.g. PSI, pressure stall
information).
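For illustration, a response-time policy built on PSI could look roughly like the sketch below. It assumes the trigger interface described in the kernel's Documentation/accounting/psi.rst: open /proc/pressure/memory read-write and non-blocking, write "<some|full> <stall us> <window us>", then poll for POLLPRI. The helper names, the 150ms-per-1s threshold, and what to do when the trigger fires (kill, shed load) are hypothetical choices left to the caller. Registering a trigger typically needs write access to the file (usually root) and CONFIG_PSI.

```python
import os
import select

# System-wide memory pressure file (needs a kernel with CONFIG_PSI).
PSI_MEMORY = "/proc/pressure/memory"


def build_trigger(kind: str, stall_us: int, window_us: int) -> bytes:
    """Encode a PSI trigger: '<some|full> <stall us> <window us>' (hypothetical helper)."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # psi.rst: the tracking window must be between 500ms and 10s.
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall must be positive and not exceed the window")
    # The documented C example writes the terminating NUL too (strlen + 1).
    return f"{kind} {stall_us} {window_us}".encode() + b"\0"


def parse_pressure(text: str) -> dict:
    """Parse 'some avg10=.. avg60=.. avg300=.. total=..' lines into dicts."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        values = {}
        for pair in fields[1:]:
            key, _, val = pair.partition("=")
            # 'total' is cumulative stall time in us; the avgs are percentages.
            values[key] = int(val) if key == "total" else float(val)
        result[fields[0]] = values
    return result


def watch_memory_pressure(on_event, kind="some", stall_us=150_000,
                          window_us=1_000_000, path=PSI_MEMORY):
    """Block forever, calling on_event(current_pressure) each time the trigger fires."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, build_trigger(kind, stall_us, window_us))
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        while True:
            for _, revents in poller.poll():
                if revents & select.POLLERR:
                    raise RuntimeError("PSI event source is gone")
                if revents & select.POLLPRI:
                    # Re-read the averages so the caller can decide what to do.
                    with open(path) as f:
                        on_event(parse_pressure(f.read()))
    finally:
        os.close(fd)
```

Usage would be something like `watch_memory_pressure(lambda p: print(p["some"]["avg10"]))`. The point of the design is that the early-kill decision lives in userspace, driven by measured stall time, rather than in the in-kernel OOM killer.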
--
Michal Hocko
SUSE Labs