linux-kernel - Re: OOM killer not nearly agressive enough?

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20200109212516.GA23620@dhcp22.suse.cz>
Date:   Thu, 9 Jan 2020 22:25:16 +0100
From:   Michal Hocko <mhocko@...nel.org>
To:     Pavel Machek <pavel@....cz>
Cc:     kernel list <linux-kernel@...r.kernel.org>,
        Andrew Morton <akpm@...l.org>, linux-mm@...ck.org,
        akpm@...ux-foundation.org
Subject: Re: OOM killer not nearly agressive enough?

On Thu 09-01-20 22:03:07, Pavel Machek wrote:
> On Thu 2020-01-09 12:56:33, Michal Hocko wrote:
> > On Tue 07-01-20 21:44:12, Pavel Machek wrote:
> > > Hi!
> > > 
> > > I updated my userspace to x86-64, and now chromium likes to eat all
> > > the memory and bring the system to standstill.
> > > 
> > > Unfortunately, OOM killer does not react:
> > > 
> > > I'm now running "ps aux", and it prints one line every 20 seconds or
> > > more. Do we agree that is "unusable" system? I attempted to do kill
> > > from other session.
> > 
> > Does sysrq+f help?
> 
> May try that next time.
> 
> > > Do we agree that OOM killer should have reacted way sooner?
> > 
> > This is impossible to answer without knowing what was going on at the
> > time. Was the system threshing over page cache/swap? In other words, is
> > the system completely out of memory or refaulting the working set all
> > the time because it doesn't fit into memory?
> 
> Swap was full, so "completely out of memory", I guess. Chromium does
> that fairly often :-(.

The oom heuristic is based on the reclaim failure. If the reclaim makes
some progress then the oom killer is not hit. Have a look at
should_reclaim_retry for more details.

> > > Is there something I can tweak to make it behave more reasonably?
> > 
> > PSI based early OOM killing might help. See https://github.com/facebookincubator/oomd
> 
> Um. Before doing that... is there some knob somewhere saying "hey
> oomkiller, one hour to recover machine is a bit too much, can you
> please react sooner"?

No, there is nothing like that.

> PSI is completely different system, but I guess
> I should attempt to tweak the existing one first...

PSI is measuring the cost of the allocation (among other things) and
that can give you some idea on how much time is spent to get memory.
Userspace can implement a policy based on that and act. The kernel oom
killer is the last resort when there is really no memory to allocate.
-- 
Michal Hocko
SUSE Labs