linux-kernel - Re: [PATCH 0/3] OOM detection rework v4

Message-Id: <20151216153513.e432dc70e035e5d07984710c@linux-foundation.org>
Date:	Wed, 16 Dec 2015 15:35:13 -0800
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Michal Hocko <mhocko@...nel.org>
Cc:	Linus Torvalds <torvalds@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Mel Gorman <mgorman@...e.de>,
	David Rientjes <rientjes@...gle.com>,
	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>,
	Hillf Danton <hillf.zj@...baba-inc.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	<linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/3] OOM detection rework v4

On Tue, 15 Dec 2015 19:19:43 +0100 Michal Hocko <mhocko@...nel.org> wrote:

> This is an attempt to make the OOM detection more deterministic and
> easier to follow because each reclaimer basically tracks its own
> progress, which is implemented at the page allocator layer rather than
> spread out between the allocator and the reclaim. More on the
> implementation is described in the first patch.

We've been futzing with this stuff for many years and it still isn't
working well.  This makes me expect that the new implementation will
take a long time to settle in.

To aid and accelerate this process I suggest we lard this code up with
lots of debug info, so when someone reports an issue we have the best
possible chance of understanding what went wrong.

This is easy in the case of oom-too-early - it's all slowpath code and
we can just do printk(everything).  It's not so easy in the case of
oom-too-late-or-never.  The reporter's machine just hangs or it
twiddles thumbs for five minutes then goes oom.  But there are things
we can do here as well, such as:

- add an automatic "nearly oom" detection which detects when things
  start going wrong and turns on diagnostics (this would need an enable
  knob, possibly in debugfs).

- forget about an autodetector and simply add a debugfs knob to turn on
  the diagnostics.

- sprinkle tracepoints everywhere and provide a set of
  instructions/scripts so that people who know nothing about kernel
  internals or tracing can easily gather the info we need to understand
  issues.

- add a sysrq key to turn on diagnostics.  Pretty essential when the
  machine is comatose and doesn't respond to ordinary keystrokes.

- something else
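
To make the "instructions/scripts" item a little more concrete, here is a
rough sketch of what such a helper might look like.  It is only an
illustration, not part of the patch series.  It assumes tracefs is mounted
at /sys/kernel/tracing (older setups use /sys/kernel/debug/tracing), that
the "vmscan" and "oom" event groups are the ones worth enabling, and that
it runs as root.  The names set_events() and capture() are made up for
this example.

```python
import os

# Assumption: tracefs mount point; adjust for the machine being debugged.
TRACEFS = "/sys/kernel/tracing"
# Assumption: event groups of interest for reclaim/OOM problems.
EVENT_GROUPS = ("vmscan", "oom")


def set_events(tracefs, groups, on):
    """Write 1 or 0 to events/<group>/enable; return the groups found."""
    found = []
    for group in groups:
        path = os.path.join(tracefs, "events", group, "enable")
        if not os.path.exists(path):
            continue  # this kernel does not provide the group
        with open(path, "w") as f:
            f.write("1" if on else "0")
        found.append(group)
    return found


def capture(outfile, tracefs=TRACEFS, groups=EVENT_GROUPS):
    """Enable the events and copy trace_pipe to outfile until Ctrl-C."""
    set_events(tracefs, groups, True)
    try:
        pipe = os.path.join(tracefs, "trace_pipe")
        with open(pipe) as src, open(outfile, "w") as dst:
            for line in src:  # blocks until events arrive
                dst.write(line)
                dst.flush()  # keep the file useful if the box hangs
    except KeyboardInterrupt:
        pass
    finally:
        set_events(tracefs, groups, False)
```

A reporter would then only need to run something like
capture("reclaim-trace.txt") while reproducing the problem and attach the
resulting file to the bug report.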

So...  please have a think about it?  What can we add in here to make it
as easy as possible for us (ie: you ;)) to get this code working well? 
At this time, too much developer support code will be better than too
little.  We can take it out later on.