linux-kernel - Re: [PATCH 00/10] OOM Debug print selection and additional information

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20190828065955.GB7386@dhcp22.suse.cz>
Date:   Wed, 28 Aug 2019 08:59:55 +0200
From:   Michal Hocko <mhocko@...nel.org>
To:     Edward Chron <echron@...sta.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <guro@...com>,
        Johannes Weiner <hannes@...xchg.org>,
        David Rientjes <rientjes@...gle.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Shakeel Butt <shakeelb@...gle.com>, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, Ivan Delalande <colona@...sta.com>
Subject: Re: [PATCH 00/10] OOM Debug print selection and additional
 information

On Tue 27-08-19 18:07:54, Edward Chron wrote:
> On Tue, Aug 27, 2019 at 12:15 AM Michal Hocko <mhocko@...nel.org> wrote:
> >
> > On Mon 26-08-19 12:36:28, Edward Chron wrote:
> > [...]
> > > Extensibility using OOM debug options
> > > -------------------------------------
> > > What is needed is an extensible system to optionally configure
> > > debug options as needed and to then dynamically enable and disable
> > > them. Also for options that produce multiple lines of entry based
> > > output, to configure which entries to print based on how much
> > > memory they use (or optionally all the entries).
> >
> > With a patch this large and adding a lot of new stuff we need a more
> > detailed usecases described I believe.
> 
> I guess it would make sense to explain motivation for each OOM Debug
> option I've sent separately.
> I see there comments on the patches I will try and add more information there.
> 
> An overview would be that we've been collecting information on OOM's
> over the last 12 years or so.
> These are from switches, other embedded devices, servers both large and small.
> We ask for feedback on what information was helpful or could be helpful.
> We try and add it to make root causing issues easier.
> 
> These OOM debug options are some of the options we've created.
> I didn't port all of them to 5.3 but these are representative.
> Our latest is kernel is a bit behind 5.3.
> 
> >
> >
> > [...]
> >
> > > Use of debugfs to allow dynamic controls
> > > ----------------------------------------
> > > By providing a debugfs interface that allows options to be configured,
> > > enabled and where appropriate to set a minimum size for selecting
> > > entries to print, the output produced when an OOM event occurs can be
> > > dynamically adjusted to produce as little or as much detail as needed
> > > for a given system.
> >
> > Who is going to consume this information and why would that consumer be
> > unreasonable to demand further maintenance of that information in future
> > releases? In other words debugfs is not considered a stableAPI which is
> > OK here but the side effect of any change to these files results in user
> > visible behavior and we consider that more or less a stable as long as
> > there are consumers.
> >
> > > OOM debug options can be added to the base code as needed.
> > >
> > > Currently we have the following OOM debug options defined:
> > >
> > > * System State Summary
> > >   --------------------
> > >   One line of output that includes:
> > >   - Uptime (days, hour, minutes, seconds)
> >
> > We do have timestamps in the log so why is this needed?
> 
> 
> Here is how an OOM report looks when we get it to look at:
> 
> Aug 26 09:06:34 coronado kernel: oomprocs invoked oom-killer:
> gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0,
> oom_score_adj=1000
> Aug 26 09:06:34 coronado kernel: CPU: 1 PID: 2795 Comm: oomprocs Not
> tainted 5.3.0-rc6+ #33
> Aug 26 09:06:34 coronado kernel: Hardware name: Compulab Ltd.
> IPC3/IPC3, BIOS 5.12_IPC3K.PRD.0.25.7 08/09/2018
> 
> This shows the date and time, not time of the last boot. The
> /var/log/messages output is what we often have to look at not raw
> dmesgs.

This looks more like a configuration of the logging than a kernel
problem. Kernel does provide timestamps for logs. E.g.
$ tail -n1 /var/log/kern.log
Aug 28 08:27:46 tiehlicka kernel: <1054>[336340.954345] systemd-udevd[7971]: link_config: autonegotiation is unset or enabled, the speed and duplex are not writable.

[...]
> > >   Example output when configured and enabled:
> > >
> > > Jul 22 15:20:57 yoursystem kernel: Threads:530 Processes:279 forks_since_boot:2786 procs_runable:2 procs_iowait:0
> > >
> > > * ARP Table and/or Neighbour Discovery Table Summary
> > >   --------------------------------------------------
> > >   One line of output each for ARP and ND that includes:
> > >   - Table name
> > >   - Table size (max # entries)
> > >   - Key Length
> > >   - Entry Size
> > >   - Number of Entries
> > >   - Last Flush (in seconds)
> > >   - hash grows
> > >   - entry allocations
> > >   - entry destroys
> > >   - Number lookups
> > >   - Number of lookup hits
> > >   - Resolution failures
> > >   - Garbage Collection Forced Runs
> > >   - Table Full
> > >   - Proxy Queue Length
> > >
> > >   Example output when configured and enabled (for both):
> > >
> > > ... kernel: neighbour: Table: arp_tbl size:   256 keyLen:  4 entrySize: 360 entries:     9 lastFlush:  1721s hGrows:     1 allocs:     9 destroys:     0 lookups:   204 hits:   199 resFailed:    38 gcRuns/Forced: 111 /  0 tblFull:  0 proxyQlen:  0
> > >
> > > ... kernel: neighbour: Table:  nd_tbl size:   128 keyLen: 16 entrySize: 368 entries:     6 lastFlush:  1720s hGrows:     0 allocs:     7 destroys:     1 lookups:     0 hits:     0 resFailed:     0 gcRuns/Forced: 110 /  0 tblFull:  0 proxyQlen:  0
> >
> > Again, why is this needed particularly for the OOM event? I do
> > understand this might be useful system health diagnostic information but
> > how does this contribute to the OOM?
> >
> 
> It is example of some system table information we print.
> Other adjustable table information may be useful as well.
> These table sizes are often adjustable and collecting stats on usage
> helps determine if settings are appropriate.
> The value during OOM events is very useful as usage varies.
> We also collect the same stats like this from user code periodically
> and can compare these.

I suspect that this is a very narrow usecase and there are more like
that and I can imagine somebody with a different workload could come up
with yet another set of useful information to print. The more I think of these
additional modules the more I am convinced that this "plugin" architecture
is a wrong approach. Why? Mostly because all the code maintenance burden
is likely to be not worth all the niche usecase. This all has to be more
dynamic and ideally scriptable so that the code in the kernel just
provides the basic information and everybody can just hook in there and
dump whatever additional information is needed. Sounds like something
that eBPF could fit in, no? Have you considered that?

[...]

Skipping over many useful stuff. I can reassure you that my experience
with OOM debugging has been a real pain at times (e.g. when there is
simply no way to find out who has eaten all the memory because it is not
accounted anywhere) as well and I completely understand where you are
coming from. There is definitely a room for improvements we just have to
find a way how to get there.

Thanks!
-- 
Michal Hocko
SUSE Labs