linux-kernel - Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220422083037.3pjdrusrn54fmfdf@moria.home.lan>
Date:   Fri, 22 Apr 2022 04:30:37 -0400
From:   Kent Overstreet <kent.overstreet@...il.com>
To:     Michal Hocko <mhocko@...e.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, roman.gushchin@...ux.dev,
        hannes@...xchg.org
Subject: Re: [PATCH 3/4] mm: Centralize & improve oom reporting in show_mem.c

On Fri, Apr 22, 2022 at 10:03:36AM +0200, Michal Hocko wrote:
> On Thu 21-04-22 14:42:13, Kent Overstreet wrote:
> > On Thu, Apr 21, 2022 at 11:18:20AM +0200, Michal Hocko wrote:
> [...]
> > > > 00177 16644 pages reserved
> > > > 00177 Unreclaimable slab info:
> > > > 00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
> > > > 00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
> > > > 00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
> > > > 00177 task_struct       total: 1.95 MiB active: 1.95 MiB
> > > > 00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
> > > > 00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
> > > > 00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
> > > > 00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
> > > > 00177 perf_event        total: 1.02 MiB active: 1.02 MiB
> > > > 00177 biovec-max        total: 992 KiB active: 960 KiB
> > > > 00177 Shrinkers:
> > > > 00177 super_cache_scan: objects: 127
> > > > 00177 super_cache_scan: objects: 106
> > > > 00177 jbd2_journal_shrink_scan: objects: 32
> > > > 00177 ext4_es_scan: objects: 32
> > > > 00177 bch2_btree_cache_scan: objects: 8
> > > > 00177   nr nodes:          24
> > > > 00177   nr dirty:          0
> > > > 00177   cannibalize lock:  0000000000000000
> > > > 00177 
> > > > 00177 super_cache_scan: objects: 8
> > > > 00177 super_cache_scan: objects: 1
> > > 
> > > How does this help to analyze this allocation failure?
> > 
> > You asked for an example of the output, which was an entirely reasonable
> > request. Shrinkers weren't responsible for this OOM, so it doesn't help here -
> 
> OK, do you have an example where it clearly helps?

I've debugged quite a few issues with shrinkers over the years where this would
have helped a lot (especially if it was also in sysfs), although nothing
currently. I was just talking with Dave earlier tonight about more things that
could be added for shrinkers, but I'm going to have to go over that conversation
again and take notes.

Also, I feel I have to point out that OOM & memory reclaim debugging is an area
where many filesystem developers feel that the MM people have been dropping the
ball, and your initial response to this patch series...  well, it feels like
more of the same.

Still does to be honest, you're coming across like I haven't been working in
this area for a decade+ and don't know what I'm touching. Really, I'm not new to
this stuff.

> > are you asking me to explain why shrinkers are relevant to OOMs and memory
> > reclaim...?
> 
> No, not really, I guess that is quite clear. The thing is that the oom
> report is quite bloated already and we should be rather picky on what to
> dump there. Your above example is a good one here. You have an order-5
> allocation failure and that can be caused by almost anything. Compaction
> not making progress for many reasons - e.g. internal framentation caused
> by pinned pages but also kmalloc allocations. The above output doesn't
> help with any of that. Could shrinkers operation be related? Of course
> it could but how can I tell?

Yeah sure and internal fragmentation would actually be an _excellent_ thing to
add to the show_mem report.

> We already dump slab data when the slab usage is excessive for the oom
> killer report and that was a very useful addition in many cases and it
> is bound to cases where slab consumption could be the primary source of
> the OOM condition.
> 
> That being said the additional output should be at least conditional and
> reported when there is a chance that it could help with analysis.

These things don't need to be conditional if we're more carefully selective
about how we report, instead of just dumping everything like we currently do
with slab info.

We don't need to report on all the slabs that are barely used - if you'll read
my patch and example output, which changes it to the top 10 slabs by memory
usage.

I feel like I keep repeating myself here. It would help if you would make more
of an effort to follow the full patch series and descriptions before commenting
generically.

> > Since shrinkers own and, critically, _are responsible for freeing memory_, a
> > shrinker not giving up memory when asked (or perhaps not being asked to give up
> > memory) can cause an OOM. A starting point - not an end - if we want to improve
> > OOM debugging is at least being able to see how much memory each shrinker owns.
> > Since we don't even have that, number of objects will have to do.
> > 
> > The reason for adding the .to_text() callback is that shrinkers have internal
> > state that affects whether they are able to give up objects when asked - the
> > primary example being object dirtyness.
> > 
> > If your system is using a ton of memory caching inodes, and something's wedged
> > writeback, and they're nearly all dirty - you're going to have a bad day.
> > 
> > The bcachefs btree node node shrinker included as an example of what we can do
> > with this: internally we may have to allocate new btree nodes by reclaiming from
> > our own cache, and we have a lock to prevent multiple threads from doing this at
> > the same time, and this lock also blocks the shrinker from freeing more memory
> > until we're done.
> > 
> > In filesystem land, debugging memory reclaim issues is a rather painful topic
> > that often comes up, this is a starting point...
> 
> I completely understand the frustration. I've been analyzing oom reports
> for years and I can tell that the existing report is quite good but
> in many cases the information we provide is still insufficient. My
> experience also tells me that those cases are usually very special and
> a specific data dumped for them wouldn't be all that useful in the
> majority of cases.
> 
> If we are lucky enough the oom is reproducible and additional
> tracepoints (or whatever your prefer to use) tell us more. Far from
> optimal, no question about that but I do not have a good answer on
> where the trashhold should really be. Maybe we can come up with some
> trigger based mechanism (e.g. some shrinkers are failing so they
> register their debugging data which will get dumped on the OOM) which
> would enable certain debugging information or something like that.

Why would we need a trigger mechanism?

Could you explain your objection to simply unconditionally dumping the top 10
slabs and the top 10 shrinkers?