Date:   Fri, 1 Dec 2023 12:18:44 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Roman Gushchin <roman.gushchin@...ux.dev>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Michal Hocko <mhocko@...e.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Linux-MM <linux-mm@...ck.org>, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/7] mm: shrinker: Add a .to_text() method for shrinkers

On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > For non-bcachefs developers, who knows what those statistics mean?

For non-mm developers, who knows what those internal mm state
statistics mean?

IOWs, a non-mm developer goes and asks a mm developer to
help them decipher the output to determine what to do next.

So why can't a mm developer go and ask a subsystem developer to tell
them what the shrinker oom-kill output means?

Such a question is a demonstration of an unconscious bias that
treats internal mm stuff as far more important than what anyone
else outside core-mm might ever need...

> > > You can use BPF or drgn to traverse in advance to get the address of the
> > > bcachefs shrinker structure, and then during OOM, find the bcachefs
> > > private structure through the shrinker->private_data member, and then
> > > dump the bcachefs private data. Is there any problem with this?
> > 
> > No, BPF is not an excuse for improving our OOM/allocation failure
> > reports. BPF/tracing are secondary tools; whenever we're logging
> > information about a problem we should strive to log enough information
> > to debug the issue.
> 
> Ok, a simple question then:
> why can't you dump /proc/slabinfo after the OOM?

Taken to its logical conclusion, we arrive at:

	OOM-kill doesn't need to output anything at all except for
	what it killed because we can dump
	/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....

As it is, even asking such a question shows that you haven't looked
at the OOM kill output for a long time - it already reports the slab
cache usage information for caches that are reclaimable.

That is, if too much accounted slab cache memory consumption is
detected at OOM-kill, it will call dump_unreclaimable_slab() to
dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
to the console as part of the OOM-kill output.
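
(For anyone not familiar with that accounting: a cache opts into it
with SLAB_RECLAIM_ACCOUNT at creation time. A minimal sketch - the
names here are made up, this isn't real kernel code:)

#include <linux/init.h>
#include <linux/slab.h>

/* Hypothetical object type, for illustration only. */
struct foo_object {
	unsigned long	key;
	unsigned long	data;
};

static struct kmem_cache *foo_object_cache;

static int __init foo_cache_init(void)
{
	/*
	 * SLAB_RECLAIM_ACCOUNT marks the cache as reclaimable slab,
	 * so its usage is accounted separately from unreclaimable
	 * slab in the mm statistics.
	 */
	foo_object_cache = kmem_cache_create("foo_object",
				sizeof(struct foo_object), 0,
				SLAB_RECLAIM_ACCOUNT, NULL);
	return foo_object_cache ? 0 : -ENOMEM;
}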

The problem Kent is trying to address is that this output *isn't
sufficient to debug shrinker based memory reclaim issues*. It hasn't
been for a long time, and so we've all got our own special debug
patches and methods for checking that shrinkers are doing what they
are supposed to. Kent is trying to formalise one of the more useful
general methods for exposing that internal information when OOM
occurs...
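
To make that concrete, here's roughly what a subsystem hooking the
proposed callback might look like. This is a sketch only - the names
are made up, I'm assuming the .to_text() signature from the patch
(a seq_buf plus the shrinker) and using the new shrinker_alloc()/
shrinker_register() API:

#include <linux/atomic.h>
#include <linux/seq_buf.h>
#include <linux/shrinker.h>

/* Hypothetical subsystem state backing a shrinker. */
struct foo_cache {
	struct shrinker	*shrink;
	atomic_long_t	nr_objects;	/* slab-backed handles */
	atomic_long_t	bytes_pinned;	/* non-slab memory they pin */
};

static unsigned long foo_shrink_count(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct foo_cache *fc = shrink->private_data;

	return atomic_long_read(&fc->nr_objects);
}

static unsigned long foo_shrink_scan(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	/* reclaim up to sc->nr_to_scan objects here... */
	return SHRINK_STOP;
}

/*
 * The proposed hook: report what the subsystem is actually holding,
 * not just what slabinfo can see.
 */
static void foo_shrink_to_text(struct seq_buf *out, struct shrinker *shrink)
{
	struct foo_cache *fc = shrink->private_data;

	seq_buf_printf(out, "objects cached: %ld\n",
		       atomic_long_read(&fc->nr_objects));
	seq_buf_printf(out, "bytes pinned:   %ld\n",
		       atomic_long_read(&fc->bytes_pinned));
}

static int foo_cache_register_shrinker(struct foo_cache *fc)
{
	fc->shrink = shrinker_alloc(0, "foo-cache");
	if (!fc->shrink)
		return -ENOMEM;

	fc->shrink->count_objects = foo_shrink_count;
	fc->shrink->scan_objects  = foo_shrink_scan;
	fc->shrink->to_text	  = foo_shrink_to_text;
	fc->shrink->private_data  = fc;

	shrinker_register(fc->shrink);
	return 0;
}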

Indeed, I can think of several uses for a shrinker->to_text() output
that we simply cannot do right now.

Any shrinker that does garbage collection on something that is not a
pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
graphics memory allocators, binder, etc) has no visibility of the
actual memory being used by the subsystem in the OOM-kill output.
This information isn't in /proc/slabinfo, it's not accounted by a
SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
the core mm statistics.

e.g. How does anyone other than an XFS expert know that the 500k
active xfs_buf handles in the slab cache actually pin 15GB of
cached metadata allocated directly from the page allocator, not just
the 150MB of slab cache the handles take up?

Another example is that an inode can pin lots of heap memory (e.g.
for in-memory extent lists) and that may not be freeable until the
inode is reclaimed. So while the slab cache might not be excessively
large, we might have a million inodes with a billion cumulative
extents cached in memory, and it is the heap memory consumed by the
cached extents that makes up the 30GB of "missing" kernel memory
that is causing OOM-kills to occur.
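
(Purely illustrative - not actual XFS code, names made up - but the
bookkeeping a subsystem needs so its ->to_text() can report that kind
of heap usage is trivial to maintain:)

#include <linux/atomic.h>
#include <linux/slab.h>
#include <linux/types.h>

/* Hypothetical counter of heap memory pinned by cached extents. */
static atomic_long_t foo_extent_bytes = ATOMIC_LONG_INIT(0);

struct foo_extent {
	u64	start;
	u64	len;
};

static struct foo_extent *foo_extent_alloc(void)
{
	struct foo_extent *ext = kmalloc(sizeof(*ext), GFP_NOFS);

	if (ext)
		atomic_long_add(sizeof(*ext), &foo_extent_bytes);
	return ext;
}

static void foo_extent_free(struct foo_extent *ext)
{
	atomic_long_sub(sizeof(*ext), &foo_extent_bytes);
	kfree(ext);
}

/*
 * The inode cache shrinker's ->to_text() would then dump
 * foo_extent_bytes, making the "missing" heap memory visible in
 * the OOM-kill report.
 */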

How is a user or developer supposed to know when one of these
situations has occurred given the current lack of memory usage
introspection into subsystems?

These are the sorts of situations that shrinker->to_text() would
allow us to enumerate when it is necessary (i.e. at OOM-kill). At
any other time, it just doesn't matter, but when we're at OOM having
a mechanism to report somewhat accurate subsystem memory consumption
would be very useful indeed.
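
(How the output would get there: the OOM report just needs to walk
the registered shrinkers and call the ones that provide ->to_text().
A rough sketch of the shape of that hook - the real patchset limits
the output to the worst offenders and deals with the locking and
lifetime details properly:)

/* Sketch only: would live in mm/shrinker.c next to shrink_slab(). */
static void dump_shrinkers_to_text(struct seq_buf *out)
{
	struct shrinker *shrinker;

	rcu_read_lock();
	list_for_each_entry_rcu(shrinker, &shrinker_list, list) {
		if (!shrinker->to_text)
			continue;

		seq_buf_printf(out, "shrinker %ps:\n",
			       shrinker->scan_objects);
		shrinker->to_text(out, shrinker);
	}
	rcu_read_unlock();
}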

> Unlike anon memory, slab memory (fs caches in particular) should not be heavily
> affected by killing some userspace task.

Whether tasks get killed or not is completely irrelevant. The issue
is that not all memory that is reclaimed by shrinkers is either pure
slab cache memory or directly accounted as reclaimable to the mm
subsystem....

-Dave.

-- 
Dave Chinner
david@...morbit.com
