linux-kernel - Re: [PATCH 2/7] mm: shrinker: Add a .to

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZXAtxBKZmKhFxwYB@dread.disaster.area>
Date:   Wed, 6 Dec 2023 19:16:04 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Roman Gushchin <roman.gushchin@...ux.dev>
Cc:     Kent Overstreet <kent.overstreet@...ux.dev>,
        Qi Zheng <zhengqi.arch@...edance.com>,
        Michal Hocko <mhocko@...e.com>,
        Muchun Song <muchun.song@...ux.dev>,
        Linux-MM <linux-mm@...ck.org>, linux-kernel@...r.kernel.org,
        Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/7] mm: shrinker: Add a .to_text() method for shrinkers

On Fri, Dec 01, 2023 at 12:01:33PM -0800, Roman Gushchin wrote:
> On Fri, Dec 01, 2023 at 12:18:44PM +1100, Dave Chinner wrote:
> > On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> > > On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > > > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > > > For non-bcachefs developers, who knows what those statistics mean?
> > 
> > > Ok, a simple question then:
> > > why can't you dump /proc/slabinfo after the OOM?
> > 
> > Taken to it's logical conclusion, we arrive at:
> > 
> > 	OOM-kill doesn't need to output anything at all except for
> > 	what it killed because we can dump
> > 	/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....
> > 
> > As it is, even asking such a question shows that you haven't looked
> > at the OOM kill output for a long time - it already reports the slab
> > cache usage information for caches that are reclaimable.
> > 
> > That is, if too much accounted slab cache based memory consumption
> > is detected at OOM-kill, it will calldump_unreclaimable_slab() to
> > dump all the SLAB_RECLAIM_ACCOUNT caches (i.e. those with shrinkers)
> > to the console as part of the OOM-kill output.
> 
> You are right, I missed that, partially because most of OOM's I had to deal
> with recently were memcg OOM's.
> 
> This changes my perspective at Kent's patches, if we dump this information
> already, it might be not a bad idea to do it nicer. So I take my words back
> here.
> 
> > 
> > The problem Kent is trying to address is that this output *isn't
> > sufficient to debug shrinker based memory reclaim issues*. It hasn't
> > been for a long time, and so we've all got our own special debug
> > patches and methods for checking that shrinkers are doing what they
> > are supposed to. Kent is trying to formalise one of the more useful
> > general methods for exposing that internal information when OOM
> > occurs...
> > 
> > Indeed, I can think of several uses for a shrinker->to_text() output
> > that we simply cannot do right now.
> > 
> > Any shrinker that does garbage collection on something that is not a
> > pure slab cache (e.g. xfs buffer cache, xfs inode gc subsystem,
> > graphics memory allocators, binder, etc) has no visibility of the
> > actuall memory being used by the subsystem in the OOM-kill output.
> > This information isn't in /proc/slabinfo, it's not accounted by a
> > SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in
> > the core mm statistics.
> > 
> > e.g. How does anyone other than a XFS expert know that the 500k of
> > active xfs_buf handles in the slab cache actually pins 15GB of
> > cached metadata allocated directly from the page allocator, not just
> > the 150MB of slab cache the handles take up?
> > 
> > Another example is that an inode can pin lots of heap memory (e.g.
> > for in-memory extent lists) and that may not be freeable until the
> > inode is reclaimed. So while the slab cache might not be excesively
> > large, we might have an a million inodes with a billion cumulative
> > extents cached in memory and it is the heap memory consumed by the
> > cached extents that is consuming the 30GB of "missing" kernel memory
> > that is causing OOM-kills to occur.
> > 
> > How is a user or developer supposed to know when one of these
> > situations has occurred given the current lack of memory usage
> > introspection into subsystems?
> 
> What would be the proper solution to this problem from your point of view?
> What functionality/API mm can provide to make the life of fs developers
> better here?

What can we do better?

The first thing we can do better that comes to mind is to merge
Kent's patches that allow the shrinker owner to output debug
information when requested by the infrastructure.

Then we - the shrinker implementers - have some control of our own
destiny.  We can add whatever we need to solve shrinker and OOM
problems realted to our shrinkers not doing the right thing.

But without that callout from the infrastructure and the
infrastructure to drive it at appropriate times, we will make zero
progress improving the situation. 

Yes, the code may not be perfect and, yes, it may not be useful to
mm developers, but for the people who have to debug shrinker related
problems in production systems we need all the help we can get. We
certainly don't care if it isn't perfect, just having something we
can partially tailor to our iindividual needs is far, far better
than the current situation of nothing at all...

-Dave.
-- 
Dave Chinner
david@...morbit.com