Message-ID: <ZWk0dI0PISWBbbKr@dread.disaster.area>
Date: Fri, 1 Dec 2023 12:18:44 +1100
From: Dave Chinner <david@...morbit.com>
To: Roman Gushchin <roman.gushchin@...ux.dev>
Cc: Kent Overstreet <kent.overstreet@...ux.dev>,
	Qi Zheng <zhengqi.arch@...edance.com>,
	Michal Hocko <mhocko@...e.com>,
	Muchun Song <muchun.song@...ux.dev>,
	Linux-MM <linux-mm@...ck.org>,
	linux-kernel@...r.kernel.org,
	Andrew Morton <akpm@...ux-foundation.org>
Subject: Re: [PATCH 2/7] mm: shrinker: Add a .to_text() method for shrinkers

On Thu, Nov 30, 2023 at 11:01:23AM -0800, Roman Gushchin wrote:
> On Wed, Nov 29, 2023 at 10:21:49PM -0500, Kent Overstreet wrote:
> > On Thu, Nov 30, 2023 at 11:09:42AM +0800, Qi Zheng wrote:
> > > For non-bcachefs developers, who knows what those statistics mean?

For non-mm developers, who knows what those internal mm state statistics
mean? IOWs, a non-mm developer goes and asks a mm developer to help them
decipher the output to determine what to do next. So why can't a mm
developer go and ask a subsystem developer to tell them what the shrinker
oom-kill output means?

Such a question is a demonstration of an unconscious bias that prioritises
internal mm stuff as far more important than what anyone else outside
core-mm might ever need...

> > > You can use BPF or drgn to traverse in advance to get the address of
> > > the bcachefs shrinker structure, and then during OOM, find the
> > > bcachefs private structure through the shrinker->private_data member,
> > > and then dump the bcachefs private data. Is there any problem with
> > > this?
> >
> > No, BPF is not an excuse for improving our OOM/allocation failure
> > reports. BPF/tracing are secondary tools; whenever we're logging
> > information about a problem we should strive to log enough information
> > to debug the issue.
>
> Ok, a simple question then:
> why can't you dump /proc/slabinfo after the OOM?

Taken to its logical conclusion, we arrive at: OOM-kill doesn't need to
output anything at all except for what it killed, because we can dump
/proc/{mem,zone,vmalloc,buddy,slab}info after the OOM....

As it is, even asking such a question shows that you haven't looked at the
OOM-kill output for a long time - it already reports the slab cache usage
information for caches that are reclaimable. That is, if too much accounted
slab cache based memory consumption is detected at OOM-kill, it will call
dump_unreclaimable_slab() to dump all the SLAB_RECLAIM_ACCOUNT caches (i.e.
those with shrinkers) to the console as part of the OOM-kill output.

The problem Kent is trying to address is that this output *isn't sufficient
to debug shrinker based memory reclaim issues*. It hasn't been for a long
time, and so we've all got our own special debug patches and methods for
checking that shrinkers are doing what they are supposed to. Kent is trying
to formalise one of the more useful general methods for exposing that
internal information when OOM occurs...

Indeed, I can think of several uses for a shrinker->to_text() output that
we simply cannot do right now. Any shrinker that does garbage collection on
something that is not a pure slab cache (e.g. xfs buffer cache, xfs inode
gc subsystem, graphics memory allocators, binder, etc) has no visibility of
the actual memory being used by the subsystem in the OOM-kill output. This
information isn't in /proc/slabinfo, it's not accounted by a
SLAB_RECLAIM_ACCOUNT cache, and it's not accounted by anything in the core
mm statistics.
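As a rough illustration of what such a hook could provide, here is a
minimal sketch; the .to_text() callback and its (struct seq_buf *,
struct shrinker *) signature are assumed from the patch series under
discussion, and foo_cache with its counters is entirely made up rather
than code from any real subsystem. Concrete cases where this kind of
reporting matters follow below.

/*
 * Minimal sketch only: the .to_text() callback is assumed from the
 * proposed patches; foo_cache and its counters are hypothetical.
 */
#include <linux/shrinker.h>
#include <linux/seq_buf.h>
#include <linux/atomic.h>

struct foo_cache {
	struct shrinker	*shrinker;
	atomic64_t	nr_handles;	/* objects living in the slab cache */
	atomic64_t	pinned_bytes;	/* page-allocator memory those objects pin */
};

/*
 * Called from the OOM-kill report so the subsystem can say how much
 * memory it is really holding, not just what /proc/slabinfo shows.
 */
static void foo_cache_to_text(struct seq_buf *out, struct shrinker *shrink)
{
	struct foo_cache *c = shrink->private_data;

	seq_buf_printf(out, "handles:       %lld\n",
		       (long long)atomic64_read(&c->nr_handles));
	seq_buf_printf(out, "pinned memory: %lld bytes\n",
		       (long long)atomic64_read(&c->pinned_bytes));
}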
How does anyone other than an XFS expert know that the 500k of active
xfs_buf handles in the slab cache actually pin 15GB of cached metadata
allocated directly from the page allocator, not just the 150MB of slab
cache the handles take up?

Another example: an inode can pin lots of heap memory (e.g. for in-memory
extent lists) that may not be freeable until the inode is reclaimed. So
while the slab cache might not be excessively large, we might have a
million inodes with a billion cumulative extents cached in memory, and it
is the heap memory consumed by the cached extents that is consuming the
30GB of "missing" kernel memory that is causing OOM-kills to occur.

How is a user or developer supposed to know when one of these situations
has occurred, given the current lack of memory usage introspection into
subsystems?

These are the sorts of situations that shrinker->to_text() would allow us
to enumerate when it is necessary (i.e. at OOM-kill). At any other time it
just doesn't matter, but when we're at OOM, having a mechanism to report
somewhat accurate subsystem memory consumption would be very useful indeed.

> Unlike anon memory, slab memory (fs caches in particular) should not be
> heavily affected by killing some userspace task.

Whether tasks get killed or not is completely irrelevant. The issue is that
not all memory that is reclaimed by shrinkers is either pure slab cache
memory or directly accounted as reclaimable to the mm subsystem....

-Dave.
-- 
Dave Chinner
david@...morbit.com
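For completeness, a sketch of how a shrinker exposing such a hook might be
wired up. shrinker_alloc(), shrinker_register() and ->private_data are
current (v6.7+) kernel interfaces; the ->to_text assignment is assumed from
the proposed patches, and foo_cache/foo_cache_to_text() refer to the
hypothetical sketch earlier in this message.

static unsigned long foo_cache_count(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	struct foo_cache *c = shrink->private_data;

	return atomic64_read(&c->nr_handles);
}

static unsigned long foo_cache_scan(struct shrinker *shrink,
				    struct shrink_control *sc)
{
	/* Stub: real reclaim would free handles and the memory they pin. */
	return SHRINK_STOP;
}

static int foo_cache_register_shrinker(struct foo_cache *c)
{
	c->shrinker = shrinker_alloc(0, "foo-cache");
	if (!c->shrinker)
		return -ENOMEM;

	c->shrinker->count_objects = foo_cache_count;
	c->shrinker->scan_objects  = foo_cache_scan;
	c->shrinker->private_data  = c;
	c->shrinker->to_text       = foo_cache_to_text; /* proposed hook, not mainline */

	shrinker_register(c->shrinker);
	return 0;
}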