Message-ID: <YmioDchdnpIh8HlC@carbon>
Date: Tue, 26 Apr 2022 19:18:53 -0700
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Dave Chinner <dchinner@...hat.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, Yang Shi <shy828301@...il.com>,
Kent Overstreet <kent.overstreet@...il.com>,
Hillf Danton <hdanton@...a.com>
Subject: Re: [PATCH v2 0/7] mm: introduce shrinker debugfs interface
On Wed, Apr 27, 2022 at 11:22:55AM +1000, Dave Chinner wrote:
> On Tue, Apr 26, 2022 at 09:41:34AM -0700, Roman Gushchin wrote:
> > Can you, please, summarize your position, because it's a bit unclear.
> > You made a lot of good points about some details (e.g. shrinkers naming,
> > and I totally agree there; machines with hundreds of nodes etc), then
> > you said the active scanning is useless and then said the whole thing
> > is useless and we're fine with what we have regarding shrinkers debugging.
>
> Better introspection is the first thing we need. Work on improving
> that. I've been making suggestions to help improve the introspection
> infrastructure.
>
> Before anything else, we need to improve introspection so we can
> gain better insight into the problems we have. Once we understand
> the problems better and have evidence to back up where the problems
> lie and we have a plan to solve them, then we can talk about whether
> we need other user accessible shrinker APIs.
Ok, at least we do agree here.
This is exactly why I've started with this debugfs stuff.
>
> For the moment, exposing shrinker control interfaces to userspace
> could potentially be very bad because it exposes internal
> architectural and implementation details to a user API. Just
> because it is in /sys/kernel/debug it doesn't mean applications
> won't start to use it and build dependencies on it.
>
> That doesn't mean I'm opposed to exposing a shrinker control
> mechanism to debugfs - I'm still on the fence on that one. However,
> I definitely think that an API that directly exposes the internal
> implementation to userspace is the wrong way to go about this.
Ok, if the concern is that the interface mirrors the internal split
between memcg-aware and system-wide shrinkers, I can agree here as well.
I actually made an attempt to unify memcg-aware and system-wide
shrinker scanning; it hasn't been very successful yet, but it's
definitely on my todo list. I'm pretty sure we keep iterating over
the same empty root-level shrinkers over and over, without benefiting
from the bitmap infrastructure that already works for memory cgroups.
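To make the asymmetry concrete: the root-level path in shrink_slab()
today walks the whole shrinker_list unconditionally, while
shrink_slab_memcg() only visits shrinkers whose bit is set in the
per-memcg shrinker_info map. Roughly (simplified from memory, locking
and flag checks elided):

	/* root level: every registered shrinker gets called */
	list_for_each_entry(shrinker, &shrinker_list, list) {
		struct shrink_control sc = {
			.gfp_mask = gfp_mask,
			.nid = nid,
			.memcg = memcg,
		};

		ret = do_shrink_slab(&sc, shrinker, priority);
		if (ret == SHRINK_EMPTY)
			ret = 0;
		freed += ret;
	}

	/* memcg level: only shrinkers which reported freeable objects */
	for_each_set_bit(i, info->map, shrinker_nr_max) {
		...
	}

The unification I have in mind is essentially giving the root level an
equivalent bitmap, so that empty shrinkers can be skipped there as well.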
>
> Fine grained shrinker control is not necessary to improve shrinker
> introspection and OOM debugging capability, so if you want/need
> control interfaces then I think you should separate those out into a
> separate line of development where it doesn't derail the discussion
> on how to improve shrinker/OOM introspection.
Ok, no problems here. Btw, the OOM debugging is a separate topic brought
in by Kent; I'd keep it separate too, as it comes with many OOM-specific
complications.
From another of your emails:
> So, yeah, you need to think about how to do fine-grained access to
> shrinker stats effectively. That might require a complete change of
> presentation API. For example, changing the filesystem layout to be
> memcg centric rather than shrinker instance centric would make an
> awful lot of this file parsing problem go away.
>
> e.g:
>
> /sys/kernel/debug/mm/memcg/<memcg instance>/shrinker/<shrinker instance>/stats
The problem with this approach (I thought about it) is that it comes
with a high memory overhead, especially on a machine with thousands of
cgroups and mount points: e.g. a few thousand memcgs times a few hundred
super-block shrinkers is already on the order of a million debugfs files.
And besides the memory overhead, it's really expensive to collect
system-wide data and get the big picture, as it requires opening and
reading thousands of files.
Actually, you wrote recently:
"I've thought about it, too, and can see where it could be useful.
However, when I consider the list_lru memcg integration, I suspect
it becomes a "can't see the forest for the trees" problem. We're
going to end up with millions of sysfs objects with no obvious way
to navigate, iterate or search them if we just take the naive "sysfs
object + stats per list_lru instance" approach."
It all makes me think we need both: a way to iterate over all memcgs and
dump all the numbers at once, and a way to get a specific per-memcg
(per-node) count.
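Just to illustrate what I mean by "dump all the numbers at once" (a
rough sketch only; the function name is made up, and locking, error
handling, online checks and the !SHRINKER_MEMCG_AWARE case are all
ignored): a single per-shrinker debugfs file could print one line per
memcg with the per-node counts, something like

	static int shrinker_debugfs_count_show(struct seq_file *m, void *v)
	{
		/* sketch: assume private was set to the shrinker at file creation */
		struct shrinker *shrinker = m->private;
		struct shrink_control sc = { .gfp_mask = GFP_KERNEL };
		struct mem_cgroup *memcg;
		int nid;

		memcg = mem_cgroup_iter(NULL, NULL, NULL);
		do {
			/* one line per memcg: "<cgroup ino> <node0 count> <node1 count> ..." */
			seq_printf(m, "%lu", memcg ?
				   (unsigned long)cgroup_ino(memcg->css.cgroup) : 0UL);

			sc.memcg = memcg;
			for_each_node(nid) {
				sc.nid = nid;
				seq_printf(m, " %lu",
					   shrinker->count_objects(shrinker, &sc));
			}
			seq_putc(m, '\n');
		} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)) != NULL);

		return 0;
	}

Then getting the big picture is a single read, and the per-memcg
(per-node) question becomes grepping one file (or reading a separate,
narrower file) instead of opening thousands of them.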
Thanks!