Date:   Wed, 21 Mar 2018 01:34:52 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     Michal Hocko <mhocko@...nel.org>, darrick.wong@...cle.com,
        linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
        akpm@...ux-foundation.org
Subject: Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in
 case of global reclaim

On Tue, Mar 20, 2018 at 04:15:16PM +0300, Kirill Tkhai wrote:
> On 20.03.2018 03:18, Dave Chinner wrote:
> > On Mon, Mar 19, 2018 at 02:06:01PM +0300, Kirill Tkhai wrote:
> >> On 17.03.2018 00:39, Dave Chinner wrote:
> > Actually, it is fair, because:
> > 
> >         /* proportion the scan between the caches */
> >         dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
> >         inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
> >         fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
> > 
> > This means that if the number of objects in the memcg aware VFS
> > caches are tiny compared to the global XFS inode cache, then they
> > only get a tiny amount of the total scanning done by the shrinker.
> 
> This is just wrong. If you have two memcgs, where the first is
> charged for xfs dentries and inodes while the second is not, the
> second will still respond to xfs shrinking just like the first.

That makes no sense to me. Shrinkers are per-sb, so they only
shrink per-filesystem dentries and inodes. If your memcgs are on
different filesystems, then they'll behave according to the size of
that filesystem's caches, not the cache of some other filesystem.

> When we call shrink xfs partition of the second memcg, total_objects
> in super_cache_count() will consist only of cached objects.
> 
> Then, we will call super_cache_scan() in the loop, and 
> 
> total_objects = dentries + inodes + fs_objects + 1 = fs_objects
> fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects) ~ almost equal to sc->nr_to_scan.
> 
> So, if the first memcg is huge and has charged 10GB of cached objects,
> the second memcg will shrink a significant part of these objects
> (depending on the do_shrink_slab() priority) when reclaim comes.

Do we really care which memcg does the global reclaim work as long
as it all gets done in a timely manner? The fact is that when there
is global memory pressure, shrinkers from every memcg will be doing
work all the time and so it really does not matter which shrinker
context does what work. In such cases, memory reclaim is all about
bulk throughput, and we trade off instantaneous fairness for overall
fairness and higher bulk throughput.

And, well, the bulk of XFS inode reclaim is not done by the shrinker
contexts - memcg or otherwise - it's done by an asynchronous,
non-blocking per-filesystem background work which is kicked by a
shrinker scan being run on that filesystem. The only reason we keep
the shrinkers doing direct work is to throttle direct reclaim to the
overall background inode reclaim rate. It's hardly fair for the
filesystem to automatically do the work the memcg should be doing
and doing it faster than the memcg reclaim can do it, is it?

FWIW, inodes are RCU objects, so they aren't freed until RCU grace
periods expire and that means memcg reclaim does not actually free
inodes immediately. They are always delayed to some time after the
memcg shrinker has run and reported it freed those objects.  IOWs,
it really doesn't matter which memcg context does the "freeing"
work, because the freeing of inode memory in any given memcg
actually occurs some time later from callbacks run in a global,
non-memcg aware context....

> > And because all of the memcg shrinkers walk the global inode cache,
> > the overall scanning is shared between all memcgs under memory
> > pressure evenly and so none are unfairly penalised by having to
> > scan the global inode cache to trigger final garbage collection.
> 
> So, if there is single memcg using xfs, the rest of system is
> responsible for its shrinking. Completely unfair.

If there is only one memcg that is generating internal memory
pressure, then only that memcg will be running the shrinker and
reclaiming dentries, inodes and XFS inodes from that memcg. Nothing
else will be running reclaim and hence only the memcg shrinker will
be doing reclaim work. I don't see where you get the "rest of the
system is responsible for its shrinking" from - that just isn't the
way the shrinker infrastructure works, regardless of whether XFS is
involved or not.

> Isn't this easy to understand?

First rule of shrinkers: Shrinkers *aren't easy to understand.*
Second rule of shrinkers: See the first rule.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
