linux-kernel - Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180316213910.GF7000@dastard>
Date:   Sat, 17 Mar 2018 08:39:10 +1100
From:   Dave Chinner <david@...morbit.com>
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     Michal Hocko <mhocko@...nel.org>, darrick.wong@...cle.com,
        linux-xfs@...r.kernel.org, linux-kernel@...r.kernel.org,
        akpm@...ux-foundation.org
Subject: Re: [PATCH RFC] xfs, memcg: Call xfs_fs_nr_cached_objects() only in
 case of global reclaim

On Fri, Mar 16, 2018 at 11:55:30AM +0300, Kirill Tkhai wrote:
> On 16.03.2018 02:03, Dave Chinner wrote:
> > On Thu, Mar 15, 2018 at 10:28:43PM +0300, Kirill Tkhai wrote:
> >> On 15.03.2018 20:49, Michal Hocko wrote:
> >>> On Thu 15-03-18 18:01:34, Kirill Tkhai wrote:
> >>>> xfs_reclaim_inodes_count(XFS_M(sb)) does not care about memcg.
> >>>> So, it's called for memcg reclaim too, e.g. this list is shrinked
> >>>> disproportionality to another lists.
> >>>>
> >>>> This looks confusing, so I'm reporting about this.
> >>>> Consider this patch as RFC.
> >>>
> >>> Could you be more specific about the problem you are trying to solve?
> >>> Because we do skip shrinkers which are not memcg aware by
> >>> shrink_slab:
> >>> 		/*
> >>> 		 * If kernel memory accounting is disabled, we ignore
> >>> 		 * SHRINKER_MEMCG_AWARE flag and call all shrinkers
> >>> 		 * passing NULL for memcg.
> >>> 		 */
> >>> 		if (memcg_kmem_enabled() &&
> >>> 		    !!memcg != !!(shrinker->flags & SHRINKER_MEMCG_AWARE))
> >>> 			continue;
> >>>
> >>> Or am I missing something?
> >>
> >> sb->s_op->nr_cached_objects is a sub-method of generic super_cache_count().
> >> super_cache_count() is owned and only called by superblock's shrinker,
> >> which does have SHRINKER_MEMCG_AWARE flag.
> > 
> > Xfs inodes are accounted to memcgs when they are allocated.  All the
> > memcg reclaim stuff is done at the VFS inode cache level - all the
> > XFS inode cache shrinker does is clean up inodes that are not
> > referenced by the VFS inode cache because the memcg aware reclaim
> > has already freed them.
> > 
> > i.e. what the XFS inode cache is doing is perfectly reasonable -
> > memcg aware inode reclaim is occuring at the VFS level, but on XFS
> > that does not free the inodes as they are still referenced
> > internally by XFS. However, once the inode is removed from the VFS
> > LRU, all memcg information about the inode is destroyed, so there's
> > nothing in the XFS layers that cares about memcgs.
> 
> So, after inode is removed from LRU, memory still remains accounted
> to memcg till the time they are actually freed. I personally don't
> care, just to mention.
>  
> > Hence when the XFS inode shrinker then called to run a
> > garbage collection pass on unreferenced inodes - the inodes that
> > are now unreferenced in the memcg due to the VFS inode shrinker pass
> > - it frees inodes regardless of the memcg context it was called from
> > because that information is no longer present in the inode cache.
> > Hence we just ignore memcgs at this level.
> 
> But xfs_fs_free_cached_objects() returns number of these freed object
> as result to super_cache_scan(), so shrinker interprets them as related
> to a memcg, while they may be related to another memcg. This introduces
> a disproportion relatively to another shrinkers called to memcg.

In what way? All memcgs see tha same values from the backing cache
and so try to do the same amount of scanning work. The shrinker
accounting simply doesn't care where the objects are scanned from,
as long as it comes from the same place as the calculation of the
number of objects in the cache it's about to scan.

The memcg accounting, OTOH, is completely divorced from the
shrinker, so if it's still got too much slab memory accounted to it,
it will simply run the shrinker again and do some more memory
reclaim.

XFS does this for IO efficiency purposes, not because it's ideal
behaviour for memcgs. If we start doing exact memcg-only inode
reclaim, we're going back to the bad old days where inode reclaim
causes really nasty small random write storms that essentially
starve the storage subsystem of all other IO for seconds at a time.
That is a *far worse problem* from a system performance POV than
having the memcg memory reclaim run a couple more shrinker passes
and do a little more work....

> Is there a problem? This is what my patch about.

You've described some theoretical issue, but not described any user
visible or performance impacting behaviour that users actually care
about. What reproducable, observable behaviour does it fix/change?

> > So, is there a problem you are actually trying to fix, or is this
> > simply a "I don't understand how the superblock shrinkers work,
> > please explain" patch?
> 
> I work on some shrinker changes, and just want to understand they don't
> touch anything. 

Oh, goody, another person about to learn the hard way that shrinkers
are far more varied and complex than page reclaim :P

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com