linux-kernel - Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20190411210200.GH2217@ZenIV.linux.org.uk>
Date:   Thu, 11 Apr 2019 22:02:00 +0100
From:   Al Viro <viro@...iv.linux.org.uk>
To:     "Tobin C. Harding" <me@...in.cc>
Cc:     "Tobin C. Harding" <tobin@...nel.org>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <guro@...com>,
        Alexander Viro <viro@....linux.org.uk>,
        Christoph Hellwig <hch@...radead.org>,
        Pekka Enberg <penberg@...helsinki.fi>,
        David Rientjes <rientjes@...gle.com>,
        Joonsoo Kim <iamjoonsoo.kim@....com>,
        Christopher Lameter <cl@...ux.com>,
        Matthew Wilcox <willy@...radead.org>,
        Miklos Szeredi <mszeredi@...hat.com>,
        Andreas Dilger <adilger@...ger.ca>,
        Waiman Long <longman@...hat.com>,
        Tycho Andersen <tycho@...ho.ws>, Theodore Ts'o <tytso@....edu>,
        Andi Kleen <ak@...ux.intel.com>,
        David Chinner <david@...morbit.com>,
        Nick Piggin <npiggin@...il.com>,
        Rik van Riel <riel@...hat.com>,
        Hugh Dickins <hughd@...gle.com>,
        Jonathan Corbet <corbet@....net>, linux-mm@...ck.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        Linus Torvalds <torvalds@...ux-foundation.org>
Subject: Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab
 Movable Objects

On Thu, Apr 11, 2019 at 05:47:46AM +0100, Al Viro wrote:

> Note, BTW, that umount coming between isolate and drop is not a problem;
> it call shrink_dcache_parent() on the root.  And if shrink_dcache_parent()
> finds something on (another) shrink list, it won't put it to the shrink
> list of its own, but it will make note of that and repeat the scan in
> such case.  So if we find something with zero refcount and not on
> shrink list, we can move it to our shrink list and be sure that its
> superblock won't go away under us...

Aaaarrgghhh...  No, we can't.  Look: we get one candidate dentry in isolate
phase.  We put it into shrink list.  umount(2) comes and calls
shrink_dcache_for_umount(), which calls shrink_dcache_parent(root).
In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on
that one dentry.  Fine, it's gone - before shrink_dcache_parent() even
sees it.  Now shrink_dentry_list() holds a reference to its parent and
is about to drop it in
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
And dropped it will be, but... shrink_dcache_parent() has finished the
scan, without finding *anything* with zero refcount - the thing that used
to be on the shrink list was already gone before shrink_dcache_parent()
has gotten there and the reference to parent was not dropped yet.  So
shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the
tree and complains loudly about "busy" dentries (that parent we hadn't
finished dropping), and then we proceed with filesystem shutdown.
In the meanwhile, dentry_kill() finally gets to killing dentry and
triggers an unexpected late call of ->d_iput() on a filesystem that
has already been far enough into shutdown - far enough to destroy the
data structures needed for that sucker.

The reason we don't hit that problem with regular memory shrinker is
this:
                unregister_shrinker(&s->s_shrink);
                fs->kill_sb(s);
in deactivate_locked_super().  IOW, shrinker for this fs is gone
before we get around to shutdown.  And so are all normal sources
of dentry eviction for that fs.

Your earlier variants all suffer the same problem - picking a page
shared by dentries from several superblocks can run into trouble
if it overlaps with umount of one of those.

Fuck...  One variant of solution would be to have per-superblock
struct kmem_cache to be used for dentries of that superblock.
However,
	* we'd need to prevent them getting merged
	* it would add per-superblock memory costs (for struct
kmem_cache and associated structures)
	* it might mean more pages eaten by the dentries -
on average half a page per superblock (more if there are very
few dentries on that superblock)

OTOH, it might actually improve the memory footprint - all
dentries sharing a page would be from the same superblock,
so the use patterns might be more similar, which might
lower the fragmentation...

Hell knows...  I'd like to hear an opinion from VM folks on
that one.  Comments?