linux-kernel - Re: [patch 9/9] mm: keep page cache radix tree nodes in check


Message-ID: <20131127005948.GD10988@dastard>
Date:	Wed, 27 Nov 2013 11:59:48 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Johannes Weiner <hannes@...xchg.org>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Rik van Riel <riel@...hat.com>, Jan Kara <jack@...e.cz>,
	Vlastimil Babka <vbabka@...e.cz>,
	Peter Zijlstra <peterz@...radead.org>,
	Tejun Heo <tj@...nel.org>, Andi Kleen <andi@...stfloor.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Greg Thelen <gthelen@...gle.com>,
	Christoph Hellwig <hch@...radead.org>,
	Hugh Dickins <hughd@...gle.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>,
	Minchan Kim <minchan.kim@...il.com>,
	Michel Lespinasse <walken@...gle.com>,
	Seth Jennings <sjenning@...ux.vnet.ibm.com>,
	Roman Gushchin <klamm@...dex-team.ru>,
	Ozgun Erdogan <ozgun@...usdata.com>,
	Metin Doslu <metin@...usdata.com>, linux-mm@...ck.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch 9/9] mm: keep page cache radix tree nodes in check

On Tue, Nov 26, 2013 at 06:00:10PM -0500, Johannes Weiner wrote:
> On Wed, Nov 27, 2013 at 09:29:37AM +1100, Dave Chinner wrote:
> > On Tue, Nov 26, 2013 at 04:27:25PM -0500, Johannes Weiner wrote:
> > > On Tue, Nov 26, 2013 at 10:49:21AM +1100, Dave Chinner wrote:
> > > > On Sun, Nov 24, 2013 at 06:38:28PM -0500, Johannes Weiner wrote:
> > > > > Previously, page cache radix tree nodes were freed after reclaim
> > > > > emptied out their page pointers.  But now reclaim stores shadow
> > > > > entries in their place, which are only reclaimed when the inodes
> > > > > themselves are reclaimed.  This is problematic for bigger files that
> > > > > are still in use after they have a significant amount of their cache
> > > > > reclaimed, without any of those pages actually refaulting.  The shadow
> > > > > entries will just sit there and waste memory.  In the worst case, the
> > > > > shadow entries will accumulate until the machine runs out of memory.
> > ....
> > > > ....
> > > > > +	radix_tree_replace_slot(slot, page);
> > > > > +	if (node) {
> > > > > +		node->count++;
> > > > > +		/* Installed page, can't be shadow-only anymore */
> > > > > +		if (!list_empty(&node->lru))
> > > > > +			list_lru_del(&workingset_shadow_nodes, &node->lru);
> > > > > +	}
> > > > > +	return 0;
> > > > 
> > > > Hmmmmm - what's the overhead of direct management of LRU removal
> > > > here? Most list_lru code uses lazy removal (i.e. via the shrinker)
> > > > to avoid having to touch the LRU when adding new references to an
> > > > object.....
> > > 
> > > It's measurable in microbenchmarks, but not when any real IO is
> > > involved.  The difference was in the noise even on SSD drives.
> > 
> > Well, it's not an SSD or two I'm worried about - it's devices that
> > > can do millions of IOPS where this is likely to be noticeable...
> > 
> > > The other list_lru users see items only once they become unused and
> > > subsequent references are expected to be few and temporary, right?
> > 
> > They go onto the list when the refcount falls to zero, but reuse can
> > be frequent when being referenced repeatedly by a single user. That
> > avoids every reuse from removing the object from the LRU then
> > putting it back on the LRU for every reference cycle...
> 
> That's true, but it's less of a concern in the radix_tree_node case
> because it takes a full inactive list cycle after a refault before the
> node is put back on the LRU.  Or a really unlikely placed partial node
> truncation/invalidation (full truncation would just delete the whole
> node anyway).

OK, fair enough. We can deal with the problem if we see it being a
limitation.

> > > We expect pages to refault in spades on certain loads, at which point
> > > we may have thousands of those nodes on the list that are no longer
> > > reclaimable (10k nodes for about 2.5G of cache).
> > 
> > Sure, look at the way the inode and dentry caches work - entire
> > caches of millions of inodes and dentries often sit on the LRUs. A
> > quick look at my workstation's dentry cache shows:
> > 
> > $ cat /proc/sys/fs/dentry-state
> > 180108  170596  45      0       0       0
> > 
> > 180k allocated dentries, 170k sitting on the LRU...
> 
> Hm, and a significant amount of those 170k could rotate on the next
> shrinker scan due to recent references or do you generally have
> smaller spikes?

I see very little dentry/inode reclaim because the shrinker tends to
skip most inodes and dentries because they have the referenced bit
set on them whenever the shrinker runs. i.e. that's the working set,
and it gets maintained pretty well...
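
To illustrate the lazy scheme, here is a rough userspace sketch in C.
It is not the kernel's list_lru or dcache code: struct obj, struct lru,
obj_touch() and lru_scan() are hypothetical names made up for this
example, and locking is left out entirely. A new reference only sets a
flag and never touches the list; the scan gives referenced objects a
second chance by clearing the flag and rotating them to the tail, and
reclaims the rest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical cached object; NOT the kernel's dentry or inode. */
struct obj {
	struct obj *prev, *next;	/* LRU linkage, self-linked when off-list */
	bool referenced;		/* set on reuse, cleared by the scan */
	int id;
};

/* Hypothetical LRU: circular doubly linked list with a sentinel head. */
struct lru {
	struct obj head;
	size_t nr;
};

static void lru_init(struct lru *l)
{
	l->head.prev = l->head.next = &l->head;
	l->nr = 0;
}

static void obj_init(struct obj *o, int id)
{
	o->prev = o->next = o;
	o->referenced = false;
	o->id = id;
}

static void lru_add_tail(struct lru *l, struct obj *o)
{
	o->prev = l->head.prev;
	o->next = &l->head;
	l->head.prev->next = o;
	l->head.prev = o;
	l->nr++;
}

static void lru_del(struct lru *l, struct obj *o)
{
	assert(l->nr > 0);
	o->prev->next = o->next;
	o->next->prev = o->prev;
	o->prev = o->next = o;
	l->nr--;
}

/* Lazy path: taking a reference does not touch the list at all. */
static void obj_touch(struct obj *o)
{
	o->referenced = true;
}

/*
 * Scan up to nr_to_scan objects from the head of the list.
 * Referenced objects have the flag cleared and are rotated to the
 * tail; unreferenced objects are taken off the list and counted as
 * reclaimed (a real cache would free them here).
 */
static size_t lru_scan(struct lru *l, size_t nr_to_scan)
{
	size_t freed = 0;

	while (nr_to_scan-- && l->nr) {
		struct obj *o = l->head.next;

		lru_del(l, o);
		if (o->referenced) {
			o->referenced = false;
			lru_add_tail(l, o);
		} else {
			freed++;
		}
	}
	return freed;
}
```

The eager alternative, which is what the patch does for shadow nodes,
would call something like lru_del() directly from the path that takes
the new reference, paying the list cost there instead of in the scan.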

> But as per above I think the case for lazily removing shadow nodes is
> less convincing than for inodes and dentries.

Agreed.

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com