Message-ID: <20131016223104.GA738@cmpxchg.org>
Date:	Wed, 16 Oct 2013 18:31:04 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	Dave Chinner <david@...morbit.com>
Cc:	Rik van Riel <riel@...hat.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Andi Kleen <andi@...stfloor.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Greg Thelen <gthelen@...gle.com>,
	Christoph Hellwig <hch@...radead.org>,
	Hugh Dickins <hughd@...gle.com>, Jan Kara <jack@...e.cz>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Mel Gorman <mgorman@...e.de>,
	Minchan Kim <minchan.kim@...il.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Michel Lespinasse <walken@...gle.com>,
	Seth Jennings <sjenning@...ux.vnet.ibm.com>,
	Roman Gushchin <klamm@...dex-team.ru>,
	Ozgun Erdogan <ozgun@...usdata.com>,
	Metin Doslu <metin@...usdata.com>,
	Vlastimil Babka <vbabka@...e.cz>, Tejun Heo <tj@...nel.org>,
	linux-mm@...ck.org, linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [patch 0/8] mm: thrash detection-based file cache sizing v5

On Wed, Oct 16, 2013 at 01:26:06PM +1100, Dave Chinner wrote:
> On Tue, Oct 15, 2013 at 10:05:26PM -0400, Rik van Riel wrote:
> > On 10/15/2013 07:41 PM, Dave Chinner wrote:
> > > On Tue, Oct 15, 2013 at 01:41:28PM -0400, Johannes Weiner wrote:
> > 
> > >> I'm not forgetting about them, I just track them very coarsely by
> > >> linking up address spaces and then lazily enforce their upper limit
> > >> when memory is tight by using the shrinker callback.  The assumption
> > >> was that actually scanning them is such a rare event that we trade the
> > >> rare computational costs for smaller memory consumption most of the
> > >> time.
> > > 
> > > Sure, I understand the tradeoff that you made. But there's nothing
> > > worse than a system that slows down unpredictably because some
> > > magic threshold in some subsystem has been crossed and
> > > computationally expensive operations kick in.
> > 
> > The shadow shrinker should remove the radix nodes with
> > the oldest shadow entries first, so true LRU should actually
> > work for the radix tree nodes.
> > 
> > Actually, since we only care about the age of the youngest
> > shadow entry in each radix tree node, FIFO will be the same
> > as LRU for that list.
> > 
> > That means the shrinker can always just take the radix tree
> > nodes off the end.
> 
> Right, but it can't necessarily free the node as it may still have
> pointers to pages in it. In that case, it would have to simply
> rotate the node to the end of the LRU again.
> 
> Unless, of course, we kept track of the number of exceptional
> entries in a node and didn't add it to the reclaim list until there
> were no non-exceptional entries in the node....

I think that's doable.  As long as there is an actual page in the
node, no memory is going to be freed anyway.  And since we have a full
int per node with only 64 slots, we can easily split the counter.
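
Roughly like this, maybe (untested sketch, all names made up; assumes
->count stays an unsigned int and RADIX_TREE_MAP_SIZE stays 64):

#define RADIX_TREE_COUNT_SHIFT	16
#define RADIX_TREE_COUNT_MASK	((1U << RADIX_TREE_COUNT_SHIFT) - 1)

/* Pages and shadow entries share the existing node->count word. */
static unsigned int node_page_count(struct radix_tree_node *node)
{
	return node->count & RADIX_TREE_COUNT_MASK;
}

static unsigned int node_shadow_count(struct radix_tree_node *node)
{
	return node->count >> RADIX_TREE_COUNT_SHIFT;
}

/*
 * Only nodes that hold nothing but shadow entries would go on the
 * shrinker's list; anything with a real page in it is pinned anyway.
 */
static bool node_reclaimable(struct radix_tree_node *node)
{
	return !node_page_count(node) && node_shadow_count(node);
}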

> > >> But it
> > >> looks like tracking radix tree nodes with a list and backpointers to
> > >> the mapping object for the lock etc. will be a major pain in the ass.
> > > 
> > > Perhaps so - it may not work out when we get down to the fine
> > > details...
> > 
> > I suspect that a combination of lifetime rules (the inode cannot
> > disappear until all of its radix tree nodes are gone) and using RCU
> > freeing for the radix tree nodes and the inodes might do the trick.
> > 
> > That would mean that, while holding the rcu read lock, the
> > back pointer from a radix tree node to the inode will always
> > point to valid memory.
> 
> Yes, that is what I was thinking...
> 
> > That allows the shrinker to lock the inode, and verify that
> > the inode is still valid, before it attempts to rcu free the
> > radix tree node with shadow entries.
> 
> Lock the mapping, not the inode. The radix tree is protected by the
> mapping_lock, not an inode lock. i.e. I'd hope that this can all be
> contained within the struct address_space and not require any
> knowledge of inodes or inode lifecycles at all.

Agreed, we can point to struct address_space and invalidate it by
setting mapping->host to NULL or so during the RCU grace period.
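
Shrinker-side, something like this (sketch only; node->mapping is the
hypothetical backpointer, and it assumes teardown clears mapping->host
under mapping->tree_lock before RCU-freeing the mapping):

/* Returns the mapping with tree_lock held, or NULL if it went away. */
static struct address_space *shadow_node_mapping(struct radix_tree_node *node)
{
	struct address_space *mapping;

	rcu_read_lock();
	mapping = rcu_dereference(node->mapping);	/* hypothetical field */
	if (!mapping) {
		rcu_read_unlock();
		return NULL;
	}
	spin_lock_irq(&mapping->tree_lock);
	if (!mapping->host) {
		/* Teardown won the race; it will clean up the node. */
		spin_unlock_irq(&mapping->tree_lock);
		rcu_read_unlock();
		return NULL;
	}
	/*
	 * Holding tree_lock keeps teardown from completing, so the
	 * mapping stays good after we leave the RCU read section.
	 */
	rcu_read_unlock();
	return mapping;
}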

Also, the parent pointer is in a union with the two-word rcu_head, so
we get the address space backpointer for free.

The struct list_head for the FIFO, as you said, we can get for free as
well (at least on slab).
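
So the node would end up looking roughly like this (simplified sketch,
field names invented; the real definition lives in lib/radix-tree.c):

struct radix_tree_node {
	unsigned int		height;
	unsigned int		count;	  /* split: pages | shadow entries */
	union {
		struct {
			struct radix_tree_node	*parent;  /* tree walks */
			struct address_space	*mapping; /* shadow backpointer */
		};
		struct rcu_head	rcu_head; /* two words, used only at free */
	};
	struct list_head	shadow_list;	/* FIFO linkage, slab slack */
	void __rcu		*slots[RADIX_TREE_MAP_SIZE];
	unsigned long		tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};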

The FIFO lists themselves can be trimmed by a shrinker, I think.  You
were concerned about wind-up, but if the nodes are not in excess and
->count just returns 0 until we are actually prepared to shrink
objects, then there shouldn't be any wind-up, right?
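
I.e. something along these lines (sketch; shadow_node_count,
shadow_node_limit() and shadow_prune_nodes() are all made up, and where
that limit would come from is exactly the open question below):

static unsigned long shadow_node_count;	/* nodes on the shadow FIFO */

static unsigned long shadow_shrink_count(struct shrinker *shrinker,
					 struct shrink_control *sc)
{
	unsigned long limit = shadow_node_limit();	/* hypothetical */

	/* Report nothing until there is real excess -> no wind-up. */
	if (shadow_node_count <= limit)
		return 0;
	return shadow_node_count - limit;
}

static unsigned long shadow_shrink_scan(struct shrinker *shrinker,
					struct shrink_control *sc)
{
	/*
	 * Hypothetical helper: take nodes off the cold end of the
	 * FIFO, validate the mapping backpointer as sketched above,
	 * clear their shadow entries and free them.
	 */
	return shadow_prune_nodes(sc->nr_to_scan);
}

static struct shrinker shadow_shrinker = {
	.count_objects	= shadow_shrink_count,
	.scan_objects	= shadow_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

That would get registered once with register_shrinker(&shadow_shrinker)
and otherwise stay completely idle until the FIFO grows past the limit.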

I don't see a natural threshold for "excess" yet, though, because
there is no relationship between where radix_tree_node is allocated
and which zones the contained shadow entries point to, so it's hard to
estimate how worthwhile any given radix_tree_node is to a node.  Maybe
tying it to an event, like reclaim pressure or swapping, would be
better.

> > It also means that locking only needs to be in the inode,
> > and on the LRU list for shadow radix tree nodes.
> > 
> > Does that sound sane?
> > 
> > Am I overlooking something?
> 
> It's pretty much along the same lines as what I was thinking, but
> let's see what Johannes thinks.

That sounds great to me.  It's just that I have looked at this code for
so long that I'm getting blind to it, so I really appreciate your input.
