Message-ID: <20140205015352.GW6963@cmpxchg.org>
Date: Tue, 4 Feb 2014 20:53:52 -0500
From: Johannes Weiner <hannes@...xchg.org>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Andi Kleen <andi@...stfloor.org>,
Andrea Arcangeli <aarcange@...hat.com>,
Bob Liu <bob.liu@...cle.com>,
Christoph Hellwig <hch@...radead.org>,
Dave Chinner <david@...morbit.com>,
Greg Thelen <gthelen@...gle.com>,
Hugh Dickins <hughd@...gle.com>, Jan Kara <jack@...e.cz>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Luigi Semenzato <semenzato@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Metin Doslu <metin@...usdata.com>,
Michel Lespinasse <walken@...gle.com>,
Minchan Kim <minchan.kim@...il.com>,
Ozgun Erdogan <ozgun@...usdata.com>,
Peter Zijlstra <peterz@...radead.org>,
Rik van Riel <riel@...hat.com>,
Roman Gushchin <klamm@...dex-team.ru>,
Ryan Mallon <rmallon@...il.com>, Tejun Heo <tj@...nel.org>,
Vlastimil Babka <vbabka@...e.cz>, linux-mm@...ck.org,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch 10/10] mm: keep page cache radix tree nodes in check
On Tue, Feb 04, 2014 at 03:07:56PM -0800, Andrew Morton wrote:
> On Mon, 3 Feb 2014 19:53:42 -0500 Johannes Weiner <hannes@...xchg.org> wrote:
>
> > Previously, page cache radix tree nodes were freed after reclaim
> > emptied out their page pointers. But now reclaim stores shadow
> > entries in their place, which are only reclaimed when the inodes
> > themselves are reclaimed. This is problematic for bigger files that
> > are still in use after they have a significant amount of their cache
> > reclaimed, without any of those pages actually refaulting. The shadow
> > entries will just sit there and waste memory. In the worst case, the
> > shadow entries will accumulate until the machine runs out of memory.
> >
> > To get this under control, the VM will track radix tree nodes
> > exclusively containing shadow entries on a per-NUMA node list.
> > Per-NUMA rather than global because we expect the radix tree nodes
> > themselves to be allocated node-locally and we want to reduce
> > cross-node references of otherwise independent cache workloads. A
> > simple shrinker will then reclaim these nodes on memory pressure.
    ^^^^^^^^^^^^^^^
> > A few things need to be stored in the radix tree node to implement the
> > shadow node LRU and allow tree deletions coming from the list:
> >
> > 1. There is no index available that would describe the reverse path
> > from the node up to the tree root, which is needed to perform a
> > deletion. To solve this, encode in each node its offset inside the
> > parent. This can be stored in the unused upper bits of the same
> > member that stores the node's height at no extra space cost.
> >
> > 2. The number of shadow entries needs to be counted in addition to the
> > regular entries, to quickly detect when the node is ready to go to
> > the shadow node LRU list. The current entry count is an unsigned
> > int but the maximum number of entries is 64, so a shadow counter
> > can easily be stored in the unused upper bits.
> >
> > 3. Tree modification needs tree lock and tree root, which are located
> > in the address space, so store an address_space backpointer in the
> > node. The parent pointer of the node is in a union with the 2-word
> > rcu_head, so the backpointer comes at no extra cost as well.
> >
> > 4. The node needs to be linked to an LRU list, which requires a list
> > head inside the node. This does increase the size of the node, but
> > it does not change the number of objects that fit into a slab page.
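
To illustrate point 2, here is a minimal sketch of keeping two counters
in one unsigned int. The names (node_sketch, COUNT_SHIFT and the helper
functions) are made up for illustration and are not taken from the
patch; the only facts assumed from the text above are 64 slots per node
and an unsigned int counter. Which half holds which counter is also an
assumption.

```c
#include <assert.h>

#define MAP_SHIFT   6                        /* 64 slots per node */
#define COUNT_SHIFT (MAP_SHIFT + 1)          /* 7 bits can hold 0..64 */
#define COUNT_MASK  ((1U << COUNT_SHIFT) - 1)

/* Hypothetical stand-in for the node; only the packed counter is shown. */
struct node_sketch {
	unsigned int count;  /* low bits: page entries, upper bits: shadows */
};

static unsigned int node_pages(const struct node_sketch *n)
{
	return n->count & COUNT_MASK;
}

static unsigned int node_shadows(const struct node_sketch *n)
{
	return n->count >> COUNT_SHIFT;
}

static void node_add_page(struct node_sketch *n)
{
	assert(node_pages(n) + node_shadows(n) < (1U << MAP_SHIFT));
	n->count++;
}

/* Reclaim replaces a page pointer with a shadow entry in the same slot. */
static void node_page_to_shadow(struct node_sketch *n)
{
	assert(node_pages(n) > 0);
	n->count--;
	n->count += 1U << COUNT_SHIFT;
}

/* A node holding shadows and no pages is a candidate for the LRU list. */
static int node_is_shadow_only(const struct node_sketch *n)
{
	return node_shadows(n) > 0 && node_pages(n) == 0;
}
```

Both counters are read with one mask or one shift, so the check for
"only shadow entries left" costs nothing beyond loading the field.
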
>
> changelog forgot to mention that this reclaim is performed via a
> shrinker...

Uhm... see above? :)

> How expensive is that list walk in scan_shadow_nodes()? I assume in
> the best case it will bale out after nr_to_scan iterations?

Yes, it scans sc->nr_to_scan radix tree nodes, cleans their pointers,
and frees them.

I ran a worst-case scenario on an 8G machine that creates one 8T
sparse file and faults one page per 64-page radix tree node, i.e. one
node per sparse file fault at CPU speed. The profile:

     1   9.21%  radixblow  [kernel.kallsyms]  [k] memset
     2   7.23%  radixblow  [kernel.kallsyms]  [k] do_mpage_readpage
     3   4.76%  radixblow  [kernel.kallsyms]  [k] copy_user_generic_string
     4   3.85%  radixblow  [kernel.kallsyms]  [k] __radix_tree_lookup
     5   3.32%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate
     6   2.92%  radixblow  [kernel.kallsyms]  [k] get_page_from_freelist
     7   2.81%  kswapd0    [kernel.kallsyms]  [k] __delete_from_page_cache
     8   2.50%  radixblow  [kernel.kallsyms]  [k] radix_tree_node_ctor
     9   1.79%  radixblow  [kernel.kallsyms]  [k] _raw_spin_lock_irq
    10   1.70%  kswapd0    [kernel.kallsyms]  [k] __mem_cgroup_uncharge_common

Same scenario with 4 pages per 64-page radix tree node:

    13   1.39%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate

16 pages per 64-page node:

    75   0.20%  kswapd0    [kernel.kallsyms]  [k] shadow_lru_isolate

So I doubt this will bother anyone, especially since most use-once
streamers should have a better population density and populate cache
at disk speed, not CPU speed.
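
For reference, the access pattern of the worst-case scenario can be
sketched roughly as below. This is not the radixblow program itself,
which is not included in this mail; the function names, the use of
pread() to populate the cache, and the 4K page size in the example are
all assumptions made for illustration.

```c
#define _FILE_OFFSET_BITS 64
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define SLOTS_PER_NODE 64	/* pages covered by one leaf radix tree node */

/* Leaf nodes needed to cover a file of the given size. */
static uint64_t leaf_nodes(uint64_t file_size, uint64_t page_size)
{
	uint64_t span = SLOTS_PER_NODE * page_size;

	return (file_size + span - 1) / span;
}

/*
 * Create a sparse file and read one byte from the first pages_per_node
 * pages of every node-sized window. Returns the number of pages
 * touched, or -1 on error.
 */
static int64_t touch_sparse(const char *path, uint64_t file_size,
			    uint64_t page_size, unsigned int pages_per_node)
{
	uint64_t span = SLOTS_PER_NODE * page_size;
	int64_t touched = 0;
	char byte;
	int fd;

	if (pages_per_node == 0 || pages_per_node > SLOTS_PER_NODE)
		return -1;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;

	/* Extending with ftruncate() allocates no data blocks. */
	if (ftruncate(fd, (off_t)file_size) < 0) {
		close(fd);
		return -1;
	}

	for (uint64_t base = 0; base < file_size; base += span) {
		for (unsigned int p = 0; p < pages_per_node; p++) {
			uint64_t off = base + p * page_size;

			if (off >= file_size)
				break;
			/* Reading a hole returns zeroes but still fills cache. */
			if (pread(fd, &byte, 1, (off_t)off) != 1) {
				close(fd);
				return -1;
			}
			touched++;
		}
	}

	close(fd);
	return touched;
}
```

With 4K pages, an 8T file spans leaf_nodes(8ULL << 40, 4096) =
33554432 leaf nodes, so the one-page-per-node case creates that many
nearly empty nodes. Passing 4 or 16 as pages_per_node corresponds to
the denser runs above.
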