linux-kernel - Re: [RFC] respect the referenced bit of KVM guest pages?

Message-ID: <20090815054524.GB11387@localhost>
Date:	Sat, 15 Aug 2009 13:45:24 +0800
From:	Wu Fengguang <fengguang.wu@...el.com>
To:	Rik van Riel <riel@...hat.com>
Cc:	Johannes Weiner <hannes@...xchg.org>, Avi Kivity <avi@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Andrea Arcangeli <aarcange@...hat.com>,
	"Dike, Jeffrey G" <jeffrey.g.dike@...el.com>,
	"Yu, Wilfred" <wilfred.yu@...el.com>,
	"Kleen, Andi" <andi.kleen@...el.com>,
	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Mel Gorman <mel@....ul.ie>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-mm <linux-mm@...ck.org>
Subject: Re: [RFC] respect the referenced bit of KVM guest pages?

On Fri, Aug 14, 2009 at 09:19:35PM +0800, Rik van Riel wrote:
> Wu Fengguang wrote:
> > On Fri, Aug 14, 2009 at 05:10:55PM +0800, Johannes Weiner wrote:
> 
> >> So even with the active list being a FIFO, we keep usage information
> >> gathered from the inactive list.  If we deactivate pages in arbitrary
> >> list intervals, we throw this away.
> > 
> > We do have the danger of FIFO, if inactive list is small enough, so
> > that (unconditionally) deactivated pages quickly get reclaimed and
> > their life window in inactive list is too small to be useful.
> 
> This is one of the reasons why we unconditionally deactivate
> the active anon pages, and do background scanning of the
> active anon list when reclaiming page cache pages.
> 
> We want to always move some pages to the inactive anon
> list, so it does not get too small.

Right, the current code tries to pull the inactive list out of its
smallish state as long as there is vmscan activity.

However, there is a possible (and tricky) hole: mem cgroups
don't do batched vmscan. shrink_zone() may call shrink_list()
with nr_to_scan=1, in which case shrink_list() _still_ calls
isolate_pages() with the much larger SWAP_CLUSTER_MAX.

This effectively scales up the inactive list scan rate by 10 times while
the list is still small, and may thus prevent it from ever growing.

In that case, LRU becomes FIFO.

Jeff, can you confirm if the mem cgroup's inactive list is small?
If so, this patch should help.
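
To make the effect of batching concrete, here is a minimal userspace
sketch, not the kernel code itself. scan_try_batch() is modeled on the
nr_scan_try_batch() helper that the patch reuses; pages_isolated() and
total_isolated() are hypothetical helpers, and the cluster size of 32 is
an assumed stand-in for SWAP_CLUSTER_MAX. The toy model assumes that any
nonzero scan request isolates a whole cluster, so its numbers are
illustrative only and are not the factor measured above.

```c
#include <assert.h>

/* Assumed stand-in for the kernel's SWAP_CLUSTER_MAX. */
#define CLUSTER_MAX_SKETCH 32UL

/*
 * Accumulate small scan requests in *nr_saved_scan and release them
 * only once a full cluster has built up. Returns 0 while accumulating.
 */
static unsigned long scan_try_batch(unsigned long nr_to_scan,
				    unsigned long *nr_saved_scan,
				    unsigned long cluster_max)
{
	unsigned long nr;

	assert(cluster_max > 0);

	*nr_saved_scan += nr_to_scan;
	nr = *nr_saved_scan;

	if (nr >= cluster_max)
		*nr_saved_scan = 0;	/* batch released, start over */
	else
		nr = 0;			/* keep saving, scan nothing yet */

	return nr;
}

/*
 * Hypothetical simplification of the hole: pages are isolated one
 * whole cluster at a time until the request is covered, so a request
 * of 1 still isolates cluster_max pages.
 */
static unsigned long pages_isolated(unsigned long nr_to_scan,
				    unsigned long cluster_max)
{
	unsigned long isolated = 0;

	while (isolated < nr_to_scan)
		isolated += cluster_max;

	return isolated;
}

/* Total pages isolated over `calls` requests of `per_call` pages each. */
static unsigned long total_isolated(unsigned long calls,
				    unsigned long per_call, int batched)
{
	unsigned long saved = 0, total = 0, i;

	for (i = 0; i < calls; i++) {
		unsigned long nr = per_call;

		if (batched)
			nr = scan_try_batch(per_call, &saved,
					    CLUSTER_MAX_SKETCH);
		total += pages_isolated(nr, CLUSTER_MAX_SKETCH);
	}

	return total;
}
```

In this model, 32 requests of one page each isolate a full cluster on
every call when unbatched, but only a single cluster in total when the
requests are accumulated first. Keeping the saved count per memcg, per
zone and per LRU list mirrors the nr_saved_scan[] array the patch adds.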

Thanks,
Fengguang
---

mm: do batched scans for mem_cgroup

Signed-off-by: Wu Fengguang <fengguang.wu@...el.com>
---
 include/linux/memcontrol.h |    3 +++
 mm/memcontrol.c            |   12 ++++++++++++
 mm/vmscan.c                |    9 +++++----
 3 files changed, 20 insertions(+), 4 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2009-08-15 13:12:49.000000000 +0800
+++ linux/include/linux/memcontrol.h	2009-08-15 13:18:13.000000000 +0800
@@ -98,6 +98,9 @@ int mem_cgroup_inactive_file_is_low(stru
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
 				       struct zone *zone,
 				       enum lru_list lru);
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru);
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone);
 struct zone_reclaim_stat*
--- linux.orig/mm/memcontrol.c	2009-08-15 13:07:34.000000000 +0800
+++ linux/mm/memcontrol.c	2009-08-15 13:17:56.000000000 +0800
@@ -115,6 +115,7 @@ struct mem_cgroup_per_zone {
 	 */
 	struct list_head	lists[NR_LRU_LISTS];
 	unsigned long		count[NR_LRU_LISTS];
+	unsigned long		nr_saved_scan[NR_LRU_LISTS];
 
 	struct zone_reclaim_stat reclaim_stat;
 };
@@ -597,6 +598,17 @@ unsigned long mem_cgroup_zone_nr_pages(s
 	return MEM_CGROUP_ZSTAT(mz, lru);
 }
 
+unsigned long *mem_cgroup_get_saved_scan(struct mem_cgroup *memcg,
+					 struct zone *zone,
+					 enum lru_list lru)
+{
+	int nid = zone->zone_pgdat->node_id;
+	int zid = zone_idx(zone);
+	struct mem_cgroup_per_zone *mz = mem_cgroup_zoneinfo(memcg, nid, zid);
+
+	return &mz->nr_saved_scan[lru];
+}
+
 struct zone_reclaim_stat *mem_cgroup_get_reclaim_stat(struct mem_cgroup *memcg,
 						      struct zone *zone)
 {
--- linux.orig/mm/vmscan.c	2009-08-15 13:04:54.000000000 +0800
+++ linux/mm/vmscan.c	2009-08-15 13:19:03.000000000 +0800
@@ -1534,6 +1534,7 @@ static void shrink_zone(int priority, st
 	for_each_evictable_lru(l) {
 		int file = is_file_lru(l);
 		unsigned long scan;
+		unsigned long *saved_scan;
 
 		scan = zone_nr_pages(zone, sc, l);
 		if (priority || noswap) {
@@ -1541,11 +1542,11 @@ static void shrink_zone(int priority, st
 			scan = (scan * percent[file]) / 100;
 		}
 		if (scanning_global_lru(sc))
-			nr[l] = nr_scan_try_batch(scan,
-						  &zone->lru[l].nr_saved_scan,
-						  swap_cluster_max);
+			saved_scan = &zone->lru[l].nr_saved_scan;
 		else
-			nr[l] = scan;
+			saved_scan = mem_cgroup_get_saved_scan(sc->mem_cgroup,
+							       zone, l);
+		nr[l] = nr_scan_try_batch(scan, saved_scan, swap_cluster_max);
 	}
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||