Message-Id: <bdd6628e81c06f6871983c971d91160fca3f8b5e.1290349672.git.minchan.kim@gmail.com>
Date: Sun, 21 Nov 2010 23:30:23 +0900
From: Minchan Kim <minchan.kim@...il.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: linux-mm <linux-mm@...ck.org>, LKML <linux-kernel@...r.kernel.org>,
Minchan Kim <minchan.kim@...il.com>,
Peter Zijlstra <peterz@...radead.org>,
Rik van Riel <riel@...hat.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
Johannes Weiner <hannes@...xchg.org>,
Nick Piggin <npiggin@...nel.dk>
Subject: [RFC 1/2] deactive invalidated pages
Recently, a thrashing problem caused by backup workloads (e.g. a
nightly rsync) was reported:
(http://marc.info/?l=rsync&m=128885034930933&w=2)
The workload generates use-once pages but touches each page twice,
which promotes the pages to the active list and ends up evicting the
real working set.

Some application developers would like to use POSIX_FADV_NOREUSE for
this, but other OSes don't support it either:
(http://marc.info/?l=linux-mm&m=128928979512086&w=2)
As an alternative, application developers use POSIX_FADV_DONTNEED,
but it has a problem: if the kernel finds a page that is still being
written back during invalidate_mapping_pages, it cannot drop it. That
makes it hard for application programmers to use, because they always
have to sync the data before calling fadvise(POSIX_FADV_DONTNEED) to
make sure the pages are discardable. In the end they cannot benefit
from the kernel's deferred writes, so they see a performance loss
(see the sketch below).
(http://insights.oetiker.ch/linux/fadvise.html)
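
For illustration only (not part of this patch), the workaround
applications are forced into today looks roughly like this; the
helper name is made up:

	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * Sketch of the current workaround: the data has to be flushed
	 * first, otherwise dirty/writeback pages survive
	 * POSIX_FADV_DONTNEED and stay in the page cache.
	 */
	static int drop_cache_after_write(int fd, off_t offset, off_t len)
	{
		if (fsync(fd) < 0)	/* deferred writeback is lost here */
			return -1;
		return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
	}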
In fact, invalidate is a very strong hint to the reclaimer: it means
we don't use the page any more. So let's move a page that is still
being written back to the head of the inactive list. If it really is
part of the working set, there is enough time for it to be activated
again, since we always try to keep plenty of pages on the inactive
list.
This reuses Peter's lru_demote with some changes.
Reported-by: Ben Gamari <bgamari.foss@...il.com>
Signed-off-by: Minchan Kim <minchan.kim@...il.com>
Signed-off-by: Peter Zijlstra <peterz@...radead.org>
Cc: Rik van Riel <riel@...hat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>
Cc: Johannes Weiner <hannes@...xchg.org>
Cc: Nick Piggin <npiggin@...nel.dk>
Ben, the remaining work is to modify rsync to use
fadvise(POSIX_FADV_DONTNEED). Could you test it? A rough idea of the
userspace side is sketched below.
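
Something along these lines is what I have in mind for the rsync
side; this is only a sketch, and done_with_file is a made-up name,
not an existing rsync function:

	#include <fcntl.h>

	/*
	 * With this patch the fsync() before DONTNEED should no longer
	 * be needed: pages that are still dirty or under writeback are
	 * simply moved to the head of the inactive list instead of
	 * staying on the active list.
	 */
	static void done_with_file(int fd, off_t filesize)
	{
		(void)posix_fadvise(fd, 0, filesize, POSIX_FADV_DONTNEED);
	}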
---
include/linux/swap.h | 1 +
mm/swap.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++
mm/truncate.c | 11 +++++---
3 files changed, 69 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index eba53e7..a3c9248 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -213,6 +213,7 @@ extern void mark_page_accessed(struct page *);
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
+extern void lru_deactive_page(struct page *page);
extern void swap_setup(void);
extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/swap.c b/mm/swap.c
index 3f48542..56fa298 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -39,6 +39,8 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec[NR_LRU_LISTS], lru_add_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactive_pvecs);
+
/*
* This path almost never happens for VM activity - pages are normally
@@ -266,6 +268,45 @@ void add_page_to_unevictable_list(struct page *page)
spin_unlock_irq(&zone->lru_lock);
}
+static void __pagevec_lru_deactive(struct pagevec *pvec)
+{
+ int i, lru, file;
+
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+
+ if (PageLRU(page)) {
+ if (PageActive(page)) {
+ file = page_is_file_cache(page);
+ lru = page_lru_base_type(page);
+ del_page_from_lru_list(zone, page,
+ lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ add_page_to_lru_list(zone, page, lru);
+ __count_vm_event(PGDEACTIVATE);
+
+ update_page_reclaim_stat(zone, page, file, 0);
+ }
+ }
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -292,8 +333,28 @@ static void drain_cpu_pagevecs(int cpu)
pagevec_move_tail(pvec);
local_irq_restore(flags);
}
+
+ pvec = &per_cpu(lru_deactive_pvecs, cpu);
+ if (pagevec_count(pvec))
+ __pagevec_lru_deactive(pvec);
+}
+
+/*
+ * Function used to forcefully demote a page to the head of the inactive
+ * list.
+ */
+void lru_deactive_page(struct page *page)
+{
+ if (likely(get_page_unless_zero(page))) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactive_pvecs);
+
+ if (!pagevec_add(pvec, page))
+ __pagevec_lru_deactive(pvec);
+ put_cpu_var(lru_deactive_pvecs);
+ }
}
+
void lru_add_drain(void)
{
drain_cpu_pagevecs(get_cpu());
diff --git a/mm/truncate.c b/mm/truncate.c
index cd94607..c73fb19 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -332,7 +332,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
{
struct pagevec pvec;
pgoff_t next = start;
- unsigned long ret = 0;
+ unsigned long ret;
+ unsigned long count = 0;
int i;
pagevec_init(&pvec, 0);
@@ -359,8 +360,10 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
if (lock_failed)
continue;
- ret += invalidate_inode_page(page);
-
+ ret = invalidate_inode_page(page);
+ if (!ret)
+ lru_deactive_page(page);
+ count += ret;
unlock_page(page);
if (next > end)
break;
@@ -369,7 +372,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
mem_cgroup_uncharge_end();
cond_resched();
}
- return ret;
+ return count;
}
EXPORT_SYMBOL(invalidate_mapping_pages);
--
1.7.0.4