Message-ID: <4BD57213.7060207@linux.vnet.ibm.com>
Date:	Mon, 26 Apr 2010 12:59:31 +0200
From:	Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
To:	Rik van Riel <riel@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Nick Piggin <npiggin@...e.de>, gregkh@...ell.com
CC:	Mel Gorman <mel@....ul.ie>, linux-mm@...ck.org,
	Chris Mason <chris.mason@...cle.com>,
	Jens Axboe <jens.axboe@...cle.com>,
	linux-kernel@...r.kernel.org, Corrado Zoccolo <czoccolo@...il.com>
Subject: [PATCH][RFC] mm: make working set portion that is protected
 tunable v2

Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>

*updates in v2*
- use do_div

This patch creates a knob to help users whose workloads suffer from the
fixed 1:1 active/inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides a tuning mechanism for other users who want an even bigger
working set to be protected.

To be honest, the best solution would be to allow a system that is not using
its working set to regain that memory eventually, without drawbacks for the
scenarios the protection was implemented for, e.g. UI interactivity while
copying a lot of data. But so far nobody has had an idea how to implement
that behaviour.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
 "In the worse scenario, it could waste half the memory that could
 otherwise be used for readahead buffer and to prevent thrashing, in a
 server serving large datasets that are hardly reused, but still slowly
 builds up its active list during the long uptime (think about a slowly
 performance downgrade that can be fixed by a crude dropcache action).

 That said, the actual performance degradation could be much smaller -
 say 15% - all memories are not equal."

We have now identified a case with up to 60% lower throughput, therefore this
patch tries to provide a gentler interface than drop_caches to help a system
stuck in this situation.

In discussion with Rik van Riel and Johannes Weiner we concluded that there
are cases that want the current "save 50%" for the working set all the time
and others that would benefit from protecting only a smaller amount.

Ultimately no "carved in stone" in-kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre 56e49d21
- x%  - allow customizing the system to someone's needs

Based on our experiments the suggested default in this patch is 25%, but if
preferred I'm fine with keeping 50% and letting admins/distros adapt as needed.

Signed-off-by: Christian Ehrhardt <ehrhardt@...ux.vnet.ibm.com>
---

[diffstat]
 Documentation/sysctl/vm.txt |   10 ++++++++++
 include/linux/mm.h          |    2 ++
 kernel/sysctl.c             |    9 +++++++++
 mm/memcontrol.c             |    9 ++++++---
 mm/vmscan.c                 |   17 ++++++++++++++---
 5 files changed, 41 insertions(+), 6 deletions(-)

[diff]
Index: linux-2.6/Documentation/sysctl/vm.txt
===================================================================
--- linux-2.6.orig/Documentation/sysctl/vm.txt	2010-04-21 06:32:23.000000000 +0200
+++ linux-2.6/Documentation/sysctl/vm.txt	2010-04-21 07:24:35.000000000 +0200
@@ -18,6 +18,7 @@
 
 Currently, these files are in /proc/sys/vm:
 
+- active_inactive_ratio
 - block_dump
 - dirty_background_bytes
 - dirty_background_ratio
@@ -57,6 +58,15 @@
 
 ==============================================================
 
+active_inactive_ratio
+
+The kernel tries to protect the active working set. Therefore a portion of the
+file pages is protected, meaning they are omitted when evicting pages until
+this ratio is reached. This tunable represents that ratio in percent (0 to
+100) and specifies the protected part of the file pages. The default is 25.
+
+==============================================================
+
 block_dump
 
 block_dump enables block I/O debugging when set to a nonzero value. More
Index: linux-2.6/kernel/sysctl.c
===================================================================
--- linux-2.6.orig/kernel/sysctl.c	2010-04-21 06:33:43.000000000 +0200
+++ linux-2.6/kernel/sysctl.c	2010-04-21 07:26:35.000000000 +0200
@@ -1271,6 +1271,15 @@
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "active_inactive_ratio",
+		.data		= &sysctl_active_inactive_ratio,
+		.maxlen		= sizeof(sysctl_active_inactive_ratio),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+		.extra2		= &one_hundred,
+	},
 
 /*
  * NOTE: do not add new entries to this table unless you have read
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/memcontrol.c	2010-04-26 12:45:46.000000000 +0200
@@ -893,12 +893,15 @@
 int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 {
 	unsigned long active;
-	unsigned long inactive;
+	u64 activetoprotect;
 
-	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
+	activetoprotect = active
+		+ mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
+	activetoprotect *= sysctl_active_inactive_ratio;
+	do_div(activetoprotect, 100);	/* divides in place, returns remainder */
 
-	return (active > inactive);
+	return (active > activetoprotect);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c	2010-04-21 06:31:29.000000000 +0200
+++ linux-2.6/mm/vmscan.c	2010-04-26 12:50:47.000000000 +0200
@@ -1459,14 +1459,25 @@
 	return low;
 }
 
+/*
+ * sysctl_active_inactive_ratio
+ *
+ * Defines the portion of file pages that is protected as the active working
+ * set. The value represents the percentage that will be protected.
+ */
+int sysctl_active_inactive_ratio __read_mostly = 25;
+
 static int inactive_file_is_low_global(struct zone *zone)
 {
-	unsigned long active, inactive;
+	unsigned long active;
+	u64 activetoprotect;
 
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
-	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+	activetoprotect = active + zone_page_state(zone, NR_INACTIVE_FILE);
+	activetoprotect *= sysctl_active_inactive_ratio;
+	do_div(activetoprotect, 100);	/* divides in place, returns remainder */
+	return (active > activetoprotect);
 
-	return (active > inactive);
 }
 
 /**
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h	2010-04-21 09:02:37.000000000 +0200
+++ linux-2.6/include/linux/mm.h	2010-04-21 09:02:51.000000000 +0200
@@ -1467,5 +1467,7 @@
 
 extern void dump_page(struct page *page);
 
+extern int sysctl_active_inactive_ratio;
+
 #endif /* __KERNEL__ */
 #endif /* _LINUX_MM_H */