linux-kernel - Re: [PATCH 08/11] Add /proc trigger for memory compaction

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Fri, 26 Mar 2010 10:46:18 +0000
From:	Mel Gorman <mel@....ul.ie>
To:	Andrew Morton <akpm@...ux-foundation.org>
Cc:	Andrea Arcangeli <aarcange@...hat.com>,
	Christoph Lameter <cl@...ux-foundation.org>,
	Adam Litke <agl@...ibm.com>, Avi Kivity <avi@...hat.com>,
	David Rientjes <rientjes@...gle.com>,
	Minchan Kim <minchan.kim@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	Rik van Riel <riel@...hat.com>, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org
Subject: Re: [PATCH 08/11] Add /proc trigger for memory compaction

On Wed, Mar 24, 2010 at 01:33:51PM -0700, Andrew Morton wrote:
> On Tue, 23 Mar 2010 12:25:43 +0000
> Mel Gorman <mel@....ul.ie> wrote:
> 
> > This patch adds a proc file /proc/sys/vm/compact_memory. When an arbitrary
> > value is written to the file, all zones are compacted. The expected user
> > of such a trigger is a job scheduler that prepares the system before the
> > target application runs.
> > 
> >
> > ...
> >
> > +/* This is the entry point for compacting all nodes via /proc/sys/vm */
> > +int sysctl_compaction_handler(struct ctl_table *table, int write,
> > +			void __user *buffer, size_t *length, loff_t *ppos)
> > +{
> > +	if (write)
> > +		return compact_nodes();
> > +
> > +	return 0;
> > +}
> 
> Neato.  When I saw the overall description I was afraid that this stuff
> would be fiddling with kernel threads.
> 

Not yet anyway. It has been floated in the past to have a kcompactd
similar to kswapd but right now there is no justification for it. Like
other suggestions made in the past, it has potential but needs data to
justify.

> The underlying compaction code can at times cause rather large amounts
> of memory to be put onto private lists, so it's lost to the rest of the
> kernel.  What happens if 10000 processes simultaneously write to this
> thing?  It's root-only so I guess the answer is "root becomes unemployed".
> 

Well, root becomes unemployed but I shouldn't be supplying the rope.
Lets keep min_free_kbytes as the "fall off the cliff" tunable. I added
too_many_isolated()-like logic and also handling of fatal signals.

> I fear that the overall effect of this feature is that people will come
> up with ghastly hacks which keep on poking this tunable as a workaround
> for some VM shortcoming.  This will lead to more shortcomings, and
> longer-lived ones.
> 

That would be very unfortunate and also a self-defeating measure in the short
run, let alone the long run.  I consider the tunable to be more like the
"drop_caches" tunable. It can be used for good or bad and all the bad uses
kick you in the ass because it does not resolve the underlying problem and
is expensive to use.

I had three legit uses in mind for it

1. Batch-systems that compact memory before a job is scheduler to reduce
   start-up time of applications using huge pages. Depending on their
   setup, sysfs might be a better fit for them

2. Illustrate a bug in direct compaction. i.e. I'd get a report on some
   allocation failure that was consistent but when the tunable is poked,
   it works perfectly

3. Development uses. Measuring worst-case scenarios for compaction (rare
   obviously), stress testing compaction to try catch bugs in migration
   and measuring how effective compaction currently is.

Do these justify the existance of the tunable or is the risk of abuse
too high?

This is what the isolate logic looks like


diff --git a/mm/compaction.c b/mm/compaction.c
index e0e8100..a6a6958 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -13,6 +13,7 @@
 #include <linux/mm_inline.h>
 #include <linux/sysctl.h>
 #include <linux/sysfs.h>
+#include <linux/backing-dev.h>
 #include "internal.h"
 
 /*
@@ -197,6 +198,20 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
 	__mod_zone_page_state(zone, NR_ISOLATED_FILE, cc->nr_file);
 }
 
+/* Similar to reclaim, but different enough that they don't share logic */
+static int too_many_isolated(struct zone *zone)
+{
+
+	unsigned long inactive, isolated;
+
+	inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
+					zone_page_state(zone, NR_INACTIVE_ANON);
+	isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
+					zone_page_state(zone, NR_ISOLATED_ANON);
+
+	return isolated > inactive;
+}
+
 /*
  * Isolate all pages that can be migrated from the block pointed to by
  * the migrate scanner within compact_control.
@@ -223,6 +238,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
 		return 0;
 	}
 
+	/* Do not isolate the world */
+	while (unlikely(too_many_isolated(zone))) {
+		congestion_wait(BLK_RW_ASYNC, HZ/10);
+
+		if (fatal_signal_pending(current))
+			return 0;
+	}
+
 	/* Time to isolate some pages for migration */
 	spin_lock_irq(&zone->lru_lock);
 	for (; low_pfn < end_pfn; low_pfn++) {
@@ -309,6 +332,9 @@ static int compact_finished(struct zone *zone,
 	unsigned int order;
 	unsigned long watermark = low_wmark_pages(zone) + (1 << cc->order);
 
+	if (fatal_signal_pending(current))
+		return COMPACT_PARTIAL;
+
 	/* Compaction run completes if the migrate and free scanner meet */
 	if (cc->free_pfn <= cc->migrate_pfn)
 		return COMPACT_COMPLETE;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/