linux-kernel - Re: [PATCH 0/18] Basic scheduler support for automatic NUMA balancing V5

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130725103620.GM27075@twins.programming.kicks-ass.net>
Date:	Thu, 25 Jul 2013 12:36:20 +0200
From:	Peter Zijlstra <peterz@...radead.org>
To:	Mel Gorman <mgorman@...e.de>
Cc:	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	Ingo Molnar <mingo@...nel.org>,
	Andrea Arcangeli <aarcange@...hat.com>,
	Johannes Weiner <hannes@...xchg.org>,
	Linux-MM <linux-mm@...ck.org>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 0/18] Basic scheduler support for automatic NUMA
 balancing V5


Subject: sched, numa: Break stuff..
From: Peter Zijlstra <peterz@...radead.org>
Date: Tue Jul 23 14:58:41 CEST 2013

This patch is mostly a comment in code. I don't believe the current
scan period adjustment scheme can work properly nor do I think it a
good idea to ratelimit the numa faults as a whole based on migration.

Reasons are in the modified comments...

Signed-off-by: Peter Zijlstra <peterz@...radead.org>
---
 kernel/sched/fair.c |   41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1108,7 +1108,6 @@ static void task_numa_placement(struct t
 
 	/* Preferred node as the node with the most faults */
 	if (max_faults && max_nid != p->numa_preferred_nid) {
-		int old_migrate_seq = p->numa_migrate_seq;
 
 		/* Queue task on preferred node if possible */
 		p->numa_preferred_nid = max_nid;
@@ -1116,14 +1115,19 @@ static void task_numa_placement(struct t
 		numa_migrate_preferred(p);
 
 		/*
+		int old_migrate_seq = p->numa_migrate_seq;
+		 *
 		 * If preferred nodes changes frequently then the scan rate
 		 * will be continually high. Mitigate this by increasing the
 		 * scan rate only if the task was settled.
-		 */
+		 *
+		 * APZ: disabled because we don't lower it again :/
+		 *
 		if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
 			p->numa_scan_period = max(p->numa_scan_period >> 1,
 					task_scan_min(p));
 		}
+		 */
 	}
 }
 
@@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
 	/*
 	 * If pages are properly placed (did not migrate) then scan slower.
 	 * This is reset periodically in case of phase changes
-	 */
-        if (!migrated)
+	 *
+	 * APZ: it seems to me that one can get a ton of !migrated faults;
+	 * consider the scenario where two threads fight over a shared memory
+	 * segment. We'll win half the faults, half of that will be local, half
+	 * of that will be remote. This means we'll see 1/4-th of the total
+	 * memory being !migrated. Using a fixed increment will completely
+	 * flatten the scan speed for a sufficiently large workload. Another
+	 * scenario is due to that migration rate limit.
+	 *
+        if (!migrated) {
 		p->numa_scan_period = min(p->numa_scan_period_max,
 			p->numa_scan_period + jiffies_to_msecs(10));
+	}
+	 */
 
 	task_numa_placement(p);
 
@@ -1216,12 +1230,15 @@ void task_numa_work(struct callback_head
 	if (p->flags & PF_EXITING)
 		return;
 
+#if 0
 	/*
 	 * We do not care about task placement until a task runs on a node
 	 * other than the first one used by the address space. This is
 	 * largely because migrations are driven by what CPU the task
 	 * is running on. If it's never scheduled on another node, it'll
 	 * not migrate so why bother trapping the fault.
+	 *
+	 * APZ: seems like a bad idea for pure shared memory workloads.
 	 */
 	if (mm->first_nid == NUMA_PTE_SCAN_INIT)
 		mm->first_nid = numa_node_id();
@@ -1233,6 +1250,7 @@ void task_numa_work(struct callback_head
 
 		mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
 	}
+#endif
 
 	/*
 	 * Enforce maximal scan/migration frequency..
@@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
 	 * Do not set pte_numa if the current running node is rate-limited.
 	 * This loses statistics on the fault but if we are unwilling to
 	 * migrate to this node, it is less likely we can do useful work
-	 */
+	 *
+	 * APZ: seems like a bad idea; even if this node can't migrate anymore
+	 * other nodes might and we want up-to-date information to do balance
+	 * decisions.
+	 *
 	if (migrate_ratelimited(numa_node_id()))
 		return;
+	 */
 
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
@@ -1297,10 +1320,10 @@ void task_numa_work(struct callback_head
 
 out:
 	/*
-	 * It is possible to reach the end of the VMA list but the last few VMAs are
-	 * not guaranteed to the vma_migratable. If they are not, we would find the
-	 * !migratable VMA on the next scan but not reset the scanner to the start
-	 * so check it now.
+	 * It is possible to reach the end of the VMA list but the last few
+	 * VMAs are not guaranteed to the vma_migratable. If they are not, we
+	 * would find the !migratable VMA on the next scan but not reset the
+	 * scanner to the start so check it now.
 	 */
 	if (vma)
 		mm->numa_scan_offset = start;
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/