linux-kernel - [tip:sched/core] sched/numa: Limit the amount of virtual memory scanned in task_numa

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <tip-4620f8c1fda2af4ccbd11e194e2dd785f7d7f279@git.kernel.org>
Date:	Fri, 18 Sep 2015 01:48:30 -0700
From:	tip-bot for Rik van Riel <tipbot@...or.com>
To:	linux-tip-commits@...r.kernel.org
Cc:	hpa@...or.com, tglx@...utronix.de, peterz@...radead.org,
	mingo@...nel.org, linux-kernel@...r.kernel.org,
	aarcange@...hat.com, jstancek@...hat.com,
	torvalds@...ux-foundation.org, mgorman@...e.de, riel@...hat.com
Subject: [tip:sched/core] sched/numa:
  Limit the amount of virtual memory scanned in task_numa_work()

Commit-ID:  4620f8c1fda2af4ccbd11e194e2dd785f7d7f279
Gitweb:     http://git.kernel.org/tip/4620f8c1fda2af4ccbd11e194e2dd785f7d7f279
Author:     Rik van Riel <riel@...hat.com>
AuthorDate: Fri, 11 Sep 2015 09:00:27 -0400
Committer:  Ingo Molnar <mingo@...nel.org>
CommitDate: Fri, 18 Sep 2015 09:23:14 +0200

sched/numa: Limit the amount of virtual memory scanned in task_numa_work()

Currently task_numa_work() scans up to numa_balancing_scan_size_mb worth
of memory per invocation, but only counts memory areas that have at
least one PTE that is still present and not marked for numa hint faulting.

It will skip over arbitarily large amounts of memory that are either
unused, full of swap ptes, or full of PTEs that were already marked
for NUMA hint faults but have not been faulted on yet.

This can cause excessive amounts of CPU use, due to there being
essentially no upper limit on the scan rate of very large processes
that are not yet in a phase where they are actively accessing old
memory pages (eg. they are still initializing their data).

Avoid that problem by placing an upper limit on the amount of virtual
memory that task_numa_work() scans in each invocation. This can be a
higher limit than "pages", to ensure the task still skips over unused
areas fairly quickly.

While we are here, also fix the "nr_pte_updates" logic, so it only
counts page ranges with ptes in them.

Reported-by: Andrea Arcangeli <aarcange@...hat.com>
Reported-by: Jan Stancek <jstancek@...hat.com>
Signed-off-by: Rik van Riel <riel@...hat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@...radead.org>
Acked-by: Mel Gorman <mgorman@...e.de>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Peter Zijlstra <peterz@...radead.org>
Cc: Thomas Gleixner <tglx@...utronix.de>
Link: http://lkml.kernel.org/r/20150911090027.4a7987bd@annuminas.surriel.com
Signed-off-by: Ingo Molnar <mingo@...nel.org>
---
 kernel/sched/fair.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9176f7c..1bfad9f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2157,7 +2157,7 @@ void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages;
+	long pages, virtpages;
 
 	WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
 
@@ -2203,9 +2203,11 @@ void task_numa_work(struct callback_head *work)
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
 		return;
 
+
 	down_read(&mm->mmap_sem);
 	vma = find_vma(mm, start);
 	if (!vma) {
@@ -2240,18 +2242,22 @@ void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			nr_pte_updates += change_prot_numa(vma, start, end);
+			nr_pte_updates = change_prot_numa(vma, start, end);
 
 			/*
-			 * Scan sysctl_numa_balancing_scan_size but ensure that
-			 * at least one PTE is updated so that unused virtual
-			 * address space is quickly skipped.
+			 * Try to scan sysctl_numa_balancing_size worth of
+			 * hpages that have at least one present PTE that
+			 * is not already pte-numa. If the VMA contains
+			 * areas that are unused or already full of prot_numa
+			 * PTEs, scan up to virtpages, to skip through those
+			 * areas faster.
 			 */
 			if (nr_pte_updates)
 				pages -= (end - start) >> PAGE_SHIFT;
+			virtpages -= (end - start) >> PAGE_SHIFT;
 
 			start = end;
-			if (pages <= 0)
+			if (pages <= 0 || virtpages <= 0)
 				goto out;
 
 			cond_resched();
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/