Message-ID: <b0a8f3490b491d4fd003c3e0493e940afaea5f2c.1684228065.git.raghavendra.kt@amd.com>
Date:   Tue, 16 May 2023 14:49:32 +0530
From:   Raghavendra K T <raghavendra.kt@....com>
To:     <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
CC:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Mel Gorman" <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "David Hildenbrand" <david@...hat.com>, <rppt@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Bharata B Rao <bharata@....com>,
        Aithal Srikanth <sraithal@....com>,
        "kernel test robot" <oliver.sang@...el.com>,
        Raghavendra K T <raghavendra.kt@....com>
Subject: [RFC PATCH V2 1/1] sched/numa: Fix disjoint set vma scan regression

With the numa scan enhancements [1], only the threads which had previously
accessed a VMA are allowed to scan it.

While this significantly reduced system time overhead, there were corner
cases which genuinely need some relaxation. For example:

1) A concern raised by PeterZ: if there are N disjoint (partitioned) sets of
VMAs belonging to tasks, then unfairness in deciding which of these threads
are allowed to scan could amplify the side effect of some of the VMAs being
left unscanned.

2) The LKP numa01 benchmark regression reported below.

Currently this is handled by allowing the first two scans unconditionally,
as indicated by mm->numa_scan_seq. This is imprecise, since for some
benchmarks VMA scanning might itself start only at numa_scan_seq > 2.

Solution:
Allow unconditional scanning of a task's VMAs depending on the VMA size.
This is achieved by maintaining a per-VMA scan counter, where

  allowed_to_scan := (scan_counter < vma_size / scan_size)

For long-running tasks, the counter is periodically reset so that this
unconditional scanning window is renewed.
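
As a rough sketch of the gating logic (it mirrors the threshold
computation in vma_is_accessed() in the patch below, but the helper name
and parameter names here are illustrative only, not the kernel
identifiers):

  /*
   * Illustrative sketch only: mirrors vma_is_accessed() in the patch.
   * Sizes are in MB, as in the patch; assumes kernel types (bool).
   */
  static bool allowed_to_scan(unsigned int scan_counter,
                              unsigned int vma_size_mb,
                              unsigned int scan_size_mb)
  {
          /* Total scan passes needed to cover the whole VMA */
          unsigned int scan_threshold = vma_size_mb / scan_size_mb;

          /* Admit roughly half of those passes unconditionally */
          scan_threshold = 1 + (scan_threshold >> 1);

          return scan_counter <= scan_threshold;
  }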

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")

Result:
numa01_THREAD_ALLOC results on 6.4.0-rc1 (which includes the numascan enhancement):
                base-numascan           base                    base+fix
real            1m3.025s                1m24.163s               1m3.551s
user            213m44.232s             251m3.638s              219m55.662s
sys             6m26.598s               0m13.056s               2m35.767s

                        base-numascan   base            base+fix
numa_hit                5478165         4395752         4907431
numa_local              5478103         4395366         4907044
numa_other                   62             386             387
numa_pte_updates        1989274           11606         1265014
numa_hint_faults        1756059             515         1135804
numa_hint_faults_local   971500             486          558076
numa_pages_migrated      784211              29          577728

Summary: The regression in base is recovered by allowing scanning as required.

[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t

Reported-by: Aithal Srikanth <sraithal@....com>
Reported-by: kernel test robot <oliver.sang@...el.com>
Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
Signed-off-by: Raghavendra K T <raghavendra.kt@....com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 41 ++++++++++++++++++++++++++++++++--------
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int scan_counter;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..2c3e17e7fc2f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,20 +2931,34 @@ static void reset_ptenuma_scan(struct task_struct *p)
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
+	unsigned int vma_size;
+	unsigned int scan_threshold;
+	unsigned int scan_size;
+
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
+	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+		return true;
+
+	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+	/* vma size in MB */
+	vma_size = (vma->vm_end - vma->vm_start) >> 20;
+
+	/* Total scans needed to cover VMA */
+	scan_threshold = (vma_size / scan_size);
+
 	/*
-	 * Allow unconditional access first two times, so that all the (pages)
-	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * Allow scanning of half of a disjoint set's VMAs to induce
+	 * prot_none faults, irrespective of accesses.
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
-		return true;
-
-	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+	scan_threshold = 1 + (scan_threshold >> 1);
+	return (READ_ONCE(vma->numab_state->scan_counter) <= scan_threshold);
 }
 
-#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+#define VMA_PID_RESET_PERIOD		(4 * sysctl_numa_balancing_scan_delay)
+#define DISJOINT_VMA_SCAN_RENEW_THRESH	16
 
 /*
  * The expensive part of numa migration is done from task_work context.
@@ -3058,6 +3072,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
 		}
 
 		/*
@@ -3068,6 +3084,13 @@ static void task_numa_work(struct callback_head *work)
 						vma->numab_state->next_scan))
 			continue;
 
+		/*
+		 * For long running tasks, renew the disjoint vma scanning
+		 * periodically.
+		 */
+		if (mm->numa_scan_seq && !(mm->numa_scan_seq % DISJOINT_VMA_SCAN_RENEW_THRESH))
+			WRITE_ONCE(vma->numab_state->scan_counter, 0);
+
 		/* Do not scan the VMA if task has not accessed */
 		if (!vma_is_accessed(vma))
 			continue;
@@ -3083,6 +3106,8 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[0] = READ_ONCE(vma->numab_state->access_pids[1]);
 			vma->numab_state->access_pids[1] = 0;
 		}
+		WRITE_ONCE(vma->numab_state->scan_counter,
+				READ_ONCE(vma->numab_state->scan_counter) + 1);
 
 		do {
 			start = max(start, vma->vm_start);
-- 
2.34.1
