Message-ID: <8581ca937d4064b3cd138845d5bd418580d177da.1685506205.git.raghavendra.kt@amd.com>
Date:   Wed, 31 May 2023 09:55:26 +0530
From:   Raghavendra K T <raghavendra.kt@....com>
To:     <linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>
CC:     Ingo Molnar <mingo@...hat.com>,
        Peter Zijlstra <peterz@...radead.org>,
        "Mel Gorman" <mgorman@...e.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "David Hildenbrand" <david@...hat.com>, <rppt@...nel.org>,
        Juri Lelli <juri.lelli@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Bharata B Rao <bharata@....com>,
        Aithal Srikanth <sraithal@....com>,
        "kernel test robot" <oliver.sang@...el.com>,
        Raghavendra K T <raghavendra.kt@....com>
Subject: [RFC PATCH V3 1/1] sched/numa: Fix disjoint set vma scan regression

With the NUMA scan enhancements [1], only threads that have previously
accessed a vma are allowed to scan it.

While this significantly reduced system time overhead, there are corner
cases that genuinely need some relaxation. For example:

1) A concern raised by PeterZ: if there are N partition sets of vmas
belonging to tasks, unfairness in which threads are allowed to scan could
amplify the side effect of some of the vmas being left unscanned.

2) The reports below of an LKP numa01 benchmark regression.

Currently this is handled by allowing the first two scans unconditionally,
as indicated by mm->numa_scan_seq. This is imprecise, since for some
benchmarks vma scanning might itself start at numa_scan_seq > 2.

Solution:
Allow unconditional scanning of a task's vmas depending on vma size. This
is achieved by maintaining a per-vma scan counter, where

allowed_to_scan = scan_counter < 1 + (vma_size / scan_size) / 2

with vma_size and scan_size both in MB.
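
As a rough illustration, the check can be sketched in userspace C. It
follows the arithmetic in the vma_is_accessed() hunk of this patch, which
allows one more than half of vma_size / scan_size scans. The function names
scan_threshold() and scan_allowed() are hypothetical and exist only for
this sketch; sizes are in MB and scan_size_mb is assumed to be non-zero.

```c
#include <stdbool.h>

/*
 * Userspace sketch only: these helpers are hypothetical, not kernel API.
 * Both sizes are in MB, as in the vma_is_accessed() hunk of this patch.
 * The caller is assumed to pass a non-zero scan_size_mb.
 */
static unsigned int scan_threshold(unsigned int vma_size_mb,
				   unsigned int scan_size_mb)
{
	/* Scans needed to cover the whole VMA (integer division). */
	unsigned int total = vma_size_mb / scan_size_mb;

	/* Allow roughly half of them, and always at least one. */
	return 1 + (total >> 1);
}

static bool scan_allowed(unsigned int scan_counter,
			 unsigned int vma_size_mb,
			 unsigned int scan_size_mb)
{
	return scan_counter < scan_threshold(vma_size_mb, scan_size_mb);
}
```

For instance, assuming a scan size of 256 MB, a 3072 MB vma gets
1 + (12 >> 1) = 7 unconditional scans per reset period, while a vma smaller
than the scan size still gets one.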

Result:
numa01_THREAD_ALLOC result on 6.4.0-rc2 (that has numascan enhancement)
                        base-numascan   base            base+fix
real                    1m1.507s        1m23.259s       1m2.632s
user                    213m51.336s     251m46.363s     220m35.528s
sys                     3m3.397s        0m12.492s       2m41.393s

numa_hit                5615517         4560123         4963875
numa_local              5615505         4560024         4963700
numa_other              12              99              175
numa_pte_updates        1822797         493             1559111
numa_hint_faults        1307113         523             1469031
numa_hint_faults_local  612617          488             884829
numa_pages_migrated     694370          35              584202

Summary: Regression in base is recovered by allowing scanning as required.

[1] https://lore.kernel.org/lkml/cover.1677672277.git.raghavendra.kt@amd.com/T/#t

Fixes: fc137c0ddab2 ("sched/numa: enhance vma scanning logic")
Reported-by: Aithal Srikanth <sraithal@....com>
Reported-by: kernel test robot <oliver.sang@...el.com>
Closes: https://lore.kernel.org/lkml/db995c11-08ba-9abf-812f-01407f70a5d4@amd.com/T/
Signed-off-by: Raghavendra K T <raghavendra.kt@....com>
---
 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 31 ++++++++++++++++++++++++-------
 2 files changed, 25 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 306a3d1a0fa6..992e460a713e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -479,6 +479,7 @@ struct vma_numab_state {
 	unsigned long next_scan;
 	unsigned long next_pid_reset;
 	unsigned long access_pids[2];
+	unsigned int scan_counter;
 };
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..4e71fb58085b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2931,17 +2931,30 @@ static void reset_ptenuma_scan(struct task_struct *p)
 static bool vma_is_accessed(struct vm_area_struct *vma)
 {
 	unsigned long pids;
+	unsigned int vma_size;
+	unsigned int scan_threshold;
+	unsigned int scan_size;
+
+	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
+
+	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+		return true;
+
+	scan_size = READ_ONCE(sysctl_numa_balancing_scan_size);
+	/* vma size in MB */
+	vma_size = (vma->vm_end - vma->vm_start) >> 20;
+
+	/* Total scans needed to cover VMA */
+	scan_threshold = vma_size / scan_size;
+
 	/*
-	 * Allow unconditional access first two times, so that all the (pages)
-	 * of VMAs get prot_none fault introduced irrespective of accesses.
+	 * Allow the scanning of half of disjoint set's VMA to induce
+	 * prot_none fault irrespective of accesses.
 	 * This is also done to avoid any side effect of task scanning
 	 * amplifying the unfairness of disjoint set of VMAs' access.
 	 */
-	if (READ_ONCE(current->mm->numa_scan_seq) < 2)
-		return true;
-
-	pids = vma->numab_state->access_pids[0] | vma->numab_state->access_pids[1];
-	return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids);
+	scan_threshold = 1 + (scan_threshold >> 1);
+	return (vma->numab_state->scan_counter < scan_threshold);
 }
 
 #define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
@@ -3058,6 +3071,8 @@ static void task_numa_work(struct callback_head *work)
 			/* Reset happens after 4 times scan delay of scan start */
 			vma->numab_state->next_pid_reset =  vma->numab_state->next_scan +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+
+			vma->numab_state->scan_counter = 0;
 		}
 
 		/*
@@ -3084,6 +3099,8 @@ static void task_numa_work(struct callback_head *work)
 			vma->numab_state->access_pids[1] = 0;
 		}
 
+		vma->numab_state->scan_counter++;
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
-- 
2.34.1
