Message-Id: <20240630150032.533210-1-yujie.liu@intel.com>
Date: Sun, 30 Jun 2024 23:00:32 +0800
From: Yujie Liu <yujie.liu@...el.com>
To: Raghavendra K T <raghavendra.kt@....com>,
Mel Gorman <mgorman@...hsingularity.net>,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>
Cc: Chen Yu <yu.chen.surf@...il.com>,
Tim Chen <tim.c.chen@...el.com>,
linux-kernel@...r.kernel.org,
Xiaoping Zhou <xiaoping.zhou@...el.com>,
Chen Yu <yu.c.chen@...el.com>,
Yujie Liu <yujie.liu@...el.com>
Subject: [PATCH] sched/numa: scan the vma if it has not been scanned for a while
Problem statement:
Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
NUMA vma scan overhead has been reduced a lot. Meanwhile, it can be a
double-edged sword: reducing the vma scans also produces less NUMA page
fault information, and the insufficient information makes it harder for
the NUMA balancer to make decisions. Later,
commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
mm numa_scan_seq based unconditional scan") were found to bring back
part of the performance.

Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
a long duration of remote NUMA node reads was observed via PMU events:
a few cores had ~500MB/s of remote memory access for ~20 seconds,
causing high core-to-core variance and a performance penalty. After
investigation, it was found that many vmas are skipped due to the
active PID check. According to the trace events, in most cases
vma_is_accessed() returns false because the access history stored in
the pids_active array has been cleared.
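
For reference, the pre-patch logic in vma_is_accessed() looks roughly
like the following (a simplified sketch of the current code in
kernel/sched/fair.c; trace events and some details are abbreviated and
may not match the source exactly):

	static bool vma_is_accessed(struct mm_struct *mm,
				    struct vm_area_struct *vma)
	{
		unsigned long pids;

		/* The first 2 scans are unconditionally treated as accessed. */
		if (READ_ONCE(mm->numa_scan_seq) -
		    vma->numab_state->start_scan_seq < 2)
			return true;

		/*
		 * Otherwise, scan only if the current task's hashed PID is
		 * found in the recent access history. Both pids_active[]
		 * slots are cleared periodically, so a task that slept
		 * through the reset window loses its record here.
		 */
		pids = vma->numab_state->pids_active[0] |
		       vma->numab_state->pids_active[1];
		if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
			return true;

		/* Complete a partial scan regardless of PID activity. */
		if (mm->numa_scan_offset > vma->vm_start)
			return true;

		return false;
	}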

Proposal:
The main idea is to adjust vma_is_accessed() so that it returns true
more easily.
Solution 1 is to extend pids_active[] from 2 to N, as proposed by
Raghavendra [1]. How to choose N is still under investigation.
Solution 2 is to compare the difference between mm->numa_scan_seq and
vma->numab_state->prev_scan_seq. If the difference exceeds a threshold,
scan the vma. This patch implements solution 2.

Solution 2 especially helps cases where there are a limited number of
shared VMAs, e.g., the process-based SPECcpu. Without solution 2, if a
single process accesses the vma at the beginning and then sleeps for a
long time (during which the pids_active array gets cleared), it never
gets a chance to set prot_none again after it is woken up, because only
the first 2 scans are unconditionally regarded as accessed:
   (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
and no other thread within the task can help set prot_none.
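
A hypothetical timeline makes the starvation concrete: a single-threaded
process touches the vma while numa_scan_seq is 0 and 1, so the two
unconditional scans set prot_none, and then it sleeps past the
pids_active[] reset window. When it wakes at, say, numa_scan_seq = 40,
the 2-scan window check is long false, the PID check fails because the
history was cleared, and there is no sibling thread to repopulate it, so
the vma is skipped indefinitely. With solution 2, the check added by
this patch,

	if (READ_ONCE(mm->numa_scan_seq) >
	    (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
		return true;

fires as soon as the scan sequence has advanced past the last scan of
this vma by more than the thread count (here 40 > prev_scan_seq + 1),
so the vma gets scanned again.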

Raghavendra helped test the current patch and got positive results on
an AMD platform:

autonumabench NUMA01
                                base              patched
Amean     syst-NUMA01     194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01     324.86 (   0.00%)      315.58 *   2.86%*

Duration User         380345.36   368252.04
Duration System         1358.89     1156.23
Duration Elapsed        2277.45     2213.25

autonumabench NUMA02
Amean     syst-NUMA02       1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02       3.50 (   0.00%)        3.56 *  -1.84%*

Duration User           1513.23     1575.48
Duration System            8.33        8.13
Duration Elapsed          28.59       29.71

kernbench
Amean     user-256      22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256       7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256        159.01 (   0.00%)      158.17 *   0.53%*

Duration User          68816.41    67615.74
Duration System        21873.94    22848.08
Duration Elapsed         506.66      504.55

On an Intel 256 CPUs/2 Sockets system, the autonuma benchmark also shows
some improvements:

                                      v6.10-rc5            v6.10-rc5
                                                              +patch
Amean     syst-NUMA01                  245.85 (   0.00%)     230.84 *   6.11%*
Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)     191.86 *   6.53%*
Amean     syst-NUMA02                   18.57 (   0.00%)      18.09 *   2.58%*
Amean     syst-NUMA02_SMT                2.63 (   0.00%)       2.54 *   3.47%*
Amean     elsp-NUMA01                  517.17 (   0.00%)     526.34 *  -1.77%*
Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)     100.59 *  -0.67%*
Amean     elsp-NUMA02                   15.81 (   0.00%)      15.72 *   0.59%*
Amean     elsp-NUMA02_SMT               13.23 (   0.00%)      12.89 *   2.53%*

                       v6.10-rc5   v6.10-rc5
                                      +patch
Duration User         1064010.16  1075416.23
Duration System          3307.64     3104.66
Duration Elapsed         4537.54     4604.73
Link: https://lore.kernel.org/lkml/88d16815ef4cc2b6c08b4bb713b25421b5589bc7.1710829750.git.raghavendra.kt@amd.com/ #1
Reported-by: Xiaoping Zhou <xiaoping.zhou@...el.com>
Co-developed-by: Chen Yu <yu.c.chen@...el.com>
Signed-off-by: Chen Yu <yu.c.chen@...el.com>
Signed-off-by: Yujie Liu <yujie.liu@...el.com>
Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@....com>
---
kernel/sched/fair.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a5b1ae0aa55..2b74fc06fb95 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3188,6 +3188,14 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 		return true;
 	}
 
+	/*
+	 * This vma has not been scanned for a while, and only a limited
+	 * number of threads within the current task can help mark it active.
+	 */
+	if (READ_ONCE(mm->numa_scan_seq) >
+	    (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
+		return true;
+
 	return false;
 }
--
2.34.1