linux-kernel - Re: [PATCH] sched/numa: scan the vma if it has not been scanned for a while

Message-ID: <ZpUbVK/xgsrGuXUP@chenyu5-mobl2>
Date: Mon, 15 Jul 2024 20:51:32 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Peter Zijlstra <peterz@...radead.org>, Mel Gorman
	<mgorman@...hsingularity.net>
CC: Raghavendra K T <raghavendra.kt@....com>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Chen Yu <yu.chen.surf@...il.com>, Tim Chen
	<tim.c.chen@...el.com>, <linux-kernel@...r.kernel.org>, Xiaoping Zhou
	<xiaoping.zhou@...el.com>
Subject: Re: [PATCH] sched/numa: scan the vma if it has not been scanned for
 a while

On 2024-06-30 at 23:00:32 +0800, Yujie Liu wrote:
> Problem statement:
> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> NUMA vma scan overhead has been reduced a lot. However, this is a
> double-edged sword: scanning fewer vmas generates less NUMA page fault
> information, and the insufficient information makes it harder for the
> NUMA balancer to make decisions. Later,
> commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
> regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
> mm numa_scan_seq based unconditional scan") were found to bring back part
> of the performance.
> 
> Recently, when running SPECcpu omnetpp_r on a 320 CPUs/2 Sockets system,
> a long period of remote NUMA node reads was observed via PMU events:
> a few cores had ~500MB/s of remote memory access for ~20 seconds.
> This causes high core-to-core variance and a performance penalty. After
> investigation, it was found that many vmas are skipped due to the active
> PID check. According to the trace events, in most cases vma_is_accessed()
> returns false because the access history stored in the pids_active
> array has been cleared.
> 
> Proposal:
> The main idea is to adjust vma_is_accessed() so that it returns true more
> easily.
> 
> Solution 1 is to extend pids_active[] from 2 entries to N, which was
> proposed by Raghavendra[1]. How to choose N is still under investigation.
> 
> Solution 2 is to compare the diff between mm->numa_scan_seq and
> vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
> scan the vma.
> 
> Solution 2 especially helps cases where there is a limited number of
> shared VMAs, e.g. the process-based SPECcpu. Without solution 2, if a
> single process accesses the vma at the beginning and then sleeps for a
> long time (so that the pids_active array is cleared), then after this
> process is woken up it never gets another chance to set prot_none,
> because only the first 2 scans are regarded as accessed:
> (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
> and no other threads within the task can help set prot_none.
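
For readers following along, here is a minimal userspace sketch of the
check order described above. It is an illustration, not the patch itself:
struct vma_numab_model, pid_bit(), rotate_pids_window() and the
SCAN_SEQ_THRESHOLD value are hypothetical names and numbers chosen for the
example, and the modulo "hash" only stands in for whatever PID hashing the
kernel uses.

```c
#include <stdbool.h>

/* Hypothetical threshold; the message does not state the real value. */
#define SCAN_SEQ_THRESHOLD 4

/* Simplified stand-in for the per-vma NUMA balancing state. */
struct vma_numab_model {
	unsigned long pids_active[2]; /* hashed PIDs seen in the two most recent windows */
	int start_scan_seq;           /* mm scan sequence when this state was created */
	int prev_scan_seq;            /* mm scan sequence when the vma was last scanned */
};

/* Illustrative hash: map a PID to one bit of an unsigned long. */
static unsigned int pid_bit(unsigned int pid)
{
	return pid % (8 * sizeof(unsigned long));
}

/* Model of the periodic reset: the oldest window is dropped. */
static void rotate_pids_window(struct vma_numab_model *s)
{
	s->pids_active[0] = s->pids_active[1];
	s->pids_active[1] = 0;
}

static bool vma_is_accessed_model(const struct vma_numab_model *s,
				  int mm_scan_seq, unsigned int pid)
{
	unsigned long pids;

	/* The first two scans after state creation always count as accessed. */
	if (mm_scan_seq - s->start_scan_seq < 2)
		return true;

	/* The task touched this vma in one of the two recent windows. */
	pids = s->pids_active[0] | s->pids_active[1];
	if (pids & (1UL << pid_bit(pid)))
		return true;

	/* Solution 2: the vma has not been scanned for a while, so scan it. */
	if (mm_scan_seq - s->prev_scan_seq > SCAN_SEQ_THRESHOLD)
		return true;

	return false;
}
```

In this model, a task that touched the vma once and then slept through two
resets has no bit left in pids_active, so only the last check can make the
vma eligible for scanning again.
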
> 
> Raghavendra helped test the current patch and got positive results
> on an AMD platform:
> 
> autonumabench NUMA01
>                             base                  patched
> Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
> Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
> 
> Duration User      380345.36   368252.04
> Duration System      1358.89     1156.23
> Duration Elapsed     2277.45     2213.25
> 
> autonumabench NUMA02
> 
> Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
> Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
> 
> Duration User        1513.23     1575.48
> Duration System         8.33        8.13
> Duration Elapsed       28.59       29.71
> 
> kernbench
> 
> Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
> Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
> Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
> 
> Duration User       68816.41    67615.74
> Duration System     21873.94    22848.08
> Duration Elapsed      506.66      504.55
> 
> 
> Intel 256 CPUs/2 Sockets:
> The autonuma benchmark also shows some improvements:
> 
>                                                v6.10-rc5              v6.10-rc5
>                                                                          +patch
> Amean     syst-NUMA01                  245.85 (   0.00%)      230.84 *   6.11%*
> Amean     syst-NUMA01_THREADLOCAL      205.27 (   0.00%)      191.86 *   6.53%*
> Amean     syst-NUMA02                   18.57 (   0.00%)       18.09 *   2.58%*
> Amean     syst-NUMA02_SMT                2.63 (   0.00%)        2.54 *   3.47%*
> Amean     elsp-NUMA01                  517.17 (   0.00%)      526.34 *  -1.77%*
> Amean     elsp-NUMA01_THREADLOCAL       99.92 (   0.00%)      100.59 *  -0.67%*
> Amean     elsp-NUMA02                   15.81 (   0.00%)       15.72 *   0.59%*
> Amean     elsp-NUMA02_SMT               13.23 (   0.00%)       12.89 *   2.53%*
> 
>                    v6.10-rc5   v6.10-rc5
>                                   +patch
> Duration User     1064010.16  1075416.23
> Duration System      3307.64     3104.66
> Duration Elapsed     4537.54     4604.73
> 
> Link: https://lore.kernel.org/lkml/88d16815ef4cc2b6c08b4bb713b25421b5589bc7.1710829750.git.raghavendra.kt@amd.com/ #1
> Reported-by: Xiaoping Zhou <xiaoping.zhou@...el.com>
> Co-developed-by: Chen Yu <yu.c.chen@...el.com>
> Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> Signed-off-by: Yujie Liu <yujie.liu@...el.com>
> Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@....com>
> ---

Hi Peter, Mel,

May I know if this patch is in the right direction? It fixes
a SPECcpu performance regression found recently.

thanks,
Chenyu