linux-kernel - Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned for a while

Message-ID: <b6c85c1d-0f67-4cb1-d1fa-0cee7e70885a@amd.com>
Date: Mon, 24 Jun 2024 11:26:50 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Yujie Liu <yujie.liu@...el.com>
CC: Chen Yu <yu.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>,
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri
 Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Chen Yu <yu.chen.surf@...il.com>, Tim Chen <tim.c.chen@...el.com>,
	<linux-kernel@...r.kernel.org>, Xiaoping Zhou <xiaoping.zhou@...el.com>
Subject: Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned
 for a while



On 6/18/2024 11:40 AM, Yujie Liu wrote:
> Hi Raghu,
> 
> On Tue, Jun 18, 2024 at 12:41:05AM +0530, Raghavendra K T wrote:
>> On 6/14/2024 10:26 AM, Chen Yu wrote:
>>> From: Yujie Liu <yujie.liu@...el.com>
>>>
>>> Problem statement:
>>> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
>>> NUMA vma scan overhead has been reduced a lot. However, this can be a
>>> double-edged sword: scanning fewer vmas may produce less NUMA page
>>> fault information, and insufficient information makes it harder for
>>> the NUMA balancer to make decisions. Later,
>>> commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
>>> regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
>>> mm numa_scan_seq based unconditional scan") were found to bring back part
>>> of the performance.
>>>
>>> Recently, when running SPECcpu on a 2-socket system with 320 CPUs, a long
>>> period of remote NUMA node reads was observed via PMU events. This causes
>>> high core-to-core variance and a performance penalty. Investigation
>>> showed that many vmas are skipped due to the active PID check. According
>>> to the trace events, in most cases vma_is_accessed() returns false
>>> because both pids_active[0] and pids_active[1] have been cleared.
>>>
>>
>> Thank you for reporting this and also providing a potential fix.
>> I do think this is a good fix to start with.
> 
> Thanks a lot for your valuable comments and suggestions.
> 
>>> As an experiment, if vma_is_accessed() is hacked to always return true,
>>> the long-duration remote NUMA access is gone.
>>>
>>> Proposal:
>>> The main idea is to adjust vma_is_accessed() so that it returns true more
>>> easily.
>>>
>>> Solution 1 is to extend pids_active[] from 2 to N entries, which has
>>> already been proposed by Peter[1]. How to choose N needs investigation.
>>>
>>
>> I am curious whether the implementation of this (PeterZ's suggestion) in
>> PATCH1 of the series at:
>> https://lore.kernel.org/linux-mm/cover.1710829750.git.raghavendra.kt@amd.com/
>>
>> gives some benefit. I did not see a good use case at that point, but it is
>> worth a try to see if it improves performance in your case.
> 
> PATCH1 extends the array size of pids_active[] from 2 to 4, so the
> history info can be kept for a longer time, but it is possible that the
> scanning could still be missed after the task sleeps for a long enough
> time. It seems that the N could be task-specific rather than a fixed
> value.
> 
> Anyway, we will test PATCH1 to see if it helps in our benchmark and
> come back later.
> 
>>> Solution 2 is to compare the difference between mm->numa_scan_seq and
>>> vma->numab_state->prev_scan_seq. If the difference exceeds a threshold,
>>> scan the vma.
>>>
>>> Solution 2 can cover process-based workloads (e.g. SPECcpu). The
>>> reason is that there is only one thread within such a process. If the
>>> process accesses the vma at the beginning and then sleeps for a long
>>> time, the pids_active[] array will be cleared. When the process is woken
>>> up, it never gets a chance to set prot_none again, because only the
>>> first 2 scan passes are treated as accessed unconditionally:
>>> (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
>>> and no other threads can help set prot_none.
>>>
>>
>> To summarize (just thinking out loud about the problem, IIUC):
>> The overall issue is that we do not adequately handle scanning for a task
>> with a single thread (or few threads) that sleeps or is inactive for some
>> time.
>>
>> One solution is to unconditionally return true (in a way inversely
>> proportional to the number of threads in a task).
>>
>> But,
>> 1. Does it regress single (or fewer) threaded tasks which do
>>   not really need aggressive scanning?
> 
> We haven't observed such regression so far as we don't have a suitable
> workload that can well represent the scenario of "tasks that do not
> need aggressive scanning."
> 
> In theory, it will bring extra scanning overhead, but the penalty of
> missing the necessary scanning for tasks that do need to be migrated
> may be more serious, since it can result in long periods of remote-node
> memory access. This is essentially a trade-off between scanning cost and
> coverage.
> 
>> 2. Are we able to address the issue for multi-threaded tasks which
>> show a similar pattern (viz., regularly inactive for some duration)?
> 
> Theoretically it should. If multiple threads each access VMAs of
> their own, like autonuma bench THREAD_LOCAL, each thread can only help
> itself to do the pg_none tagging. We have observed a slight performance
> improvement with this patch applied when running autonuma bench
> THREAD_LOCAL.
> 
> In common use cases, tasks with multiple threads are likely to share
> some vmas, so there is a higher chance that other threads help tag
> pg_none for the current thread; thus we can tolerate more vma skips,
> and vice versa.
> 

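For reference, the check described as solution 2 in the quoted text above
can be modelled in userspace roughly as below. This is only a sketch under
assumptions, not the patch itself: struct vma_model, model_vma_is_accessed()
and the threshold parameter are hypothetical names for illustration, and the
PID hash is reduced to a plain bit index.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; field names mirror those quoted in the thread. */
struct vma_model {
	unsigned int start_scan_seq;	/* mm scan sequence when the vma was first tracked */
	unsigned int prev_scan_seq;	/* mm scan sequence when the vma was last scanned */
	uint64_t pids_active[2];	/* hashed PIDs seen in the last two history windows */
};

/*
 * Decide whether the vma should be scanned in this pass.
 * pid_bit:   hashed PID bit index of the scanning task (assumed 0..63).
 * threshold: hypothetical tunable, not a value taken from the patch.
 */
static bool model_vma_is_accessed(unsigned int mm_scan_seq,
				  const struct vma_model *vma,
				  unsigned int pid_bit,
				  unsigned int threshold)
{
	uint64_t pids;

	/* The first two scan passes are unconditional. */
	if (mm_scan_seq - vma->start_scan_seq < 2)
		return true;

	/* The task touched the vma in one of the two recent windows. */
	pids = vma->pids_active[0] | vma->pids_active[1];
	if (pids & (UINT64_C(1) << (pid_bit & 63)))
		return true;

	/* Solution 2: the vma has not been scanned for a while, scan it anyway. */
	if (mm_scan_seq - vma->prev_scan_seq > threshold)
		return true;

	return false;
}
```

With both pids_active[] windows cleared, the model skips the vma until the
sequence gap since prev_scan_seq exceeds the threshold, which is the case a
long-sleeping single-threaded task would otherwise never get past.
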
I agree with the above points. Meanwhile, I ran my usual mmtests.
Results:

base = 6.10-rc4

autonumabench NUMA01
                            base                  patched
Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*

Duration User      380345.36   368252.04
Duration System      1358.89     1156.23
Duration Elapsed     2277.45     2213.25


autonumabench NUMA02
                            base                  patched
Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*

Duration User        1513.23     1575.48
Duration System         8.33        8.13
Duration Elapsed       28.59       29.71

kernbench
                            base                  patched
Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*

Duration User       68816.41    67615.74
Duration System     21873.94    22848.08
Duration Elapsed      506.66      504.55
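
The percentages in the Amean rows appear consistent with a relative gain
computed against base, where lower time is better. A minimal sketch,
assuming that formula (pct_gain() is a hypothetical helper, not part of
mmtests):

```c
/*
 * Relative gain of patched over base, in percent.
 * Positive means patched took less time than base.
 */
static double pct_gain(double base, double patched)
{
	return (base - patched) / base * 100.0;
}
```

For example, pct_gain(324.86, 315.58) comes to about 2.86 and
pct_gain(7284.16, 7608.72) to about -4.46, matching elsp-NUMA01 and syst-256.
The small-valued NUMA02 rows differ slightly when recomputed this way,
presumably because the displayed means are rounded.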

(I have not benchmarked with smaller thread counts yet; some larger
workload runs are TBD.)

But overall the results look promising.
Also, on the plus side, the patch is very simple. So, if PeterZ/Mel are
okay with using the nr_thread notion, please feel free to add:

Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@....com>