[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <b6c85c1d-0f67-4cb1-d1fa-0cee7e70885a@amd.com>
Date: Mon, 24 Jun 2024 11:26:50 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Yujie Liu <yujie.liu@...el.com>
CC: Chen Yu <yu.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>,
Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri
Lelli <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Chen Yu <yu.chen.surf@...il.com>, Tim Chen <tim.c.chen@...el.com>,
<linux-kernel@...r.kernel.org>, Xiaoping Zhou <xiaoping.zhou@...el.com>
Subject: Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned
for a while
On 6/18/2024 11:40 AM, Yujie Liu wrote:
> Hi Raghu,
>
> On Tue, Jun 18, 2024 at 12:41:05AM +0530, Raghavendra K T wrote:
>> On 6/14/2024 10:26 AM, Chen Yu wrote:
>>> From: Yujie Liu <yujie.liu@...el.com>
>>>
>>> Problem statement:
>>> Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
>>> Numa vma scan overhead has been reduced a lot. Meanwhile, it could be
>>> a double-sword that, the reducing of the vma scan might create less Numa
>>> page fault information. The insufficient information makes it harder for
>>> the Numa balancer to make decision. Later,
>>> commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
>>> regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
>>> mm numa_scan_seq based unconditional scan") are found to bring back part
>>> of the performance.
>>>
>>> Recently when running SPECcpu on a 320 CPUs/2 Sockets system, a long
>>> duration of remote Numa node read was observed by PMU events. It causes
>>> high core-to-core variance and performance penalty. After the
>>> investigation, it is found that many vmas are skipped due to the active
>>> PID check. According to the trace events, in most cases, vma_is_accessed()
>>> returns false because both pids_active[0] and pids_active[1] have been
>>> cleared.
>>>
>>
>> Thank you for reporting this and also giving potential fix.
>> I do think this is a good fix to start with.
>
> Thanks a lot for your valuable comments and suggestions.
>
>>> As an experiment, if the vma_is_accessed() is hacked to always return true,
>>> the long duration remote Numa access is gone.
>>>
>>> Proposal:
>>> The main idea is to adjust vma_is_accessed() to let it return true easier.
>>>
>>> solution 1 is to extend the pids_active[] from 2 to N, which has already
>>> been proposed by Peter[1]. And how to decide N needs investigation.
>>>
>>
>> I am curious if this (PeterZ's suggestion) implementation in PATCH1 of
>> link:
>> https://lore.kernel.org/linux-mm/cover.1710829750.git.raghavendra.kt@amd.com/
>>
>> get some benefit. I did not see good usecase at that point. but worth a
>> try to see if it improves performance in your case.
>
> PATCH1 extends the array size of pids_active[] from 2 to 4, so the
> history info can be kept for a longer time, but it is possible that the
> scanning could still be missed after the task sleeps for a long enough
> time. It seems that the N could be task-specific rather than a fixed
> value.
>
> Anyway, we will test PATCH1 to see if it helps in our benchmark and
> come back later.
>
>>> solution 2 is to compare the diff between mm->numa_scan_seq and
>>> vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
>>> scan the vma.
>>>
>>> solution 2 can be used to cover process-based workload(SPECcpu eg). The
>>> reason is: There is only 1 thread within this process. If this process
>>> access the vma at the beginning, then sleeps for a long time, the
>>> pid_active array will be cleared. When this process is woken up, it will
>>> never get a chance to set prot_none anymore. Because only the first 2
>>> times of access is regarded as accessed:
>>> (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
>>> and no other threads can help set this prot_none.
>>>
>>
>> To Summarize: (just thinking loud on the problem IIUC)
>> The issue overall is, we are not handling the scanning of a single
>> (fewer) thread task that sleeps or inactive) some time adequately.
>>
>> one solution is to unconditionally return true (in a way inversely
>> proportional to number of threads in a task).
>>
>> But,
>> 1. Does it regress single (or fewer) threaded tasks which does
>> not really need aggressive scanning.
>
> We haven't observed such regression so far as we don't have a suitable
> workload that can well represent the scenario of "tasks that do not
> need aggressive scanning."
>
> In theory, it will bring extra scanning overhead, but the penalty of
> missing the necessary scanning for the tasks that do need to be migrated
> may be more serious since it can result in long time remote node memory
> access. This is more likely a trade-off between the scanning cost and
> coverage.
>
>> 2. Are we able to address the issue for multi threaded tasks which
>> show similar kind of pattern (viz., inactive for some duration regularly).
>
> Theoretically it should do. If multi-threads access different VMAs of
> their own, like autonuma bench THREAD_LOCAL, each thread can only help
> itself to do the pg_none tagging. We have observed slight performance
> improvement with this patch applied when running autonuma bench
> THREAD_LOCAL.
>
> In common use cases, tasks with multiple threads are likely to share
> some vmas, so there could be higher chance that other threads help tag
> the pg_none for the current thread, thus we can tolerate more vma skip,
> and vice versa.
>
Agree with above points, meanwhile when I ran my normal mmtest,
Results:
base = 6.10-rc4
autonumabench NUMA01
base patched
Amean syst-NUMA01 194.05 ( 0.00%) 165.11 * 14.92%*
Amean elsp-NUMA01 324.86 ( 0.00%) 315.58 * 2.86%*
Duration User 380345.36 368252.04
Duration System 1358.89 1156.23
Duration Elapsed 2277.45 2213.25
autonumabench NUMA02
Amean syst-NUMA02 1.12 ( 0.00%) 1.09 * 2.93%*
Amean elsp-NUMA02 3.50 ( 0.00%) 3.56 * -1.84%*
Duration User 1513.23 1575.48
Duration System 8.33 8.13
Duration Elapsed 28.59 29.71
kernbench
Amean user-256 22935.42 ( 0.00%) 22535.19 * 1.75%*
Amean syst-256 7284.16 ( 0.00%) 7608.72 * -4.46%*
Amean elsp-256 159.01 ( 0.00%) 158.17 * 0.53%*
Duration User 68816.41 67615.74
Duration System 21873.94 22848.08
Duration Elapsed 506.66 504.55
( I have not done bench-marking with smaller threads, some larger
workload run is TBD)
But overall results look promising.
Also on the plus side we have a very simple patch, So
If PeterZ/Mel are okay with using nr_thread notion,
please feel free to add.
Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@....com>
Powered by blists - more mailing lists