Message-ID: <ZnrhxnFRNI7CWlFG@yujie-X299>
Date: Tue, 25 Jun 2024 23:27:02 +0800
From: Yujie Liu <yujie.liu@...el.com>
To: Raghavendra K T <raghavendra.kt@....com>
CC: Chen Yu <yu.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>,
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Chen Yu <yu.chen.surf@...il.com>, Tim Chen <tim.c.chen@...el.com>,
	<linux-kernel@...r.kernel.org>, Xiaoping Zhou <xiaoping.zhou@...el.com>
Subject: Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned
 for a while

On Mon, Jun 24, 2024 at 11:26:50AM +0530, Raghavendra K T wrote:
> On 6/18/2024 11:40 AM, Yujie Liu wrote:
> > Hi Raghu,
> > 
> > On Tue, Jun 18, 2024 at 12:41:05AM +0530, Raghavendra K T wrote:
> > > On 6/14/2024 10:26 AM, Chen Yu wrote:
> > > > From: Yujie Liu <yujie.liu@...el.com>
> > > > 
> > > > Problem statement:
> > > > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> > > > NUMA vma scan overhead has been reduced a lot. Meanwhile, it can be a
> > > > double-edged sword: reducing the vma scans also produces less NUMA page
> > > > fault information, and the insufficient information makes it harder for
> > > > the NUMA balancer to make decisions. Later,
> > > > commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
> > > > regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
> > > > mm numa_scan_seq based unconditional scan") were found to bring back part
> > > > of the performance.
> > > > 
> > > > Recently, when running SPECcpu on a 320-CPU/2-socket system, a long
> > > > duration of remote NUMA node reads was observed via PMU events. It causes
> > > > high core-to-core variance and a performance penalty. After investigation,
> > > > it was found that many vmas are skipped due to the active PID check.
> > > > According to the trace events, in most cases vma_is_accessed() returns
> > > > false because both pids_active[0] and pids_active[1] have been cleared.
> > > > 
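For reference, here is a rough, simplified sketch of the check in question,
approximating the vma_is_accessed() logic after the commits above (not the
exact upstream code):

	static bool vma_is_accessed(struct mm_struct *mm,
				    struct vm_area_struct *vma)
	{
		unsigned long pids;

		/* the first two scan sequences are always allowed */
		if (READ_ONCE(mm->numa_scan_seq) -
		    vma->numab_state->start_scan_seq < 2)
			return true;

		/* otherwise the task's hashed PID must still be present in
		 * the recent access history; once both windows have been
		 * cleared (e.g. the task slept through the scan periods),
		 * the vma is skipped */
		pids = vma->numab_state->pids_active[0] |
		       vma->numab_state->pids_active[1];
		return test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)),
				&pids);
	}
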
> > > 
> > > Thank you for reporting this and also giving a potential fix.
> > > I do think this is a good fix to start with.
> > 
> > Thanks a lot for your valuable comments and suggestions.
> > 
> > > > As an experiment, if vma_is_accessed() is hacked to always return true,
> > > > the long-duration remote NUMA access is gone.
> > > > 
> > > > Proposal:
> > > > The main idea is to adjust vma_is_accessed() so that it returns true more easily.
> > > > 
> > > > Solution 1 is to extend pids_active[] from 2 to N entries, which has already
> > > > been proposed by Peter [1]. How to decide N needs investigation.
> > > > 
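To illustrate solution 1, the access check would then OR together N history
windows instead of two, roughly along these lines (a sketch only;
NR_PID_HISTORY is a placeholder name for whatever N is chosen, not existing
kernel code):

	/* sketch: widen the PID access history from 2 windows to N */
	unsigned long pids = 0;
	int i;

	for (i = 0; i < NR_PID_HISTORY; i++)	/* how to pick N is open */
		pids |= vma->numab_state->pids_active[i];

	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
		return true;
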
> > > 
> > > I am curious whether this (PeterZ's suggestion), implemented in PATCH1 of
> > > this link:
> > > https://lore.kernel.org/linux-mm/cover.1710829750.git.raghavendra.kt@amd.com/
> > > 
> > > gives some benefit. I did not see a good use case at that point, but it is
> > > worth a try to see if it improves performance in your case.
> > 
> > PATCH1 extends the array size of pids_active[] from 2 to 4, so the
> > history info can be kept for a longer time, but it is possible that the
> > scanning could still be missed after the task sleeps for a long enough
> > time. It seems that the N could be task-specific rather than a fixed
> > value.
> > 
> > Anyway, we will test PATCH1 to see if it helps in our benchmark and
> > come back later.

Sorry for my late reply. Our test machine was temporarily unavailable
for the last few days, and I finished the testing just now. After
applying PATCH1, the average level of remote memory access doesn't
change significantly in our benchmark (SPEC CPU omnetpp_r test), and we
can still see a few cores having ~500MB/s remote memory access for ~20
seconds, which may indicate some missed vma scans.

It seems that extending the pids_active[] size to 4 doesn't have a
significant impact on the specific benchmark we used, but it may
help in other cases. Chen Yu has tested PATCH1 with the autonuma
benchmark and replied to that thread with detailed data [1].

[1] https://lore.kernel.org/all/ZnrSIGKBpyeTmSJt@chenyu5-mobl2/

> > > > Solution 2 is to compare the diff between mm->numa_scan_seq and
> > > > vma->numab_state->prev_scan_seq. If the diff has exceeded a threshold,
> > > > scan the vma.
> > > > 
> > > > Solution 2 can cover process-based workloads (e.g. SPECcpu). The reason
> > > > is: there is only 1 thread within such a process. If this process
> > > > accesses the vma at the beginning and then sleeps for a long time, the
> > > > pids_active[] array will be cleared. When this process is woken up, it
> > > > will never get a chance to set prot_none anymore, because the vma is
> > > > unconditionally regarded as accessed only during the first 2 scans:
> > > > (current->mm->numa_scan_seq - vma->numab_state->start_scan_seq) < 2
> > > > and no other threads can help set this prot_none.
> > > > 
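To make solution 2 concrete, the extra check in vma_is_accessed() would look
roughly like the following (a sketch only; SCAN_SEQ_THRESHOLD is a placeholder
name for the threshold to be decided, not an existing kernel symbol):

	/* sketch of solution 2: if this vma has not actually been scanned
	 * for several scan sequences, force a scan regardless of the
	 * per-vma PID access history */
	if (READ_ONCE(mm->numa_scan_seq) >
	    vma->numab_state->prev_scan_seq + SCAN_SEQ_THRESHOLD)
		return true;
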
> > > 
> > > To summarize (just thinking aloud on the problem, IIUC):
> > > The overall issue is that we are not adequately handling the scanning of
> > > a single- (or few-) threaded task that sleeps or is inactive for some time.
> > > 
> > > One solution is to unconditionally return true (in a way inversely
> > > proportional to the number of threads in a task).
> > > 
> > > But,
> > > 1. Does it regress single- (or few-) threaded tasks which do
> > >    not really need aggressive scanning?
> > 
> > We haven't observed such a regression so far, as we don't have a suitable
> > workload that can well represent the scenario of "tasks that do not
> > need aggressive scanning."
> > 
> > In theory, it will bring extra scanning overhead, but the penalty of
> > missing the necessary scanning for tasks that do need to be migrated
> > may be more serious, since it can result in long periods of remote-node
> > memory access. This is essentially a trade-off between scanning cost and
> > coverage.
> > 
> > > 2. Are we able to address the issue for multi-threaded tasks which show
> > > a similar kind of pattern (viz., regularly inactive for some duration)?
> > 
> > Theoretically it should. If multiple threads each access different VMAs
> > of their own, like autonuma bench THREAD_LOCAL, each thread can only
> > help itself do the prot_none tagging. We have observed a slight
> > performance improvement with this patch applied when running autonuma
> > bench THREAD_LOCAL.
> > 
> > In common use cases, tasks with multiple threads are likely to share
> > some vmas, so there is a higher chance that other threads help tag
> > prot_none for the current thread, and thus we can tolerate more vma
> > skips, and vice versa.
> > 
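A rough sketch of how that trade-off could be folded into the solution 2
check above, with the tolerated number of skipped scan sequences growing
with the thread count (the factor of 2 here is purely illustrative, not a
value taken from the patch):

	/* sketch: more threads sharing the mm means other threads are more
	 * likely to tag prot_none for us, so tolerate more skipped scans;
	 * a single-threaded process gets almost no tolerance */
	unsigned int scan_seq_tolerance = 2 * get_nr_threads(current);

	if (READ_ONCE(mm->numa_scan_seq) >
	    vma->numab_state->prev_scan_seq + scan_seq_tolerance)
		return true;
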
> 
> Agree with the above points. Meanwhile, I ran my usual mmtests.
> Results:
> 
> base = 6.10-rc4
> 
> autonumabench NUMA01
>                            base                  patched
> Amean     syst-NUMA01      194.05 (   0.00%)      165.11 *  14.92%*
> Amean     elsp-NUMA01      324.86 (   0.00%)      315.58 *   2.86%*
> 
> Duration User      380345.36   368252.04
> Duration System      1358.89     1156.23
> Duration Elapsed     2277.45     2213.25
> 
> 
> autonumabench NUMA02
> 
> Amean     syst-NUMA02        1.12 (   0.00%)        1.09 *   2.93%*
> Amean     elsp-NUMA02        3.50 (   0.00%)        3.56 *  -1.84%*
> 
> Duration User        1513.23     1575.48
> Duration System         8.33        8.13
> Duration Elapsed       28.59       29.71
> 
> kernbench
> Amean     user-256    22935.42 (   0.00%)    22535.19 *   1.75%*
> Amean     syst-256     7284.16 (   0.00%)     7608.72 *  -4.46%*
> Amean     elsp-256      159.01 (   0.00%)      158.17 *   0.53%*
> 
> Duration User       68816.41    67615.74
> Duration System     21873.94    22848.08
> Duration Elapsed      506.66      504.55
> 
> (I have not done benchmarking with smaller thread counts; some larger
> workload runs are TBD)
> 
> But overall the results look promising.
> Also, on the plus side, we have a very simple patch. So,
> 
> if PeterZ/Mel are okay with using the nr_thread notion,
> please feel free to add.
> 
> Reviewed-and-Tested-by: Raghavendra K T <raghavendra.kt@....com>

Thanks a lot for reviewing and testing this patch! We will improve the
commit message and send out the formal patch soon.

Best Regards,
Yujie
