linux-kernel - Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned for a while

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZnEk3yARRPP9WXr9@yujie-X299>
Date: Tue, 18 Jun 2024 14:10:39 +0800
From: Yujie Liu <yujie.liu@...el.com>
To: Raghavendra K T <raghavendra.kt@....com>
CC: Chen Yu <yu.c.chen@...el.com>, Mel Gorman <mgorman@...hsingularity.net>,
	Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, "Juri
 Lelli" <juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
	Chen Yu <yu.chen.surf@...il.com>, Tim Chen <tim.c.chen@...el.com>,
	<linux-kernel@...r.kernel.org>, Xiaoping Zhou <xiaoping.zhou@...el.com>
Subject: Re: [RFC PATCH] sched/numa: scan the vma if it has not been scanned
 for a while

Hi Raghu,

On Tue, Jun 18, 2024 at 12:41:05AM +0530, Raghavendra K T wrote:
> On 6/14/2024 10:26 AM, Chen Yu wrote:
> > From: Yujie Liu <yujie.liu@...el.com>
> > 
> > Problem statement:
> > Since commit fc137c0ddab2 ("sched/numa: enhance vma scanning logic"), the
> > Numa vma scan overhead has been reduced a lot. Meanwhile, it could be
> > a double-sword that, the reducing of the vma scan might create less Numa
> > page fault information. The insufficient information makes it harder for
> > the Numa balancer to make decision. Later,
> > commit b7a5b537c55c08 ("sched/numa: Complete scanning of partial VMAs
> > regardless of PID activity") and commit 84db47ca7146d7 ("sched/numa: Fix
> > mm numa_scan_seq based unconditional scan") are found to bring back part
> > of the performance.
> > 
> > Recently when running SPECcpu on a 320 CPUs/2 Sockets system, a long
> > duration of remote Numa node read was observed by PMU events. It causes
> > high core-to-core variance and performance penalty. After the
> > investigation, it is found that many vmas are skipped due to the active
> > PID check. According to the trace events, in most cases, vma_is_accessed()
> > returns false because both pids_active[0] and pids_active[1] have been
> > cleared.
> > 
> 
> Thank you for reporting this and also giving potential fix.
> I do think this is a good fix to start with.

Thanks a lot for your valuable comments and suggestions.

> > As an experiment, if the vma_is_accessed() is hacked to always return true,
> > the long duration remote Numa access is gone.
> > 
> > Proposal:
> > The main idea is to adjust vma_is_accessed() to let it return true easier.
> > 
> > solution 1 is to extend the pids_active[] from 2 to N, which has already
> > been proposed by Peter[1]. And how to decide N needs investigation.
> > 
> 
> I am curious if this (PeterZ's suggestion) implementation in PATCH1 of
> link:
> https://lore.kernel.org/linux-mm/cover.1710829750.git.raghavendra.kt@amd.com/
> 
> get some benefit. I did not see good usecase at that point. but worth a
> try to see if it improves performance in your case.

PATCH1 extends the array size of pids_active[] from 2 to 4, so the
history info can be kept for a longer time, but it is possible that the
scanning could still be missed after the task sleeps for a long enough
time. It seems that the N could be task-specific rather than a fixed
value.

Anyway, we will test PATCH1 to see if it helps in our benchmark and
come back later.

> > solution 2 is to compare the diff between mm->numa_scan_seq and
> > vma->numab_state->prev_scan_seq. If the diff has exceeded the threshold,
> > scan the vma.
> > 
> > solution 2 can be used to cover process-based workload(SPECcpu eg). The
> > reason is: There is only 1 thread within this process. If this process
> > access the vma at the beginning, then sleeps for a long time, the
> > pid_active array will be cleared. When this process is woken up, it will
> > never get a chance to set prot_none anymore. Because only the first 2
> > times of access is regarded as accessed:
> > (current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2
> > and no other threads can help set this prot_none.
> > 
> 
> To Summarize: (just thinking loud on the problem IIUC)
> The issue overall is, we are not handling the scanning of a single
> (fewer) thread task that sleeps or inactive) some time adequately.
> 
> one solution is to unconditionally return true (in a way inversely
> proportional to number of threads in a task).
> 
> But,
> 1. Does it regress single (or fewer) threaded tasks which does
>  not really need aggressive scanning.

We haven't observed such regression so far as we don't have a suitable
workload that can well represent the scenario of "tasks that do not
need aggressive scanning."

In theory, it will bring extra scanning overhead, but the penalty of
missing the necessary scanning for the tasks that do need to be migrated
may be more serious since it can result in long time remote node memory
access. This is more likely a trade-off between the scanning cost and
coverage.

> 2. Are we able to address the issue for multi threaded tasks which
> show similar kind of pattern (viz., inactive for some duration regularly).

Theoretically it should do. If multi-threads access different VMAs of
their own, like autonuma bench THREAD_LOCAL, each thread can only help
itself to do the pg_none tagging. We have observed slight performance
improvement with this patch applied when running autonuma bench
THREAD_LOCAL.

In common use cases, tasks with multiple threads are likely to share
some vmas, so there could be higher chance that other threads help tag
the pg_none for the current thread, thus we can tolerate more vma skip,
and vice versa.

Thanks,
Yujie

> Having said this,
> I do not have any thing strong against the approach.
> I will also try to reproduce the issue, mean while thinking, if there could
> be a better approach.
> 
> (unrelated to this, /me still think more scanning needed for tasks with
>  a bigger vma something like PATCH3 in same link given above).
> 
> > This patch is mainly to raise this question, and seek for suggestion from
> > the community to handle it properly. Thanks in advance for any suggestion.
> > 
> > Link: https://lore.kernel.org/lkml/Y9zxkGf50bqkucum@hirez.programming.kicks-ass.net/ #1
> > Reported-by: Xiaoping Zhou <xiaoping.zhou@...el.com>
> > Co-developed-by: Chen Yu <yu.c.chen@...el.com>
> > Signed-off-by: Chen Yu <yu.c.chen@...el.com>
> > Signed-off-by: Yujie Liu <yujie.liu@...el.com>
> > ---
> >   kernel/sched/fair.c | 8 ++++++++
> >   1 file changed, 8 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8a5b1ae0aa55..2b74fc06fb95 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3188,6 +3188,14 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
> >   		return true;
> >   	}
> > +	/*
> > +	 * This vma has not been accessed for a while, and has limited number of threads
> > +	 * within the current task can help.
> > +	 */
> > +	if (READ_ONCE(mm->numa_scan_seq) >
> > +	   (vma->numab_state->prev_scan_seq + get_nr_threads(current)))
> > +		return true;
> > +
> 
> I see we do update prev_scan_seq to current numa_scan_seq at the
> end of scanning. So we are good here, by just returning true.
> 
> >   	return false;
> >   }
> 
> Thanks and Regards
> - Raghu