Message-ID: <ZoVOL3mOVFAGEmZV@chenyu5-mobl2>
Date: Wed, 3 Jul 2024 21:12:15 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: Raghavendra K T <raghavendra.kt@....com>
CC: Mike Galbraith <efault@....de>, Peter Zijlstra <peterz@...radead.org>,
	Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, "Vincent
 Guittot" <vincent.guittot@...aro.org>, Tim Chen <tim.c.chen@...el.com>,
	"Yujie Liu" <yujie.liu@...el.com>, K Prateek Nayak <kprateek.nayak@....com>,
	"Gautham R . Shenoy" <gautham.shenoy@....com>, Chen Yu
	<yu.chen.surf@...il.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/fair: Record the average duration of a task

On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
> 
> 
> On 7/1/2024 8:27 PM, Chen Yu wrote:
> > Hi Mike,
> > 
> > On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> > > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > > > Hi Mike,
> > > > 
> > > > Thanks for your time and for the insights.
> > 
> > According to a test conducted last month on a system with 500+ CPUs, where 4 CPUs
> > share the same L2 cache, around a 20% improvement was observed (though not as much
> > as on the platform without a shared L2). I haven't delved into the details yet, but my
> > understanding is that L1 cache-to-cache latency within the L2 domain might also
> > matter on large servers, which I need to investigate further.
> > 
> > > 1:N or M:N
> > > tasks can approach its wakeup frequency range, and there's nothing you can do
> > > about the very same cache to cache latency you're trying to duck, it
> > > just is what it is, and is considered perfectly fine as it is.  That's
> > > a bit of a red flag, but worse is the lack of knowledge wrt what tasks
> > > are actually up to at any given time.  We rashly presume that tasks
> > > waking one another implies a 1:1 relationship, we routinely call them
> > > buddies and generally get away with it.. but during any overlap they
> > > can be doing anything including N way data share, and regardless of
> > > what that is and section size, needless stacking flushes concurrency,
> > > injecting service latency in its place, cost unknown.
> > > 
> > 
> > I believe this is a generic issue that the current scheduler faces, where
> > it attempts to predict a task's behavior based on its runtime. For instance,
> > task_hot() checks the task's runtime to predict whether the task is cache-hot,
> > regardless of what the task does during its time slice. The same applies to
> > WF_SYNC, which gives the scheduler a hint to wake the wakee on the current
> > CPU to potentially benefit from cache locality.
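
To make the above concrete, the runtime-based heuristic is roughly the
following, simplified from task_hot() in kernel/sched/fair.c (the
surrounding checks are elided):

	/*
	 * Simplified sketch of task_hot(): a task is presumed cache-hot
	 * purely from how recently it ran, regardless of what data it
	 * actually touched during that time.
	 */
	static int task_hot(struct task_struct *p, struct lb_env *env)
	{
		s64 delta;

		/* Time since the task last started running on src_rq */
		delta = rq_clock_task(env->src_rq) - p->se.exec_start;

		/* A recently-run task is assumed to have warm caches */
		return delta < (s64)sysctl_sched_migration_cost;
	}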
> > 
> > A thought occurred to me that one possible way to determine whether the waker
> > and wakee share data could be to leverage NUMA balancing's numa_group data structure.
> > Since NUMA balancing periodically scans a task's VMA space and groups tasks accessing
> > the same physical pages into one numa_group, we can infer that if the waker and wakee
> > are within the same numa_group, they are likely to share data, and it might be
> > appropriate to place the wakee on top of the waker.
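
Something like the following hypothetical helper is what I have in mind.
tasks_share_numa_group() does not exist in the kernel today; the sketch
assumes CONFIG_NUMA_BALANCING and follows the RCU usage around
p->numa_group in kernel/sched/fair.c:

	/*
	 * Hypothetical helper: return true if the waker and wakee were
	 * grouped together by NUMA balancing, i.e. they were observed
	 * faulting on the same physical pages.
	 */
	static bool tasks_share_numa_group(struct task_struct *waker,
					   struct task_struct *wakee)
	{
		struct numa_group *ng_waker, *ng_wakee;
		bool share;

		rcu_read_lock();
		ng_waker = rcu_dereference(waker->numa_group);
		ng_wakee = rcu_dereference(wakee->numa_group);
		/* The same non-NULL group implies shared page access */
		share = ng_waker && ng_waker == ng_wakee;
		rcu_read_unlock();

		return share;
	}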
> > 
> > CC Raghavendra here in case he has any insights.
> > 
> 
> Agree with your thought here,
>

Thanks for taking a look at this, Raghavendra.

> So I imagine two possible things to explore here.
> 
> 1) Use task1's and task2's numa_group and check whether they belong to the same
> numa_group; also check whether there is a possibility of an M:N relationship
> by checking if t1/t2->numa_group->nr_tasks > 1, etc.
>

I see. So do you mean that if the relationship is M:N rather than 1:1, we should
avoid waking the task on the current CPU, so as to avoid task stacking?
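
If so, the wakeup-path check could look something like the hypothetical
sketch below. wake_affine_numa() is a made-up name, though struct
numa_group and its nr_tasks field do exist in kernel/sched/fair.c:

	/*
	 * Hypothetical check: only consider placing the wakee on the
	 * waker's CPU when they share a numa_group AND the group looks
	 * like a 1:1 pair; a larger nr_tasks hints at an M:N
	 * relationship, where stacking could hurt concurrency.
	 */
	static bool wake_affine_numa(struct task_struct *waker,
				     struct task_struct *wakee)
	{
		struct numa_group *ng;
		bool stack_ok = false;

		rcu_read_lock();
		ng = rcu_dereference(waker->numa_group);
		if (ng && ng == rcu_dereference(wakee->numa_group))
			stack_ok = READ_ONCE(ng->nr_tasks) <= 2;
		rcu_read_unlock();

		return stack_ok;
	}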
 
> 2) Given a VMA, we can use vma_numab_state's pids_active[] to tell whether
> task1 and task2 (threads) are possibly interested in the same VMA.
> The latter looks practically difficult, because we probably don't want to
> sweep across all the VMAs..
>

Yeah, we might have to scan all the VMAs (or maybe a subset of them) to gather
the PID information, which might be a little costly. Also, pids_active[] tracks
threads rather than processes, and stacking threads does not seem to be good
enough (per Mike's comments).
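
For reference, the kind of sweep this would need is roughly the
hypothetical (and likely too costly) helper below, modeled on the
pids_active[] hashing in vma_is_accessed() in kernel/sched/fair.c.
threads_touch_same_vma() is a made-up name, and since pids_active[]
stores hashed pids, hash collisions can give false positives:

	/*
	 * Hypothetical helper: walk the shared mm's VMAs and check
	 * whether both threads' (hashed) pids were recorded as active
	 * in the same VMA by the NUMA balancing scanner.
	 */
	static bool threads_touch_same_vma(struct task_struct *t1,
					   struct task_struct *t2)
	{
		struct mm_struct *mm = t1->mm;
		struct vm_area_struct *vma;
		struct vma_iterator vmi;
		int b1 = hash_32(t1->pid, ilog2(BITS_PER_LONG));
		int b2 = hash_32(t2->pid, ilog2(BITS_PER_LONG));
		bool found = false;

		/* Only meaningful for threads of the same process */
		if (!mm || mm != t2->mm)
			return false;

		mmap_read_lock(mm);
		vma_iter_init(&vmi, mm, 0);
		for_each_vma(vmi, vma) {
			unsigned long pids;

			if (!vma->numab_state)
				continue;

			pids = vma->numab_state->pids_active[0] |
			       vma->numab_state->pids_active[1];
			if (test_bit(b1, &pids) && test_bit(b2, &pids)) {
				found = true;
				break;
			}
		}
		mmap_read_unlock(mm);

		return found;
	}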

Anyway, I'm preparing some full tests to see whether there is an overall benefit
from the current version. Later, let's investigate whether the NUMA balancing
information could help here.

thanks,
Chenyu
