Message-ID: <1a90a564-8fb3-68c3-361b-ac337386c32c@amd.com>
Date: Wed, 3 Jul 2024 19:16:35 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Chen Yu <yu.c.chen@...el.com>
CC: Mike Galbraith <efault@....de>, Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>, Juri Lelli <juri.lelli@...hat.com>, "Vincent
Guittot" <vincent.guittot@...aro.org>, Tim Chen <tim.c.chen@...el.com>,
"Yujie Liu" <yujie.liu@...el.com>, K Prateek Nayak <kprateek.nayak@....com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>, Chen Yu
<yu.chen.surf@...il.com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/2] sched/fair: Record the average duration of a task
On 7/3/2024 6:42 PM, Chen Yu wrote:
> On 2024-07-03 at 14:04:47 +0530, Raghavendra K T wrote:
>>
>>
>> On 7/1/2024 8:27 PM, Chen Yu wrote:
>>> Hi Mike,
>>>
>>> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
>>>> On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
>>>>> Hi Mike,
>>>>>
>>>>> Thanks for your time and giving the insights.
>>>
>>> According to a test conducted last month on a system with 500+ CPUs where 4 CPUs
>>> share the same L2 cache, around 20% improvement was noticed (though not as much
>>> as on the non-L2 shared platform). I haven't delved into the details yet, but my
>>> understanding is that L1 cache-to-cache latency within the L2 domain might also
>>> matter on large servers (which I need to investigate further).
>>>
>>>> 1:N or M:N
>>>> tasks can approach its wakeup frequency range, and there's nothing you can do
>>>> about the very same cache to cache latency you're trying to duck, it
>>>> just is what it is, and is considered perfectly fine as it is. That's
>>>> a bit of a red flag, but worse is the lack of knowledge wrt what tasks
>>>> are actually up to at any given time. We rashly presume that tasks
>>>> waking one another implies a 1:1 relationship, we routinely call them
>>>> buddies and generally get away with it.. but during any overlap they
>>>> can be doing anything including N way data share, and regardless of
>>>> what that is and section size, needless stacking flushes concurrency,
>>>> injecting service latency in its place, cost unknown.
>>>>
>>>
>>> I believe this is a generic issue that the current scheduler faces, where
>>> it attempts to predict a task's behavior based on its runtime. For instance,
>>> task_hot() checks the task's runtime to predict whether the task is
>>> cache-hot, regardless of what the task does during its time slice. This is
>>> also the case with WF_SYNC, which gives the scheduler a hint to wake the
>>> wakee on the current CPU to potentially benefit from cache locality.
>>>
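
(For reference, the cache-hot guess really is just a recency check; below is
a stripped-down illustration of what task_hot() in kernel/sched/fair.c boils
down to, with the other conditions elided and the helper name invented:)

/* "Cache hot" here only means "ran very recently on src_rq". */
static int task_ran_recently(struct task_struct *p, struct rq *src_rq)
{
	s64 delta = rq_clock_task(src_rq) - p->se.exec_start;

	return delta < (s64)sysctl_sched_migration_cost;
}
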
>>> A thought occurred to me that one possible method to determine if the waker
>>> and wakee share data could be to leverage NUMA balancing's numa_group data
>>> structure. As NUMA balancing periodically scans the task's VMA space and
>>> groups tasks accessing the same physical pages into one numa_group, we can
>>> infer that if the waker and wakee are within the same numa_group, they are
>>> likely to share data, and it might be appropriate to place the wakee on top
>>> of the waker.
>>>
>>> CC Raghavendra here in case he has any insights.
>>>
>>
>> Agree with your thought here,
>>
>
> Thanks for taking a look at this, Raghavendra.
>
>> So I imagine two possible things to explore here.
>>
>> 1) Use task1's and task2's numa_group and check if they belong to the same
>> numa_group; also check whether there is a possibility of an M:N relationship
>> by checking if t1/t2->numa_group->nr_tasks > 1, etc.
>>
>
> I see, so do you mean that if it is M:N rather than 1:1, we should avoid the
> task being woken up on the current CPU, to avoid task stacking?
Not sure actually, it perhaps depends on the use case. But it at least gives
an idea of the relationship.
The problem here is that we only know the two tasks belong to the same
numa_group; we cannot deduce the actual relationship (we only know
nr_tasks > 1, so whether it is 1:N or M:N is not known).
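
Something like the below completely untested sketch is what I had in mind
for 1). The helper name is made up, and it assumes it sits somewhere in
kernel/sched/fair.c under CONFIG_NUMA_BALANCING, where struct numa_group is
visible:

static bool tasks_share_numa_group(struct task_struct *waker,
				   struct task_struct *wakee)
{
	struct numa_group *ng_waker, *ng_wakee;
	bool share = false;

	rcu_read_lock();
	ng_waker = rcu_dereference(waker->numa_group);
	ng_wakee = rcu_dereference(wakee->numa_group);

	/*
	 * The same non-NULL group means NUMA balancing saw the two tasks
	 * fault on common pages during VMA scanning. nr_tasks > 2 would
	 * further hint at a 1:N or M:N relationship rather than a plain
	 * 1:1 pair, but not which of the two.
	 */
	if (ng_waker && ng_waker == ng_wakee)
		share = true;
	rcu_read_unlock();

	return share;
}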
>
>> 2) Given a VMA, we can use vma_numab_state's pids_active[] to check whether
>> task1 and task2 (threads) are possibly interested in the same VMA.
>> The latter looks practically difficult, though, because we don't want to
>> sweep across all the VMAs.
>>
>
> Yeah, we might have to scan all the VMAs (or maybe a subset of them) to
> gather the PID information, which might be a little costly. And
> pids_active[] is per-thread rather than per-process, and stacking the
> threads does not seem to be good enough (per Mike's comments).
>
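Just to illustrate what the pids_active[] check could look like for a single
VMA (rough sketch, helper name invented): pids_active[] is a small
hash-indexed bitmap of recently faulting PIDs, so hash collisions can give
false positives, and as you say we would still have to walk the mm's VMAs to
find a common one.

static bool vma_maybe_shared_by(struct vm_area_struct *vma,
				struct task_struct *t1,
				struct task_struct *t2)
{
	unsigned long pids;

	if (!vma->numab_state)
		return false;

	/* PIDs seen faulting on this VMA in the recent scan windows. */
	pids = vma->numab_state->pids_active[0] |
	       vma->numab_state->pids_active[1];

	return test_bit(hash_32(t1->pid, ilog2(BITS_PER_LONG)), &pids) &&
	       test_bit(hash_32(t2->pid, ilog2(BITS_PER_LONG)), &pids);
}
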
> Anyway, I'm preparing some full tests to see if there is an overall benefit
> from the current version. Later, let's investigate whether the NUMA
> balancing information could help here.
>