Message-ID: <5b35ae2b-a401-2489-4b3a-a03e8284d138@amd.com>
Date: Tue, 12 Sep 2023 08:35:12 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Daniel Bristot de Oliveira <bristot@...hat.com>,
Valentin Schneider <vschneid@...hat.com>,
"Gautham R . Shenoy" <gautham.shenoy@....com>,
linux-kernel@...r.kernel.org,
Peter Zijlstra <peterz@...radead.org>,
Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
Ingo Molnar <mingo@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in
select_idle_cpu()
Hello Chenyu,
On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
>
> thanks for your review,
>
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>> [..snip..]
>>> kernel/sched/fair.c | 30 +++++++++++++++++++++++++++---
>>> kernel/sched/features.h | 1 +
>>> kernel/sched/sched.h | 1 +
>>> 3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> hrtick_update(rq);
>>> now = sched_clock_cpu(cpu_of(rq));
>>> p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> + /*
>>> + * If this rq will become idle, and dequeued task is
>>> + * a short sleeping one, check if we can reserve
>>> + * this idle CPU for that task for a short while.
>>> + * During this reservation period, other wakees will
>>> + * skip this 'idle' CPU in select_idle_cpu(), and this
>>> + * short sleeping task can pick its previous CPU in
>>> + * select_idle_sibling(), which brings better cache
>>> + * locality.
>>> + */
>>> + if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> + p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> + rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>> }
>>>
>>> #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>> static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>> {
>>> if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> - sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> + sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> + if (sched_feat(SIS_CACHE) &&
>>> + sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> + return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
>
> This is a good point, we can optimize this further. Currently I only
> send a simple version to demonstrate how we can leverage the task's
> sleep time.
>
>> Speaking of cache-hot idle CPU, is netperf actually more happy with
>> piling on current CPU?
>
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
>
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
>
> If I understand correctly, WF_SYNC is to let the wakee be woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after wakeup. SIS_CACHE mainly cares about
> wakee's previous CPU. We only prevent other wakees from occupying
> the previous CPU; we do not increase the likelihood that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up with the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and keep other wakeups off it until the "cache_hot_timeout" is
crossed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.

Should the timeout be cleared if the wakeup decides to not target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could possibly show here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected, since a lot of idle
targets are overlooked while the CPUs stay marked cache-hot for a longer
duration.)

Let me know what you think.
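
To make the question above concrete, here is a rough user-space model of
the reservation, not kernel code. struct rq_model and the model_*()
helpers are hypothetical names made up for illustration. Only the
dequeue condition and the "now < cache_hot_timeout" check follow the RFC
hunks quoted above; clearing the timeout in model_wakeup() is the
suggested behaviour, an assumption and not something the series does
today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-runqueue state touched by the RFC. */
struct rq_model {
	uint64_t cache_hot_timeout;	/* reservation expiry, in ns */
	unsigned int nr_running;	/* runnable tasks on this CPU */
};

/*
 * Mirrors the dequeue-side condition in the RFC: reserve the CPU only
 * if it goes idle and the task's average sleep time is non-zero and
 * shorter than the migration cost.
 */
static void model_dequeue(struct rq_model *rq, uint64_t now,
			  uint64_t sleep_avg, uint64_t migration_cost)
{
	if (!rq->nr_running && sleep_avg && sleep_avg < migration_cost)
		rq->cache_hot_timeout = now + sleep_avg;
}

/* Mirrors the check added to __select_idle_cpu() in the RFC. */
static bool model_cpu_reserved(const struct rq_model *rq, uint64_t now)
{
	return now < rq->cache_hot_timeout;
}

/*
 * Assumed behaviour (not in the RFC): if the wakee ends up on a CPU
 * other than its previous one, e.g. because of WF_SYNC, drop the stale
 * reservation so other wakees can use the previous CPU again.
 */
static void model_wakeup(struct rq_model *rqs, int prev_cpu, int target_cpu)
{
	assert(prev_cpu >= 0 && target_cpu >= 0);

	if (target_cpu != prev_cpu)
		rqs[prev_cpu].cache_hot_timeout = 0;

	rqs[target_cpu].nr_running++;
}
```

With this model, a task dequeued from CPU 0 and later woken on CPU 1
leaves CPU 0 immediately selectable again, instead of staying hidden
from select_idle_cpu() until the timeout runs out.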
>
> Say, there are two tasks t1, t2. t1's previous CPU is p1.
> We do not influence whether wake_affine_idle() chooses p1 when t1
> is woken up, but we make sure t2 will not choose p1.
>
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
>
> OK, I'll collect these statistics.
Thank you :)
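
In case it helps, the ratio I have in mind could be tracked with two
counters along the lines of the sketch below. This is a user-space
illustration only; struct sis_cache_stats and the stats_*() helpers are
hypothetical names, and where exactly such counters would be bumped in
select_idle_cpu() is left open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, e.g. one instance per LLC domain. */
struct sis_cache_stats {
	uint64_t cpus_searched;		/* CPUs visited by the idle search */
	uint64_t cache_hot_skipped;	/* idle CPUs skipped as cache-hot */
};

/* Account one CPU visited during the search. */
static void stats_account(struct sis_cache_stats *s, bool idle, bool cache_hot)
{
	assert(s);

	s->cpus_searched++;
	if (idle && cache_hot)
		s->cache_hot_skipped++;
}

/* Percentage of visited CPUs that were idle but skipped as cache-hot. */
static unsigned int stats_skip_percent(const struct sis_cache_stats *s)
{
	assert(s);

	if (!s->cpus_searched)
		return 0;

	return (unsigned int)(s->cache_hot_skipped * 100 / s->cpus_searched);
}
```

A high percentage on a busy system would suggest the search is spending
most of its budget on CPUs it cannot pick.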
>
> [..snip..]
>
--
Thanks and Regards,
Prateek