linux-kernel - Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in select_idle

Message-ID: <5b35ae2b-a401-2489-4b3a-a03e8284d138@amd.com>
Date:   Tue, 12 Sep 2023 08:35:12 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Tim Chen <tim.c.chen@...el.com>, Aaron Lu <aaron.lu@...el.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        linux-kernel@...r.kernel.org,
        Peter Zijlstra <peterz@...radead.org>,
        Mathieu Desnoyers <mathieu.desnoyers@...icios.com>,
        Ingo Molnar <mingo@...hat.com>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Juri Lelli <juri.lelli@...hat.com>
Subject: Re: [RFC PATCH 2/2] sched/fair: skip the cache hot CPU in
 select_idle_cpu()

Hello Chenyu,

On 9/11/2023 3:49 PM, Chen Yu wrote:
> Hi Prateek,
> 
> thanks for your review,
> 
> On 2023-09-11 at 13:59:10 +0530, K Prateek Nayak wrote:
>> Hello Chenyu,
>>
>> On 9/11/2023 8:20 AM, Chen Yu wrote:
>>>  [..snip..]
>>>  kernel/sched/fair.c     | 30 +++++++++++++++++++++++++++---
>>>  kernel/sched/features.h |  1 +
>>>  kernel/sched/sched.h    |  1 +
>>>  3 files changed, 29 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index e20f50726ab8..fe3b760c9654 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6629,6 +6629,21 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>>  	hrtick_update(rq);
>>>  	now = sched_clock_cpu(cpu_of(rq));
>>>  	p->se.prev_sleep_time = task_sleep ? now : 0;
>>> +#ifdef CONFIG_SMP
>>> +	/*
>>> +	 * If this rq will become idle, and dequeued task is
>>> +	 * a short sleeping one, check if we can reserve
>>> +	 * this idle CPU for that task for a short while.
>>> +	 * During this reservation period, other wakees will
>>> +	 * skip this 'idle' CPU in select_idle_cpu(), and this
>>> +	 * short sleeping task can pick its previous CPU in
>>> +	 * select_idle_sibling(), which brings better cache
>>> +	 * locality.
>>> +	 */
>>> +	if (sched_feat(SIS_CACHE) && task_sleep && !rq->nr_running &&
>>> +	    p->se.sleep_avg && p->se.sleep_avg < sysctl_sched_migration_cost)
>>> +		rq->cache_hot_timeout = now + p->se.sleep_avg;
>>> +#endif
>>>  }
>>>  
>>>  #ifdef CONFIG_SMP
>>> @@ -6982,8 +6997,13 @@ static inline int find_idlest_cpu(struct sched_domain *sd, struct task_struct *p
>>>  static inline int __select_idle_cpu(int cpu, struct task_struct *p)
>>>  {
>>>  	if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
>>> -	    sched_cpu_cookie_match(cpu_rq(cpu), p))
>>> +	    sched_cpu_cookie_match(cpu_rq(cpu), p)) {
>>> +		if (sched_feat(SIS_CACHE) &&
>>> +		    sched_clock_cpu(cpu) < cpu_rq(cpu)->cache_hot_timeout)
>>> +			return -1;
>>
>> Just wondering,
>>
>> Similar to how select_idle_core() caches the "idle_cpu" if it ends up
>> finding one in its search for an idle core, would returning a "cache-hot
>> idle CPU" be better than returning previous CPU / current CPU if all
>> idle CPUs found during the search in select_idle_cpu() are marked
>> cache-hot?
>>
> 
> This is a good point; we can optimize this further. For now I have only
> sent a simple version to demonstrate how we can leverage the task's
> sleep time.
> 
>> Speaking of cache-hot idle CPUs, is netperf actually happier with
>> piling onto the current CPU?
> 
> Yes. Per my previous test, netperf of TCP_RR/UDP_RR really likes to
> put the waker and wakee together.
> 
>> I ask this because the logic seems to be
>> reserving the previous CPU for a task that dislikes migration but I
>> do not see anything in the wake_affine_idle() path that would make the
>> short sleeper proactively choose the previous CPU when the wakeup is
>> marked with the WF_SYNC flag. Let me know if I'm missing something?
>>
> 
> If I understand correctly, WF_SYNC is meant to let the wakee be woken up
> on the waker's CPU, rather than the wakee's previous CPU, because
> the waker goes to sleep after the wakeup. SIS_CACHE mainly cares about
> the wakee's previous CPU. We can only ensure that other wakees do not
> occupy the previous CPU; we do not increase the likelihood that
> wake_affine_idle() chooses the previous CPU.

Correct me if I'm wrong here,

Say a short sleeper is always woken up with the WF_SYNC flag. When the
task is dequeued, we mark the previous CPU where it ran as "cache-hot"
and keep other wakeups from picking it until the "cache_hot_timeout" has
passed. Let us assume a perfect world where the task wakes up before
the "cache_hot_timeout" expires. Logically this CPU was reserved all
this while for the short sleeper, but since the wakeup bears the WF_SYNC
flag, the whole reservation is ignored and the waker's LLC is explored.
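
To make the scenario concrete, here is a rough user-space sketch (not
kernel code). The names toy_rq, toy_dequeue(), toy_cpu_usable() and
toy_select_target() are made up for illustration, and the target
selection is deliberately oversimplified compared to the real
wake_affine_idle() / select_idle_sibling() paths:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_NR_CPUS 4

/* Hypothetical stand-in for the per-rq state the patch adds. */
struct toy_rq {
	uint64_t cache_hot_timeout;	/* ns; 0 means "not reserved" */
	bool idle;
};

static struct toy_rq toy_rqs[TOY_NR_CPUS];

/* Last task leaves the CPU: reserve it if the task is a short sleeper. */
static void toy_dequeue(int cpu, uint64_t now, uint64_t sleep_avg,
			uint64_t migration_cost)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_rqs[cpu].idle = true;
	if (sleep_avg && sleep_avg < migration_cost)
		toy_rqs[cpu].cache_hot_timeout = now + sleep_avg;
}

/* Models the __select_idle_cpu() check: reserved idle CPUs are skipped. */
static bool toy_cpu_usable(int cpu, uint64_t now)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return toy_rqs[cpu].idle && now >= toy_rqs[cpu].cache_hot_timeout;
}

/*
 * Assumption for this sketch only: a sync wakeup targets the waker's
 * CPU, anything else goes back to the previous CPU. Note that the
 * reservation on prev_cpu is never consulted or released here.
 */
static int toy_select_target(int prev_cpu, int waker_cpu, bool sync)
{
	return sync ? waker_cpu : prev_cpu;
}
```

With this model, a task dequeued from CPU 1 at t=1000 with a sleep_avg
of 200 keeps CPU 1 unusable for other wakees until t=1200, even if its
own sync wakeup at t=1100 lands on the waker's CPU 0.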

Should the timeout be cleared if the wakeup decides not to target the
previous CPU? (The default "sysctl_sched_migration_cost" is probably
small enough to curb any side effect that could show up here, but
if a genuine use-case warrants setting "sysctl_sched_migration_cost" to
a larger value, the wakeup path might be affected, since a lot of idle
targets would be overlooked while the CPUs stay marked cache-hot for a
longer duration.)
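
A minimal user-space sketch of what clearing the timeout could look
like. This is only an illustration of the idea, not a proposed patch:
resv_timeout[], resv_set(), resv_active() and resv_wakeup() are
hypothetical names, and a real implementation would have to consider
the cost and ordering of writing another CPU's rq field from the
wakeup path:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RESV_NR_CPUS 4

/* Hypothetical per-CPU reservation deadline in ns; 0 means "none". */
static uint64_t resv_timeout[RESV_NR_CPUS];

/* Dequeue side: reserve the CPU for the expected sleep duration. */
static void resv_set(int cpu, uint64_t now, uint64_t sleep_avg)
{
	assert(cpu >= 0 && cpu < RESV_NR_CPUS);
	resv_timeout[cpu] = now + sleep_avg;
}

/* Search side: is this CPU still being held for its previous task? */
static bool resv_active(int cpu, uint64_t now)
{
	assert(cpu >= 0 && cpu < RESV_NR_CPUS);
	return now < resv_timeout[cpu];
}

/*
 * Wakeup side: if the wakee ends up somewhere other than its previous
 * CPU, drop the reservation so other wakees can use that CPU again.
 */
static void resv_wakeup(int prev_cpu, int target_cpu)
{
	assert(prev_cpu >= 0 && prev_cpu < RESV_NR_CPUS);
	if (target_cpu != prev_cpu)
		resv_timeout[prev_cpu] = 0;
}
```

In this sketch, a wakeup that stays on the previous CPU leaves the
deadline alone, while a wakeup that moves elsewhere makes the previous
CPU immediately visible to the idle search again.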

Let me know what you think.

> 
> Say there are two tasks, t1 and t2, and t1's previous CPU is p1.
> We do not influence whether wake_affine_idle() chooses p1 when t1 is
> woken up, but we make sure t2 will not choose p1.
> 
>> To confirm this can you look at the trend in migration count with and
>> without the series? Also the ratio of cache-hot idle CPUs to number
>> of CPUs searched can help estimate overheads of additional search - I
>> presume SIS_UTIL is efficient at curbing the additional search in
>> a busy system.
> 
> OK, I'll collect these statistics.

Thank you :)

> 
> [..snip..]
> 

--
Thanks and Regards,
Prateek