Date:   Thu, 29 Sep 2022 22:49:24 +0530
From:   K Prateek Nayak <kprateek.nayak@....com>
To:     Chen Yu <yu.c.chen@...el.com>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Vincent Guittot <vincent.guittot@...aro.org>,
        Tim Chen <tim.c.chen@...el.com>,
        Mel Gorman <mgorman@...hsingularity.net>,
        Juri Lelli <juri.lelli@...hat.com>,
        Rik van Riel <riel@...riel.com>,
        Aaron Lu <aaron.lu@...el.com>,
        Abel Wu <wuyun.abel@...edance.com>,
        Yicong Yang <yangyicong@...ilicon.com>,
        "Gautham R . Shenoy" <gautham.shenoy@....com>,
        Ingo Molnar <mingo@...hat.com>,
        Dietmar Eggemann <dietmar.eggemann@....com>,
        Steven Rostedt <rostedt@...dmis.org>,
        Ben Segall <bsegall@...gle.com>,
        Daniel Bristot de Oliveira <bristot@...hat.com>,
        Valentin Schneider <vschneid@...hat.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [RFC PATCH] sched/fair: Choose the CPU where short task is
 running during wake up

Hello Chenyu,

Thank you for looking into this issue.

On 9/29/2022 10:55 AM, Chen Yu wrote:
> Hi Prateek,
> [..snip..]
>
>>>  kernel/sched/fair.c | 31 ++++++++++++++++++++++++++++++-
>>>  1 file changed, 30 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 914096c5b1ae..7519ab5b911c 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6020,6 +6020,19 @@ static int wake_wide(struct task_struct *p)
>>>  	return 1;
>>>  }
>>>  
>>> +/*
>>> + * If a task switches in and then voluntarily relinquishes the
>>> + * CPU quickly, it is regarded as a short running task.
>>> + * sysctl_sched_min_granularity is chosen as the threshold,
>>> + * as this value is the minimal slice if there are too many
>>> + * runnable tasks, see __sched_period().
>>> + */
>>> +static int is_short_task(struct task_struct *p)
>>> +{
>>> +	return (p->se.sum_exec_runtime <=
>>> +		(p->nvcsw * sysctl_sched_min_granularity));
>>> +}
>>> +
>>>  /*
>>>   * The purpose of wake_affine() is to quickly determine on which CPU we can run
>>>   * soonest. For the purpose of speed we only consider the waking and previous
>>> @@ -6050,7 +6063,8 @@ wake_affine_idle(int this_cpu, int prev_cpu, int sync)
>>>  	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
>>>  		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
>>>  
>>> -	if (sync && cpu_rq(this_cpu)->nr_running == 1)
>>> +	if ((sync && cpu_rq(this_cpu)->nr_running == 1) ||
>>> +	    is_short_task(cpu_curr(this_cpu)))
>>
>> This change seems to optimize for affine wakeups, which benefits
>> tasks with a producer-consumer pattern, but is not ideal for Stream.
>> Currently the logic will do an affine wakeup even if the sync
>> flag is not set:
>>
>>           stream-4135    [029] d..2.   353.580953: sched_waking: comm=stream pid=4129 prio=120 target_cpu=082
>>           stream-4135    [029] d..2.   353.580957: select_task_rq_fair: wake_affine_idle: Select this_cpu: sync(0) rq->nr_running(1) is_short_task(1)
>>           stream-4135    [029] d..2.   353.580960: sched_migrate_task: comm=stream pid=4129 prio=120 orig_cpu=82 dest_cpu=30
>>           <idle>-0       [030] dNh2.   353.580993: sched_wakeup: comm=stream pid=4129 prio=120 target_cpu=030
>>
>> I believe the sync flag should be taken into consideration when
>> going for an affine wakeup. Also, the check for short running tasks
>> could come at the end, after checking if prev_cpu is an available_idle_cpu.
>>
> We can move the short running check after the prev_cpu check. If we
> add the sync flag check, would it shrink the coverage of this change?

I've run some tests where I moved the condition that checks for
short running tasks towards the end of wake_affine_idle() and also
incorporated Tim's suggestion in wake_affine_idle(). I've shared
the results in a parallel thread.

> Since I found that only limited scenarios enable the sync flag,
> and we want to make the short running check a generic optimization.
> But yes, we can test with/without the sync flag constraint to see
> which one gives better data.
>>>  		return this_cpu;
>>>  
>>>  	if (available_idle_cpu(prev_cpu))
>>> @@ -6434,6 +6448,21 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>>  			/* overloaded LLC is unlikely to have idle cpu/core */
>>>  			if (nr == 1)
>>>  				return -1;
>>> +
>>> +			/*
>>> +			 * If nr is smaller than 60% of llc_weight, it
>>> +			 * indicates that the util_avg% is higher than 50%.
>>> +			 * This is calculated by SIS_UTIL in
>>> +			 * update_idle_cpu_scan(). The 50% util_avg indicates
>>> +			 * a half-busy LLC domain. System busier than this
>>> +			 * level could lower its bar to choose a compromised
>>> +			 * "idle" CPU. If the waker on target CPU is a short
>>> +			 * task and the wakee is also a short task, pick
>>> +			 * target directly.
>>> +			 */
>>> +			if (!has_idle_core && (5 * nr < 3 * sd->span_weight) &&
>>> +			    is_short_task(p) && is_short_task(cpu_curr(target)))
>>> +				return target;
>>
>> The pileup seen in hackbench could also be a result of an early
>> bailout here for smaller LLCs, but I currently don't have any data
>> to substantiate that claim.
>>
>>>  		}
>>>  	}
>>>  
>> Please let me know if you need any more data from the test
>> system for any of the benchmarks covered or if you would like
>> me to run any other benchmark on the test system.
> Thank you for your testing. I'll enable SNC to divide the LLC domain
> into smaller ones and see if the issue can be reproduced
> on my platform too; then I'll update my findings.

Thank you. Testing with SNC enabled should get the LLC size
closer to that of the Zen3 system I've tested on.

> 
> thanks,
> Chenyu
--
Thanks and Regards,
Prateek
