Message-ID: <f2eafde3-b9e8-afc9-8934-ca8e597c33e6@amd.com>
Date: Tue, 14 May 2024 20:53:16 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Chen Yu <yu.c.chen@...el.com>
Cc: Peter Zijlstra <peterz@...radead.org>, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org, dietmar.eggemann@....com,
rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
bristot@...hat.com, vschneid@...hat.com, linux-kernel@...r.kernel.org,
wuyun.abel@...edance.com, tglx@...utronix.de, efault@....de,
tim.c.chen@...el.com, yu.c.chen.y@...il.com
Subject: Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to
set request/slice suggestion
Hello Chenyu,
On 5/14/2024 2:48 PM, Chen Yu wrote:
>>> [..snip..]
>>> /*
>>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
>>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
>>> @@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
>>> if (sched_feat(SIS_UTIL)) {
>>> sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
>>> if (sd_share) {
>>> -		/* because !--nr is the condition to stop scan */
>>> -		nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
>>> + nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
>>> /* overloaded LLC is unlikely to have idle cpu/core */
>>> - if (nr == 1)
>>> + if (nr <= 0)
>>
>> I was wondering if this would preserve the current behavior with
>> SIS_FAST toggled off? Since the implementation below still does a
>> "--nr <= 0", wouldn't it effectively visit one fewer CPU overall now?
>>
>> Have you tried something similar to the below hunk?
>>
>> /* because !--nr is the condition to stop scan */
>> nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
>> if (nr == 1)
>> return -1;
>>
>
> Yeah, right, to keep the scan depth consistent, the "+1" should be kept.
>
>> I agree with Mike that looking at slice to limit scan depth seems odd.
>> My experience with netperf is that the workload cares more about the
>> server and client being co-located on the closest cache domain, and by
>> limiting scan depth using slice this is indirectly achieved, since all
>> the wakeups carry the WF_SYNC flag.
>>
>
> Exactly. This is the original motivation.
>
>> P.S. have you tried using the slice in __select_idle_cpu()? Similar to
>> the sched_idle_cpu() check, perhaps an additional sched_preempt_short_cpu()
>> which compares rq->curr->se.slice with the waking task's slice and
>> returns that cpu if SIS_SHORT can help run the workload quicker?
>
> This is a good idea; it seems it could benefit PREEMPT_SHORT. If the
> customized task slice is introduced, we can leverage this hint for
> latency-related optimizations. Task wakeup is one thing, and I can also
> think of other aspects, like idle load balance, etc. I'm not sure what
> the proper usage of the task slice is though, which is why I sent this RFC.
>
>> Note:
>> This will not work if the SIS scan itself is the largest overhead in the
>> wakeup cycle and not the task placement itself. Previously during
>> SIS_UTIL testing, to measure the overheads of scan vs placement, we
>> would do a full scan but return the result that SIS_UTIL would have
>> returned to determine the overhead of the search itself.
>>
>
> Regarding the task placement, do you mean the time between when a task is
> enqueued and when it is picked up? Do you have any recommendation on which
> workload can expose the scan overhead the most?
Sorry for not being clear here. From what I've observed in the past,
there are two dimensions to select_idle_sibling():

i) Placement: the final CPU that select_idle_sibling() returns
ii) Search: whether we find an idle core/CPU in select_idle_sibling()

In the case of netperf, I've observed that i) is more important than ii),
wherein placing the client on the same core/thread as the server results
in better performance vs. finding an idle CPU on a remote LLC.

For hackbench/tbench, when runqueues are under high utilization (~75%),
reducing the search time ii) seems to be more beneficial.
There was also a wakeup-with-IPI vs. wakeup-without-IPI angle that I never
quite got to the bottom of, which Mathieu highlighted last year. I'll go
get some more data on that front and give your patch a try. Expect
results in a couple of days.
Thank you.
>
> thanks,
> Chenyu
>
>>> return -1;
>>> }
>>> }
>>> [..snip..]
--
Thanks and Regards,
Prateek