Message-ID: <ZkOODAALS9HQ3B9A@chenyu5-mobl2>
Date: Wed, 15 May 2024 00:15:08 +0800
From: Chen Yu <yu.c.chen@...el.com>
To: K Prateek Nayak <kprateek.nayak@....com>
CC: Peter Zijlstra <peterz@...radead.org>, <mingo@...hat.com>,
<juri.lelli@...hat.com>, <vincent.guittot@...aro.org>,
<dietmar.eggemann@....com>, <rostedt@...dmis.org>, <bsegall@...gle.com>,
<mgorman@...e.de>, <bristot@...hat.com>, <vschneid@...hat.com>,
<linux-kernel@...r.kernel.org>, <wuyun.abel@...edance.com>,
<tglx@...utronix.de>, <efault@....de>, <tim.c.chen@...el.com>,
<yu.c.chen.y@...il.com>
Subject: Re: [RFC][PATCH 10/10] sched/eevdf: Use sched_attr::sched_runtime to
set request/slice suggestion
On 2024-05-14 at 20:53:16 +0530, K Prateek Nayak wrote:
> Hello Chenyu,
>
> On 5/14/2024 2:48 PM, Chen Yu wrote:
> >>> [..snip..]
> >>> /*
> >>> * Scan the LLC domain for idle CPUs; this is dynamically regulated by
> >>> * comparing the average scan cost (tracked in sd->avg_scan_cost) against the
> >>> @@ -7384,10 +7402,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> >>> if (sched_feat(SIS_UTIL)) {
> >>> sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
> >>> if (sd_share) {
> >>> - /* because !--nr is the condition to stop scan */
> >>> - nr = READ_ONCE(sd_share->nr_idle_scan) + 1;
> >>> + nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan));
> >>> /* overloaded LLC is unlikely to have idle cpu/core */
> >>> - if (nr == 1)
> >>> + if (nr <= 0)
> >>
> >> I was wondering if this would preserve the current behavior with
> >> SIS_FAST toggled off? Since the implementation below still does a
> >> "--nr <= 0" , wouldn't it effectively visit one CPU less overall now?
> >>
> >> Have you tried something similar to the below hunk?
> >>
> >> /* because !--nr is the condition to stop scan */
> >> nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
> >> if (nr == 1)
> >> return -1;
> >>
> >
> > Yeah, right, to keep the scan depth consistent, the "+1" should be kept.
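
Something like below (untested) should keep the scan depth consistent with
the current code when SIS_FAST is disabled:

	if (sched_feat(SIS_UTIL)) {
		sd_share = rcu_dereference(per_cpu(sd_llc_shared, target));
		if (sd_share) {
			/* because !--nr is the condition to stop scan */
			nr = adjust_idle_scan(p, READ_ONCE(sd_share->nr_idle_scan)) + 1;
			/* overloaded LLC is unlikely to have idle cpu/core */
			if (nr == 1)
				return -1;
		}
	}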
> >
> >> I agree with Mike that looking at slice to limit scan-depth seems odd.
> >> My experience with netperf is that the workload cares more about the
> >> server-client being co-located on the closest cache domain and by
> >> limiting scan-depth using slice, this is indirectly achieved since all
> >> the wakeups carry the WF_SYNC flag.
> >>
> >
> > Exactly. This is the original motivation.
> >
> >> P.S. have you tried using the slice in __select_idle_cpu()? Similar to
> >> sched_idle_cpu() check, perhaps an additional sched_preempt_short_cpu()
> >> which compares rq->curr->se.slice with the waking task's slice and
> >> returns that cpu if SIS_SHORT can help run the workload quicker?
> >
> > This is a good idea, and it seems to benefit PREEMPT_SHORT. If a customized
> > task slice is introduced, we can leverage this hint for latency-related
> > optimization. Task wakeup is one thing, and I can also think of other aspects,
> > like idle load balance, etc. I'm not sure what the proper usage of the
> > task slice is though, which is why I sent this RFC.
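
Just to make sure I understand the idea, a rough (untested) sketch could look
like below, where sched_preempt_short_cpu() and SIS_SHORT are only placeholder
names from this thread, and the helper would be checked in __select_idle_cpu()
next to the sched_idle_cpu() case:

	/*
	 * Return true if the waking task @p requests a shorter slice than
	 * the current task on @cpu, so the wakeup may pick this CPU even
	 * if it is not idle.
	 */
	static inline bool sched_preempt_short_cpu(struct task_struct *p, int cpu)
	{
		struct rq *rq = cpu_rq(cpu);

		if (!sched_feat(SIS_SHORT))
			return false;

		/* only compare two fair tasks' customized slices */
		if (p->sched_class != &fair_sched_class ||
		    rq->curr->sched_class != &fair_sched_class)
			return false;

		return p->se.slice < rq->curr->se.slice;
	}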
> >
> >> Note:
> >> This will not work if the SIS scan itself is the largest overhead in the
> >> wakeup cycle and not the task placement itself. Previously during
> >> SIS_UTIL testing, to measure the overheads of scan vs placement, we
> >> would do a full scan but return the result that SIS_UTIL would have
> >> returned to determine the overhead of the search itself.
> >>
> >
> > Regarding the task placement, do you mean the time between a task is enqueued
> > and picked up? Do you have any recommendation which workload can expose the
> > scan overhead most?
>
> Sorry for not being clear here. From what I've observed in the past,
> there are two dimensions to select_idle_sibling():
>
> i) Placement: Final CPU select_idle_sibling() returns
> ii) Search: Do we find an idle core/CPU in select_idle_sibling()
>
I see.
> In case of netperf, I've observed that i) is more important than ii)
> wherein placement of the client on the same core/thread as the server
> results in better performance vs finding an idle CPU on a remote LLC.
How about placing the client on the same core/thread vs. finding an idle CPU
within the local LLC?
> For hackbench/tbench, when runqueues are under high utilization (~75%),
> reduction in search time ii) seems to be more beneficial.
>
I can understand that hackbench is idle-CPU sensitive because it has an MxN
wakeup relationship and can easily result in task stacking. As for tbench, it
should be similar to netperf, so I'm not sure why it does not fall into the
i) > ii) case. Is it because you were testing netperf TCP_RR (full-duplex)
while tbench is half-duplex (the former has a stronger cache-locality requirement)?
> There was also a wakeup-with-IPI / wakeup-without-IPI angle, highlighted by
> Mathieu last year, that I never quite got to the bottom of. I'll go
> get some more data on that front and give your patch a try. Expect
> results in a couple of days.
Sure, I'd be glad if you could give the patch a try.
thanks,
Chenyu