[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKfTPtCEAVP88Y6szo84H6YeN7TXs6JSs8eKrToV-epL+Q3-=w@mail.gmail.com>
Date: Wed, 22 Sep 2021 17:05:19 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Mel Gorman <mgorman@...hsingularity.net>
Cc: Mike Galbraith <efault@....de>,
Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...nel.org>,
Valentin Schneider <valentin.schneider@....com>,
Aubrey Li <aubrey.li@...ux.intel.com>,
Barry Song <song.bao.hua@...ilicon.com>,
Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 2/2] sched/fair: Scale wakeup granularity relative to nr_running
On Wed, 22 Sept 2021 at 16:15, Vincent Guittot
<vincent.guittot@...aro.org> wrote:
>
> On Wed, 22 Sept 2021 at 15:20, Mel Gorman <mgorman@...hsingularity.net> wrote:
> >
> > On Wed, Sep 22, 2021 at 07:22:20AM +0200, Mike Galbraith wrote:
> > > On Tue, 2021-09-21 at 11:36 +0100, Mel Gorman wrote:
> > > > On Tue, Sep 21, 2021 at 05:52:32AM +0200, Mike Galbraith wrote:
> > > >
> > > >
> > > > > Preemption does rapidly run into diminishing return as load climbs for
> > > > > a lot of loads, but as you know, it's a rather sticky wicket because
> > > > > even when over-committed, preventing light control threads from slicing
> > > > > through (what can be a load's own work crew of) hogs can seriously
> > > > > injure performance.
> > > > >
> > > >
> > > > Turning this into a classic Rob Peter To Pay Paul problem. We don't know
> > > > if there is a light control thread that needs to run or not that affects
> > > > overall performance. It all depends on whether that control thread needs
> > > > to make progress for the overall workload or whether there are a mix of
> > > > workloads resulting in overloading.
> > >
> > > WRT overload, and our good buddies Peter and Paul :) I added...
> > > if (gran >= sysctl_sched_latency >> 1)
> > > trace_printk("runnable:%d preempt disabled\n",cfs_rq->nr_running);
> > > ...to watch, and met the below when I.. logged in.
> > >
> > > homer:..debug/tracing # tail -20 trace
> > > X-2229 [002] d..5. 60.690322: wakeup_gran: runnable:9 preempt disabled
> > > X-2229 [002] d..5. 60.690325: wakeup_gran: runnable:10 preempt disabled
> > > X-2229 [002] d..5. 60.690330: wakeup_gran: runnable:11 preempt disabled
> > > X-2229 [002] d..5. 60.690363: wakeup_gran: runnable:13 preempt disabled
> > > X-2229 [002] d..5. 60.690377: wakeup_gran: runnable:14 preempt disabled
> > > X-2229 [002] d..5. 60.690390: wakeup_gran: runnable:15 preempt disabled
> > > X-2229 [002] d..5. 60.690404: wakeup_gran: runnable:16 preempt disabled
> > > X-2229 [002] d..5. 60.690425: wakeup_gran: runnable:9 preempt disabled
> > > ksmserver-2694 [003] d..3. 60.690432: wakeup_gran: runnable:6 preempt disabled
> > > ksmserver-2694 [003] d..3. 60.690436: wakeup_gran: runnable:7 preempt disabled
> > > X-2229 [002] d..5. 60.690451: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [002] d..5. 60.690465: wakeup_gran: runnable:7 preempt disabled
> > > kmix-2736 [000] d..3. 60.690491: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [004] d..5. 92.889635: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [004] d..5. 92.889675: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [004] d..5. 92.889863: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [004] d..5. 92.889944: wakeup_gran: runnable:6 preempt disabled
> > > X-2229 [004] d..5. 92.889957: wakeup_gran: runnable:7 preempt disabled
> > > X-2229 [004] d..5. 92.889968: wakeup_gran: runnable:8 preempt disabled
> > > QXcbEventQueue-2740 [000] d..4. 92.890025: wakeup_gran: runnable:6 preempt disabled
> > > homer:..debug/tracing
> > >
> > > Watching 'while sleep 1; do clear;tail trace; done' with nothing but a
> > > kbuild running is like watching top. There's enough stacking during
> > > routine use of my desktop box that it runs into the tick granularity
> > > wall pretty much continuously, so 'overload' may want redefining.
> > >
> >
> > Ok, that's pretty convincing. You didn't mention if there were
> > interactivity glitches but it's possible. This is what I'm currently
> > testing but have no results for yet. It caps wakeup_gran at
> > sysctl_sched_latency.
> >
> > ---8<---
> > sched/fair: Scale wakeup granularity relative to nr_running
> >
> > Commit 8a99b6833c88 ("sched: Move SCHED_DEBUG sysctl to debugfs") moved
> > the kernel.sched_wakeup_granularity_ns sysctl under debugfs. One of the
> > reasons why this sysctl may be used may be for "optimising for throughput",
> > particularly when overloaded. The tool TuneD sometimes alters this for two
> > profiles e.g. "mssql" and "throughput-performance". At least version 2.9
> > does but it changed in master where it also will poke at debugfs instead.
> >
> > Internal parameters like sysctl_sched_wakeup_granularity are scaled
> > based on the number of CPUs due to sysctl_sched_tunable_scaling. For
> > simplicity, the timing figures in this changelog are based on
> > SCHED_TUNABLESCALING_NONE.
>
> This is a bit misleading because the platform that you used to
> highlight the problem has a 7ms sysctl_sched_wakeup_granularity. which
> is far more than your tick which should be 1ms
>
> >
> > During task migration or wakeup, a decision is made on whether
> > to preempt the current task or not. To limit over-scheduled,
> > sysctl_sched_wakeup_granularity delays the preemption to allow at least 1ms
> > of runtime before preempting. However, when a domain is heavily overloaded
> > (e.g. hackbench), the degree of over-scheduling is still severe. This is
> > problematic as time is wasted rescheduling tasks that could instead be
> > used by userspace tasks.
> >
> > However, care must be taken. Even if a system is overloaded, there may
> > be high priority threads that must still be able to run. Mike Galbraith
> > explained the contraints as follows;
> >
> > CFS came about because the O1 scheduler was unfair to the
> > point it had starvation problems. People pretty much across the
> > board agreed that a fair scheduler was a much way better way
> > to go, and CFS was born. It didn't originally have the sleep
> > credit business, but had to grow it to become _short term_ fair.
> > Ingo cut the sleep credit in half because of overscheduling, and
> > that has worked out pretty well all told.. but now you're pushing
> > it more in the unfair direction, all the way to extremely unfair
> > for anything and everything very light.
> >
> > Fairness isn't the holy grail mind you, and at some point, giving
> > up on short term fairness certainly isn't crazy, as proven by your
> > hackbench numbers and other numbers we've seen over the years,
> > but taking bites out of the 'CF' in the CFS that was born to be a
> > corner-case killer is.. worrisome. The other shoe will drop.. it
> > always does :)
> >
> > This patch scales the wakeup granularity based on the number of running
> > tasks on the CPU relative to
> >
> > sched_nr_disable_gran = sysctl_sched_latency / sysctl_sched_wakeup_granularity
> >
> > By default, this will allow the wakeup_gran to scale from
> > sysctl_sched_wakeup_granularity up to sysctl_sched_wakeup_granularity up to
> > sysctl_sched_latency depending on the number of running tasks on a cfs_rq.
> > By default, the limit is 6ms.
> >
> > Note that the TuneD throughput-performance profile allows up to 15ms
> > for sysctl_sched_latency (ignoring scaling) but there is no explanation
> > why such a long period was necessary or why sched_latency_ns is also
> > not adjusted. The intent may have been to disable wakeup preemption
> > or it might be an oversight. An internet search for instances where
> > sysctl_sched_wakeup_granularity parameter are tuned to high values offer
> > either no explanation or a broken one.
> >
> > TODO: Results positive or negative
> >
> > Signed-off-by: Mel Gorman <mgorman@...hsingularity.net>
> > ---
> > kernel/sched/fair.c | 64 ++++++++++++++++++++++++++++++++++++++++++-------
> > kernel/sched/features.h | 6 +++++
> > 2 files changed, 61 insertions(+), 9 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ff69f245b939..5ec3b12039d6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -84,6 +84,14 @@ static unsigned int normalized_sysctl_sched_wakeup_granularity = 1000000UL;
> >
> > const_debug unsigned int sysctl_sched_migration_cost = 500000UL;
> >
> > +/*
> > + * This value is kept at sysctl_sched_latency / sysctl_sched_wakeup_granularity
> > + *
> > + * This influences the decision on whether a waking task can preempt a running
> > + * task.
> > + */
> > +static unsigned int sched_nr_disable_gran = 6;
> > +
> > int sched_thermal_decay_shift;
> > static int __init setup_sched_thermal_decay_shift(char *str)
> > {
> > @@ -627,6 +635,9 @@ int sched_update_scaling(void)
> > sched_nr_latency = DIV_ROUND_UP(sysctl_sched_latency,
> > sysctl_sched_min_granularity);
> >
> > + sched_nr_disable_gran = DIV_ROUND_UP(sysctl_sched_latency,
> > + sysctl_sched_wakeup_granularity);
> > +
> > #define WRT_SYSCTL(name) \
> > (normalized_sysctl_##name = sysctl_##name / (factor))
> > WRT_SYSCTL(sched_min_granularity);
> > @@ -4511,7 +4522,8 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > }
> >
> > static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se);
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > + struct sched_entity *se);
> >
> > /*
> > * Pick the next process, keeping these things in mind, in this order:
> > @@ -4550,16 +4562,16 @@ pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > second = curr;
> > }
> >
> > - if (second && wakeup_preempt_entity(second, left) < 1)
> > + if (second && wakeup_preempt_entity(NULL, second, left) < 1)
>
> Why not applying the same policy here ? the tick can also prevent
> current task to move forward
>
> > se = second;
> > }
> >
> > - if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1) {
> > + if (cfs_rq->next && wakeup_preempt_entity(NULL, cfs_rq->next, left) < 1) {
> > /*
> > * Someone really wants this to run. If it's not unfair, run it.
> > */
> > se = cfs_rq->next;
> > - } else if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1) {
> > + } else if (cfs_rq->last && wakeup_preempt_entity(NULL, cfs_rq->last, left) < 1) {
> > /*
> > * Prefer last buddy, try to return the CPU to a preempted task.
> > */
> > @@ -7044,9 +7056,42 @@ balance_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> > }
> > #endif /* CONFIG_SMP */
> >
> > -static unsigned long wakeup_gran(struct sched_entity *se)
> > +static unsigned long
> > +select_wakeup_gran(struct cfs_rq *cfs_rq)
> > +{
> > + unsigned int nr_running, threshold;
> > +
> > + if (!cfs_rq || !sched_feat(SCALE_WAKEUP_GRAN))
> > + return sysctl_sched_wakeup_granularity;
> > +
> > + /* !GENTLE_FAIR_SLEEPERS has one overload threshold. */
> > + if (!sched_feat(GENTLE_FAIR_SLEEPERS)) {
> > + if (cfs_rq->nr_running <= sched_nr_disable_gran)
>
> cfs_rq->nr_running reflects the number of sched entities in the cfs
> but not the number of running tasks which reflected in h_nr_running
>
> Also do you want to take into account only tasks in this cfs and its
> children or on all cfs on this rq ?
>
> > + return sysctl_sched_wakeup_granularity;
> > +
> > + return sysctl_sched_latency;
> > + }
> > +
> > + /* GENTLE_FAIR_SLEEPER has two overloaded thresholds. */
> > + nr_running = cfs_rq->nr_running;
> > + threshold = sched_nr_disable_gran >> 1;
> > +
> > + /* No overload. */
> > + if (nr_running <= threshold)
> > + return sysctl_sched_wakeup_granularity;
>
> TBH I don't like these "no overload", "light overload" ... They don't
> have any real meaning apart from that it might work for your platform
> and your hackbench test.
> We already had have people complaining that small cfs task does not
> preempt fast enough curr task as an example
>
> There is no explanation why these values are the correct ones and not
> but are just some random heuristic results and we are trying to remove
> platform heuristic and to not add new
>
> > +
> > + /* Light overload. */
> > + if (nr_running <= sched_nr_disable_gran)
> > + return sysctl_sched_latency >> 1;
> > +
> > + /* Heavy overload. */
> > + return sysctl_sched_latency;
>
> Why should a thread without any relationship with the curr, not
> preempt it because there are already a lot of running threads ? In
> your case, you want hackbench threads to not preempt each others
> because they tries to use same resources so it's probably better to
> let the current one to move forward but that's not a universal policy.
>
> side question: Have you try to change the nice priority which also
> impact whether a thread can preempt curr ?
There were also some discussions about adding a nice_latency prio so a
task/group can be set if they want to be nice with others from a
scheduling latency pov. One open point was to define what it mean to
be nice from a latency pov
>
> > +}
> > +
> > +static unsigned long
> > +wakeup_gran(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > - unsigned long gran = sysctl_sched_wakeup_granularity;
> > + unsigned long gran = select_wakeup_gran(cfs_rq);
> >
> > /*
> > * Since its curr running now, convert the gran from real-time
> > @@ -7079,14 +7124,15 @@ static unsigned long wakeup_gran(struct sched_entity *se)
> > *
> > */
> > static int
> > -wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
> > +wakeup_preempt_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr,
> > + struct sched_entity *se)
> > {
> > s64 gran, vdiff = curr->vruntime - se->vruntime;
> >
> > if (vdiff <= 0)
> > return -1;
> >
> > - gran = wakeup_gran(se);
> > + gran = wakeup_gran(cfs_rq, se);
> > if (vdiff > gran)
> > return 1;
> >
> > @@ -7191,7 +7237,7 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
> > return;
> >
> > update_curr(cfs_rq_of(se));
> > - if (wakeup_preempt_entity(se, pse) == 1) {
> > + if (wakeup_preempt_entity(cfs_rq, se, pse) == 1) {
>
> like for update_curr above, cfs_rq can be wrong because se could have changed
>
> > /*
> > * Bias pick_next to pick the sched entity that is
> > * triggering this preemption.
> > diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> > index 7f8dace0964c..d041d7023029 100644
> > --- a/kernel/sched/features.h
> > +++ b/kernel/sched/features.h
> > @@ -95,3 +95,9 @@ SCHED_FEAT(LATENCY_WARN, false)
> >
> > SCHED_FEAT(ALT_PERIOD, true)
> > SCHED_FEAT(BASE_SLICE, true)
> > +
> > +/*
> > + * Scale sched_wakeup_granularity dynamically based on the number of running
> > + * tasks up to a cap of sysctl_sched_latency.
> > + */
> > +SCHED_FEAT(SCALE_WAKEUP_GRAN, true)
Powered by blists - more mailing lists