Message-ID: <CAKfTPtBw6MLuJoEF09airBXOFTEciM5Bd-inu5q1aNkVn+HTng@mail.gmail.com>
Date: Tue, 22 Oct 2024 17:08:40 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Pierre Gondois <pierre.gondois@....com>
Cc: linux-kernel@...r.kernel.org, Hongyan Xia <hongyan.xia2@....com>,
Christian Loehle <christian.loehle@....com>, Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>,
Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH 1/1] sched/fair: Update blocked averages on tick
On Mon, 21 Oct 2024 at 11:47, Pierre Gondois <pierre.gondois@....com> wrote:
>
> Hello Vincent,
>
> On 10/15/24 14:44, Vincent Guittot wrote:
> > On Fri, 11 Oct 2024 at 14:32, Pierre Gondois <pierre.gondois@....com> wrote:
> >>
> >> The Energy Aware Scheduler (EAS) relies on CPU/task utilization
> >> signals. On an idle CPU, the blocked load is updated during
> >> load balancing.
> >>
> >> sd->balance_interval increases with the number of CPUs in the domain.
> >> On an Arm DynamIQ system, sched domains containing only CPUs with the
> >> same capacity do not exist. On a Pixel6 with 4 little, 2 mid, 2 big CPUs:
> >> - sd->min_interval = 8
> >> - sd->max_interval = 16
> >>
> >> The balance interval is doubled if the system is balanced, meaning
> >> that a balanced system will likely update blocked load every 16ms.
> >
> > The real max boundary is LOAD_AVG_PERIOD, which is used to update
> > nohz.next_blocked. This is the max period between 2 updates of blocked
> > load. The other ones are opportunistic updates when a normal load
> > balance is triggered.
>
> I meant that on an idle CPU with the tick still running, the cfs_rq load
> is only updated through this path:
> sched_balance_trigger() {
>         if (time_after_eq(jiffies, rq->next_balance))
>                 raise_softirq(SCHED_SOFTIRQ);
> }
>
> ...
>
> sched_balance_softirq()
> \-sched_balance_update_blocked_averages()
>
> If the next_balance happens every 16ms, this means feec() might operate
> its task placement using an (up to) 16ms old util signal. The CPU might
> thus look busier than it actually is.
This is true for all idle CPUs, not only the local one whose tick is
still firing while idle.
Yes, and it can be up to 32 ms because the real max limit between 2
updates is currently set to LOAD_AVG_PERIOD. You can probably find a
specific test case on a board that happens to benefit from some "random"
update during idle ticks, but you will still have old values elsewhere
on the system and you don't have any control over their max period.
You could also reduce min_interval to a lower value, but this will not
take care of the other idle CPUs.
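
To make the bound concrete, here is a small self-contained model of the
nohz.next_blocked gating (an illustrative sketch only, not the fair.c
code; the 4 ms tick and the helper names are made up for the example):

/*
 * Illustrative model: blocked load of idle CPUs is refreshed at most
 * once per LOAD_AVG_PERIOD (32) ms, however often the tick or the load
 * balancer fires in between. Sketch only, not the kernel code.
 */
#include <stdbool.h>
#include <stdio.h>

#define LOAD_AVG_PERIOD_MS	32

struct nohz_model {
	bool has_blocked;		/* some idle CPU still carries blocked load */
	unsigned long next_blocked;	/* earliest time (ms) of the next stats update */
};

static bool stats_update_due(struct nohz_model *nohz, unsigned long now_ms)
{
	return nohz->has_blocked && now_ms >= nohz->next_blocked;
}

static void stats_update_done(struct nohz_model *nohz, unsigned long now_ms)
{
	nohz->next_blocked = now_ms + LOAD_AVG_PERIOD_MS;
}

int main(void)
{
	struct nohz_model nohz = { .has_blocked = true, .next_blocked = 0 };
	unsigned long last = 0;

	/* a 4 ms tick over 100 ms */
	for (unsigned long now = 0; now <= 100; now += 4) {
		if (stats_update_due(&nohz, now)) {
			printf("update at %3lu ms (staleness %2lu ms)\n",
			       now, now - last);
			last = now;
			stats_update_done(&nohz, now);
		}
	}
	return 0;
}

With a 4 ms tick it prints one update every 32 ms: the blocked load is
refreshed at most once per LOAD_AVG_PERIOD, which is the up-to-32 ms
staleness mentioned above.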
>
> >
> >>
> >> The find_energy_efficient_cpu() function might thus rely on outdated
> >> util signals to place tasks, leading to bad energy placement.
> >
> > Moving from 8 ms to 16 ms is what makes the difference for you?
>
> With this patch, the cfs_rq signal of an idle CPU is updated every tick,
> so every 4ms.
If the CPU is kept in the shallowest idle state, it is not expected to
stay idle for long, which also means its blocked load will not be that
outdated.
And what if the tick is 1 ms or 10 ms?
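(For reference: the tick period is 1000/HZ ms, i.e. 1 ms with HZ=1000,
4 ms with HZ=250 and 10 ms with HZ=100, so a per-tick update would run
anywhere between 100 and 1000 times per second on each idle CPU whose
tick is not stopped.)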
>
> >
> > The LOAD_AVG_PERIOD max period has been used as a default value, but if
> > it's too long, we could consider changing the max period between 2
> > updates.
> >
> >>
> >> Update blocked load on sched tick if:
> >> - the rq is idle
> >> - the load balancer will not be triggered.
> >>
> >> Signed-off-by: Pierre Gondois <pierre.gondois@....com>
> >> ---
> >> kernel/sched/fair.c | 21 ++++++++++++++++-----
> >> 1 file changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 225b31aaee55..2f03bd10ac7a 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -9841,15 +9841,12 @@ static unsigned long task_h_load(struct task_struct *p)
> >> }
> >> #endif
> >>
> >> -static void sched_balance_update_blocked_averages(int cpu)
> >> +static void update_blocked_averages(struct rq *rq)
> >> {
> >> bool decayed = false, done = true;
> >> - struct rq *rq = cpu_rq(cpu);
> >> - struct rq_flags rf;
> >>
> >> - rq_lock_irqsave(rq, &rf);
> >> - update_blocked_load_tick(rq);
> >> update_rq_clock(rq);
> >> + update_blocked_load_tick(rq);
> >>
> >> decayed |= __update_blocked_others(rq, &done);
> >> decayed |= __update_blocked_fair(rq, &done);
> >> @@ -9857,6 +9854,18 @@ static void sched_balance_update_blocked_averages(int cpu)
> >> update_blocked_load_status(rq, !done);
> >> if (decayed)
> >> cpufreq_update_util(rq, 0);
> >> +}
> >> +
> >> +static void sched_balance_update_blocked_averages(int cpu)
> >> +{
> >> + struct rq *rq = cpu_rq(cpu);
> >> + struct cfs_rq *cfs_rq;
> >> + struct rq_flags rf;
> >> +
> >> + cfs_rq = &rq->cfs;
> >> +
> >> + rq_lock_irqsave(rq, &rf);
> >> + update_blocked_averages(rq);
> >> rq_unlock_irqrestore(rq, &rf);
> >> }
> >>
> >> @@ -12877,6 +12886,8 @@ void sched_balance_trigger(struct rq *rq)
> >>
> >> if (time_after_eq(jiffies, rq->next_balance))
> >> raise_softirq(SCHED_SOFTIRQ);
> >> + else if (idle_cpu(rq->cpu))
> >> + update_blocked_averages(rq);
> >
> > It would be good to explain why you don't need the rq lock here.
>
> This is a mistake, the lock is indeed required.
update_blocked_averages() can take time, as we go through all the cgroups,
and interrupts are still disabled here whereas they are not during the
softirq. Some people have already complained that running
update_blocked_averages() can impact system latency; that is why we
moved the update out of the scheduler path for the newly idle case, for
example.
IMO, you should keep the update in the softirq and take advantage of
this tick to also update the other idle CPUs whose tick is stopped.
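Something along these lines, purely as a hypothetical, untested sketch of
the idea (reusing nohz.next_blocked as the staleness threshold here is
only one option):

/*
 * Hypothetical sketch, untested: keep the heavy cgroup walk in the
 * softirq, where interrupts are enabled, and only do a cheap
 * staleness check here. nohz_balancer_kick() below can then also
 * kick a stats update for the nohz-idle CPUs.
 */
void sched_balance_trigger(struct rq *rq)
{
	if (time_after_eq(jiffies, rq->next_balance) ||
	    (idle_cpu(rq->cpu) &&
	     time_after_eq(jiffies, READ_ONCE(nohz.next_blocked))))
		raise_softirq(SCHED_SOFTIRQ);

	nohz_balancer_kick(rq);
}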
>
> >
> > There is no rate limit, so we can do this every tick (possibly 1 ms)
> > when staying in the shallowest idle state.
>
> I'm not sure we understood each other, as this patch should not be
> related to NOHZ CPUs. So please let me know if I used a wrong path as
> you said, or if a rate limit would be needed.
If your problem is that the PELT values of idle CPUs are too old because
the load balance interval and the next_blocked period are too high, then
you should probably fix it for all idle CPUs through the next_blocked
period.
>
> >
> > So it looks better to update the period between 2 updates of blocked
> > load instead of adding a new path.
> >
> >>
> >> nohz_balancer_kick(rq);
> >> }
> >> --
> >> 2.25.1
> >>
>
> Regards,
> Pierre