[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20170215153056.GB6691@lerouge>
Date: Wed, 15 Feb 2017 16:30:58 +0100
From: Frederic Weisbecker <fweisbec@...il.com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Matt Fleming <matt@...eblueprint.co.uk>,
Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
Mike Galbraith <umgwanakikbuti@...il.com>,
Morten Rasmussen <morten.rasmussen@....com>,
stable@...r.kernel.org,
Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed
NO_HZ accounting
On Thu, Feb 09, 2017 at 04:07:53PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> > The calculation for the next sample window when exiting NOH_HZ idle
> > does not handle the fact that we may not have reached the next sample
> > window yet, i.e. that we came out of idle between sample windows.
> >
> > If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> > window time when we want idle but before the next sample window, we
> > will add an unnecessary LOAD_FREQ delay to the load average
> > accounting, delaying any update for potentially ~9seconds.
> >
> > This can result in huge spikes in the load average values due to
> > per-cpu uninterruptible task counts being out of sync when accumulated
> > across all CPUs.
> >
> > It's safe to update the per-cpu active count if we wake between sample
> > windows because any load that we left in 'calc_load_idle' will have
> > been zero'd when the idle load was folded in calc_global_load().
>
> Right, so differently put; the problem is that we check against the
> 'stale' rq->calc_load_update, while the current and effective period
> boundary is 'calc_load_update'.
>
> So, when rq->calc_load_update < jiffies < calc_load_update, we end up
> setting the next-update to calc_load_update+LOAD_FREQ, where it should
> have been calc_load_update.
>
> > diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> > index a2d6eb71f06b..a7a6f3646970 100644
> > --- a/kernel/sched/loadavg.c
> > +++ b/kernel/sched/loadavg.c
>
> > @@ -210,10 +211,16 @@ void calc_load_exit_idle(void)
> > * We woke inside or after the sample window, this means we're already
> > * accounted through the nohz accounting, so skip the entire deal and
> > * sync up for the next window.
> > + *
> > + * The next window is 'calc_load_update' if we haven't reached it yet,
> > + * and 'calc_load_update + 10' if we're inside the current window.
Hmm, the comment doesn't seem to match the code.
> > */
> > + next_window = calc_load_update;
> > +
> > + if (time_in_range_open(jiffies, next_window, next_window + 10)
> > + next_window += LOAD_FREQ;
> > +
> > + this_rq->calc_load_update = next_window;
> > }
>
> So I don't much like the time_in_range_open() thing. The simpler patch
> which you tested to also work was:
>
> diff --git a/kernel/sched/loadavg.c b/kernel/sched/loadavg.c
> index 7296b7308eca..cfb47bd0ee50 100644
> --- a/kernel/sched/loadavg.c
> +++ b/kernel/sched/loadavg.c
> @@ -201,6 +201,8 @@ void calc_load_exit_idle(void)
> {
> struct rq *this_rq = this_rq();
>
> + this_rq->calc_load_update = calc_load_update;
> +
> /*
> * If we're still before the sample window, we're done.
> */
> @@ -212,7 +214,6 @@ void calc_load_exit_idle(void)
> * accounted through the nohz accounting, so skip the entire deal and
> * sync up for the next window.
> */
> - this_rq->calc_load_update = calc_load_update;
> if (time_before(jiffies, this_rq->calc_load_update + 10))
> this_rq->calc_load_update += LOAD_FREQ;
> }
>
> But the problem there is that we unconditionally issue that store. Now
> I've no idea how much of a problem that is, and it certainly is the
> simplest form (+- comments that need updating), so maybe that makes
> sense.
Well, that version looks fine to me.
Thanks.
Powered by blists - more mailing lists