linux-kernel - Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed NO


Message-ID: <20170215151210.GA6691@lerouge>
Date:   Wed, 15 Feb 2017 16:12:11 +0100
From:   Frederic Weisbecker <fweisbec@...il.com>
To:     Matt Fleming <matt@...eblueprint.co.uk>
Cc:     Peter Zijlstra <peterz@...radead.org>,
        Ingo Molnar <mingo@...nel.org>, linux-kernel@...r.kernel.org,
        Mike Galbraith <umgwanakikbuti@...il.com>,
        Morten Rasmussen <morten.rasmussen@....com>,
        stable@...r.kernel.org,
        Vincent Guittot <vincent.guittot@...aro.org>
Subject: Re: [PATCH] sched/loadavg: Avoid loadavg spikes caused by delayed
 NO_HZ accounting

On Wed, Feb 08, 2017 at 01:29:24PM +0000, Matt Fleming wrote:
> The calculation for the next sample window when exiting NO_HZ idle
> does not handle the fact that we may not have reached the next sample
> window yet

That sentence is hard to parse; it took me some time to figure out that
the two occurrences of "next sample window" may not refer to the same thing.

Maybe it would be clearer with something along the lines of:

"The calculation for the next sample window when exiting NO_HZ
 does not handle the fact that we may not have crossed any sample
 window during the NO_HZ period."

> If we wake from NO_HZ idle after the pending this_rq->calc_load_update
> window time when we want idle but before the next sample window

That too was hard to understand. How about:

"If we enter NO_HZ mode after a pending this_rq->calc_load_update
 and we exit from NO_HZ mode before the forthcoming sample window, ..."

> we will add an unnecessary LOAD_FREQ delay to the load average
> accounting, delaying any update for potentially ~9 seconds.
> 
> This can result in huge spikes in the load average values due to
> per-cpu uninterruptible task counts being out of sync when accumulated
> across all CPUs.
> 
> It's safe to update the per-cpu active count if we wake between sample
> windows because any load that we left in 'calc_load_idle' will have
> been zero'd when the idle load was folded in calc_global_load().
> 
> This issue is easy to reproduce before,
> 
>   commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
> 
> just by forking short-lived process pipelines built from ps(1) and
> grep(1) in a loop. I'm unable to reproduce the spikes after that
> commit, but the bug still seems to be present from code review.
> 
> Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Mike Galbraith <umgwanakikbuti@...il.com>
> Cc: Morten Rasmussen <morten.rasmussen@....com>
> Cc: Vincent Guittot <vincent.guittot@...aro.org>
> Cc: <stable@...r.kernel.org> # v3.5+
> Signed-off-by: Matt Fleming <matt@...eblueprint.co.uk>
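
To make the scenario in the changelog concrete, here is a rough standalone
model of it. This is only a sketch, not the kernel code: the helper names
(before(), exit_nohz_old(), exit_nohz_resync_first()), the HZ value and the
10-tick grace interval are assumptions chosen for illustration, and the second
helper is just one possible ordering, not necessarily what the patch or
Peter's proposal does.

```c
#include <stdbool.h>

/* Assumed values for illustration only. */
#define HZ        250UL
#define LOAD_FREQ (5 * HZ + 1)   /* one sample window every ~5 s */
#define GRACE     10UL           /* assumed slack after a window opens */

/* Wrap-safe "a is before b", in the spirit of time_before(). */
static bool before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * Model of the problem: the stale per-CPU value is tested first, so a
 * wakeup that happens before the forthcoming global window is handled
 * as if a window had been crossed, and an extra LOAD_FREQ is added.
 */
static unsigned long exit_nohz_old(unsigned long now,
				   unsigned long rq_update,
				   unsigned long global_update)
{
	if (before(now, rq_update))
		return rq_update;	/* still before our pending window */

	rq_update = global_update;	/* resync with the global window */
	if (before(now, rq_update + GRACE))
		rq_update += LOAD_FREQ;	/* wrong if no window was crossed */
	return rq_update;
}

/* Hypothetical alternative: resync first, then test the fresh value. */
static unsigned long exit_nohz_resync_first(unsigned long now,
					    unsigned long global_update)
{
	unsigned long rq_update = global_update;

	if (before(now, rq_update))
		return rq_update;	/* no window crossed while idle */
	if (before(now, rq_update + GRACE))
		rq_update += LOAD_FREQ;
	return rq_update;
}
```

With a stale per-CPU value of 10000, a global window at 10000 + LOAD_FREQ and
a wakeup at tick 10100, the first helper returns 10000 + 2 * LOAD_FREQ while
the second returns 10000 + LOAD_FREQ, i.e. the first one skips a whole window.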

I'll comment on the change itself in my reply to Peter's proposal.

Thanks!