linux-kernel - Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till after load update period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20100401122740.a1ae80a7.akpm@linux-foundation.org>
Date:	Thu, 1 Apr 2010 12:27:40 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Chase Douglas <chase.douglas@...onical.com>
Cc:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
	Peter Zijlstra <peterz@...radead.org>,
	Thomas Gleixner <tglx@...utronix.de>,
	"Rafael J. Wysocki" <rjw@...k.pl>
Subject: Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting
 till after load update period

On Mon, 29 Mar 2010 09:41:12 -0400
Chase Douglas <chase.douglas@...onical.com> wrote:

> There's a period of 10 ticks where calc_load_tasks is updated by all the
> cpus for the load avg. Usually all the cpus do this during the first
> tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
> However, if they wake up calc_load_tasks is not incremented. Thus, if
> cpus go idle during the 10 tick period, calc_load_tasks may be
> decremented to a non-representative value. This issue can lead to
> systems having a load avg of exactly 0, even though the real load avg
> could theoretically be up to NR_CPUS.
> 
> This change defers calc_load_tasks accounting after each cpu updates the
> count until after the 10 tick period.
> 
> BugLink: http://bugs.launchpad.net/bugs/513848
> 

There was useful information in the [patch 0/1] email, such as the
offending commit ID.  If possible, it's best to avoid the [patch 0/n]
thing altogether - that information either has to be moved into the
[patch 1/n] changelog by someone (ie: me), or it just gets ommitted and
lost.


> ---
>  kernel/sched.c |   16 ++++++++++++++--
>  1 files changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 9ab3cd7..c0aedac 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>   */
>  static void calc_load_account_active(struct rq *this_rq)
>  {
> -	long nr_active, delta;
> +	static atomic_long_t deferred;
> +	long nr_active, delta, deferred_delta;
>  
>  	nr_active = this_rq->nr_running;
>  	nr_active += (long) this_rq->nr_uninterruptible;
> @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>  	if (nr_active != this_rq->calc_load_active) {
>  		delta = nr_active - this_rq->calc_load_active;
>  		this_rq->calc_load_active = nr_active;
> +
> +		/* Need to defer idle accounting during load update period: */
> +		if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
> +			     time_after_eq(jiffies, calc_load_update))) {
> +			atomic_long_add(delta, &deferred);
> +			return;
> +		}

That seems a sensible way to avoid the gross-looking "10 ticks" thing.

What was the reason for "update the avenrun load estimates 10 ticks
after the CPUs have updated calc_load_tasks"?  Can we do something
smarter there to fix this?

> +		deferred_delta = atomic_long_xchg(&deferred, 0);
> +		delta += deferred_delta;
> +
>  		atomic_long_add(delta, &calc_load_tasks);
>  	}
>  }

The global `deferred' is unfortunate from a design and possibly
scalability POV.  Can it be moved into the `struct rq'?  That way it
can become a plain old `unsigned long', too.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/