Message-ID: <h2s40ec3ea41004011300zaa5699f0kcf07434fec9b1aab@mail.gmail.com>
Date: Thu, 1 Apr 2010 16:00:24 -0400
From: Chase Douglas <chase.douglas@...onical.com>
To: Thomas Gleixner <tglx@...utronix.de>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...e.hu>,
Peter Zijlstra <peterz@...radead.org>,
"Rafael J. Wysocki" <rjw@...k.pl>,
kernel-team <kernel-team@...ts.ubuntu.com>
Subject: Re: [REGRESSION 2.6.30][PATCH 1/1] sched: defer idle accounting till
after load update period
On Thu, Apr 1, 2010 at 3:37 PM, Thomas Gleixner <tglx@...utronix.de> wrote:
> On Thu, 1 Apr 2010, Andrew Morton wrote:
>> On Mon, 29 Mar 2010 09:41:12 -0400
>> Chase Douglas <chase.douglas@...onical.com> wrote:
>>
>> > There's a period of 10 ticks where calc_load_tasks is updated by all the
>> > cpus for the load avg. Usually all the cpus do this during the first
>> > tick. If any cpus go idle, calc_load_tasks is decremented accordingly.
>> > However, if they wake up, calc_load_tasks is not incremented. Thus, if
>> > cpus go idle during the 10 tick period, calc_load_tasks may be
>> > decremented to a non-representative value. This issue can lead to
>> > systems having a load avg of exactly 0, even though the real load avg
>> > could theoretically be up to NR_CPUS.
>> >
>> > This change defers any further calc_load_tasks accounting, once each
>> > cpu has updated the count, until after the 10 tick period.
>> >
>> > BugLink: http://bugs.launchpad.net/bugs/513848
>> >
>>
>> There was useful information in the [patch 0/1] email, such as the
>> offending commit ID. If possible, it's best to avoid the [patch 0/n]
>> thing altogether - that information either has to be moved into the
>> [patch 1/n] changelog by someone (ie: me), or it just gets omitted and
>> lost.
>>
>>
>> > ---
>> > kernel/sched.c | 16 ++++++++++++++--
>> > 1 files changed, 14 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/kernel/sched.c b/kernel/sched.c
>> > index 9ab3cd7..c0aedac 100644
>> > --- a/kernel/sched.c
>> > +++ b/kernel/sched.c
>> > @@ -3064,7 +3064,8 @@ void calc_global_load(void)
>> > */
>> > static void calc_load_account_active(struct rq *this_rq)
>> > {
>> > - long nr_active, delta;
>> > + static atomic_long_t deferred;
>> > + long nr_active, delta, deferred_delta;
>> >
>> > nr_active = this_rq->nr_running;
>> > nr_active += (long) this_rq->nr_uninterruptible;
>> > @@ -3072,6 +3073,17 @@ static void calc_load_account_active(struct rq *this_rq)
>> > if (nr_active != this_rq->calc_load_active) {
>> > delta = nr_active - this_rq->calc_load_active;
>> > this_rq->calc_load_active = nr_active;
>> > +
>> > + /* Need to defer idle accounting during load update period: */
>> > + if (unlikely(time_before(jiffies, this_rq->calc_load_update) &&
>> > + time_after_eq(jiffies, calc_load_update))) {
>> > + atomic_long_add(delta, &deferred);
>> > + return;
>> > + }
>>
>> That seems a sensible way to avoid the gross-looking "10 ticks" thing.
>>
>> What was the reason for "update the avenrun load estimates 10 ticks
>> after the CPUs have updated calc_load_tasks"? Can we do something
>> smarter there to fix this?
>
> The reason was to make sure that all cpus have updated their stuff
> halfway before we start to calculate, and we spread the updates out
> among cpus to avoid contention on those global atomic variables.
>
> Yes, we can do something smarter. First thing is to fix the
> nr_uninterruptible accounting which can become negative. Peter looked
> into that already and we really need to fix this first before doing
> anything smart about that load avg stuff.
I noticed this too, and figured it was some clever accounting :). If
this is a real bug, I'd be happy to take a look at it as well.
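My guess at the "clever accounting", in case it's useful: a task bumps
nr_uninterruptible on the runqueue it blocks on, but the wakeup path
decrements whichever runqueue the task gets woken onto, so an individual
cpu's counter can dip below zero even though the sum across all cpus
stays consistent. A toy model of that (plain userspace C, my own names,
not kernel code):

/* Toy model of per-runqueue nr_uninterruptible skew -- illustrative
 * only; nothing below is actual kernel code. */
#include <stdio.h>

int main(void)
{
	long nr_uninterruptible[2] = { 0, 0 };

	nr_uninterruptible[0]++;	/* task enters D state on cpu0     */
	nr_uninterruptible[1]--;	/* wakeup lands on cpu1's runqueue */

	printf("cpu0=%ld cpu1=%ld sum=%ld\n",
	       nr_uninterruptible[0], nr_uninterruptible[1],
	       nr_uninterruptible[0] + nr_uninterruptible[1]);
	/* prints: cpu0=1 cpu1=-1 sum=0 */
	return 0;
}

If what you mean is the summed count itself going negative, then I've
misread it and the above isn't the whole story.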
What do you have in mind as a smarter way of fixing this issue, and
why is it dependent on fixing the absolute values of each runqueue's
nr_uninterruptible count?
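For reference while we hash that out, here's the arithmetic that makes
the symptom so visible. avenrun[] is a fixed-point exponential average
fed from calc_load_tasks, so a sample window that happens to see zero
tasks knocks roughly 8% off the 1-minute figure, and a run of zero
samples walks it down to 0.00. A quick userspace rendition (constants
copied from include/linux/sched.h, the rest is just a sketch):

#include <stdio.h>

#define FSHIFT	11			/* bits of fixed-point precision  */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed-point             */
#define EXP_1	1884			/* 1-min decay: 2048/exp(5/60)    */

/* same update the kernel applies to avenrun[] every LOAD_FREQ (~5s) */
static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

int main(void)
{
	unsigned long avenrun = 2 * FIXED_1;	/* pretend loadavg is 2.00 */
	int i;

	for (i = 1; i <= 12; i++) {		/* ~1 minute of zero samples */
		avenrun = calc_load(avenrun, EXP_1, 0);
		printf("sample %2d: %lu.%02lu\n", i, avenrun >> FSHIFT,
		       (avenrun & (FIXED_1 - 1)) * 100 / FIXED_1);
	}
	return 0;
}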
-- Chase