Message-ID: <1339429374.30462.54.camel@twins>
Date: Mon, 11 Jun 2012 17:42:54 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Charles Wang <muming.wq@...il.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Charles Wang <muming.wq@...bao.com>, Tao Ma <tm@....ma>
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate
On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
> the idle into calc_load_tasks_idle between the last cpu's load calculation and
> the calc_global_load() call. However, the problem still exists between the
> first cpu's load calculation and the last cpu's load calculation. Every time we
> do the load calculation, calc_load_tasks_idle is added into calc_load_tasks,
> even if the idle load was caused by already-calculated cpus. This problem is
> also described in the following link:
>
> https://lkml.org/lkml/2012/5/24/419
Don't put links in like this; if it's useful, describe it. As it stands, I
think your paragraph and that link are trying to say the same thing.
Also, since you found Tao's email, why didn't you CC him?
I find it hard to understand either description.
Are you in fact saying:
  cpu0                        cpu1                        cpu2

  calc_load_account_active()

                              calc_load_account_active()

                              -> idle
                      ,------ calc_load_account_idle()
                      |
                      |                                   calc_load_account_active()
                      `---------------------------------> calc_load_fold_idle()
That when a cpu goes idle during the per-cpu active folding window, after
its own fold but before the next cpu's fold, its calc_load_tasks_idle
contribution gets folded back in by that next cpu. As per the above picture.
Now _why_ is this a problem?
All this code tries to do is:
  nr_active = 0;
  for_each_possible_cpu(cpu)
          nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)->nr_uninterruptible;
Without actually iterating all cpus because on large systems that's
prohibitively expensive.
So instead we do the calc_load_fold_active() thing on each individual
cpu:
  nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
  delta = nr_active - this_rq->calc_load_active;
  this_rq->calc_load_active = nr_active;
  atomic_long_add(delta, &calc_load_tasks);
The first does a straight sum, the second does a sum of deltas, both
should give the same number.
  \Sum_i x_i(t) = \Sum_i (x_i(t) - x_i(t_0))                  | x_i(t_0) := 0
                = \Sum_i \Sum_{j>=1} (x_i(t_j) - x_i(t_{j-1}))
The whole NO_HZ thing complicates this a little further in order to not
have to wake an idle cpu. I'll try and do the above sum with the nohz
thing in.. maybe something obvious falls out.
Now, all this is sampling: we take an nr_active sample once every
LOAD_FREQ (5s). Anything that happens inside that window doesn't exist.
So how is the above scenario different from the cpu going idle right
before we take the sample?
> This bug shows up in our workload: the average number of running processes
> is about 15, but the load only shows about 4.
Right, still not really sure what the bug is, though.
> The patch provides a solution, by taking calculated load cpus' idle away from
> real effective idle.
-ENOPARSE
> First, add a cpumask to record those cpus that have already
> calculated their load, and then add a calc_unmask_cpu_load_idle to record
> those unmarked cpus' go-idle load. calc_unmask_cpu_load_idle takes the place
> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ
5 seconds
> when a cpu
> calculates its load. Go-idle load on those cpus whose load has already been
> calculated will only be added into calc_load_tasks_idle, not into
> calc_unmask_cpu_load_idle.
I don't like the solution; keeping that mask is expensive. Also, I still
don't understand what the problem is.
A few nits on the patch below..
> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
> kernel/time/timekeeping.c | 1 +
> 3 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6029d8c..a2b8df2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>
>
> extern void calc_global_load(unsigned long ticks);
> +extern void prepare_idle_mask(unsigned long ticks);
> extern void update_cpu_load_nohz(void);
>
> extern unsigned long get_parent_ip(unsigned long addr);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c46958e..bdfe3c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
> /* Variables and functions for calc_load */
> static atomic_long_t calc_load_tasks;
> static unsigned long calc_load_update;
> +static unsigned long idle_mask_update;
> unsigned long avenrun[3];
> EXPORT_SYMBOL(avenrun);
>
> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
> */
> static atomic_long_t calc_load_tasks_idle;
>
> +/*
> + * Those cpus whose load alread has been calculated in this LOAD_FREQ
> + * period will be masked.
> + */
> +struct cpumask cpu_load_update_mask;
This should be a cpumask_var_t and allocated somewhere..
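For reference, the pattern being pointed at looks something like the
following (a kernel-style sketch, not runnable on its own; the exact init
location is an assumption):

```c
static cpumask_var_t cpu_load_update_mask;

/* in some init path, e.g. sched_init(): */
if (!zalloc_cpumask_var(&cpu_load_update_mask, GFP_NOWAIT))
	panic("couldn't allocate cpu_load_update_mask");
```

With CONFIG_CPUMASK_OFFSTACK=y a cpumask_var_t is heap-allocated and sized to
nr_cpu_ids, rather than embedding a full NR_CPUS-bit mask in .data.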
> +
> +/*
> + * Fold unmask cpus' idle load
> + */
> +static atomic_long_t calc_unmask_cpu_load_idle;
> +
> +
> void calc_load_account_idle(struct rq *this_rq)
> {
> long delta;
> + int cpu = smp_processor_id();
> +
>
> delta = calc_load_fold_active(this_rq);
> if (delta)
> + {
wrong bracket placement
> atomic_long_add(delta, &calc_load_tasks_idle);
> + /*
> + * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
> + * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
> + * those unmask cpus.
> + */
> + if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
> + {
> + atomic_long_add(delta, &calc_unmask_cpu_load_idle);
Lines are too long, the brackets are placed wrong, and the brackets are
superfluous anyway.
> + }
> + }
> }
>
and many more similar nits.
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 6e46cac..afbc06a 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
> jiffies_64 += ticks;
> update_wall_time();
> calc_global_load(ticks);
> + prepare_idle_mask(ticks);
Why not add this to calc_global_load() ?
> }
>
> /**