linux-kernel - Re: [PATCH] sched: Folding nohz load accounting more accurate

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 12 Jun 2012 17:34:10 +0800
From:	Charles Wang <muming.wq@...il.com>
To:	Peter Zijlstra <peterz@...radead.org>
CC:	linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
	Charles Wang <muming.wq@...bao.com>, Tao Ma <tm@....ma>,
	含黛 <handai.szj@...bao.com>
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

OK! Configure the Thunderbird as Documentation/email-clients.txt saying.
Sorry for last email contained with html. No html now.

On Monday, June 11, 2012 11:42 PM, Peter Zijlstra wrote:

> On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
>> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
>> the idle into calc_load_tasks_idle between the last cpu load calculating and
>> calc_global_load calling. However problem still exits between the first cpu 
>> load calculating and the last cpu load calculating. Every time when we do load 
>> calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
>> the idle load is caused by calculated cpus. This problem is also described in
>> the following link:
>>
>> https://lkml.org/lkml/2012/5/24/419
> 
> Don't put links in like this, if its useful describe it. As it stands I
> think your paragraph and that link are trying to say the same.
> 
> Also, since you found Tao's email, why didn't you CC him?
> 

My mistake! And this problem is first reported by Sha Zhengju
<handai.szj@...bao.com>, so i also add her in.

> I find it hard to understand either description.
> 
> Are you in fact saying:
> 
> 
> cpu0		cpu1		cpu2
> 
> calc_load_account_active()
> 
> 		calc_load_account_active()
> 		-> idle
> 	,----	   calc_load_account_idle()
> 	|
> 	|			calc_load_account_active()
> 	`--------->		  calc_load_fold_idle()
> 
> 
> That when a cpu goes idle during the per-cpu active folding window
> it might happen that a cpu goes idle after its fold, but before a next
> cpu's fold, so that its calc_load_tasks_idle contribution gets folded
> back in. As per the above picture.
> 
> Now _why_ is this a problem? 
> 

consider following case:

5HZ+1
| cpu0_load	cpu1	cpu2	cpu3	calc_load_tasks
|    1		 1	 1	 1	
|  -->calc_load				    1
|    1		 1	 1	 1	
|       	     -->calc_load	    2
|    0		 0	 1	 0
|		     -->calc_load	    2+1-3=1
|    1		 1	 0	 1	
|			     -->calc_load   1-1=0
V
5HZ+11     -->calc_global_load              0



actually the load should be around 3, but shows nearly 0.

1 tick is much long for some workloads.

> All this code tries to do is:
> 
>   nr_active = 0;
>   for_each_possible_cpu(cpu)
> 	nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)
> 
> Without actually iterating all cpus because on large systems that's
> prohibitively expensive.
> 
> So instead we do the calc_load_fold_active() thing on each individual
> cpu:
> 
>   nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
>   delta = nr_active - this_rq->calc_load_active;
>   this_rq->calc_load_active = nr_active;
> 
>   atomic_long_add(delta, &calc_load_tasks);
> 
> The first does a straight sum, the second does a sum of deltas, both
> should give the same number.
> 
> 
>   \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
>                 = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
> 
> The whole NO_HZ thing complicates this a little further in order to not
> have to wake an idle cpu. I'll try and do the above sum with the nohz
> thing in.. maybe something obvious falls out.
> 
> Now all this is sampling.. we take a nr_active sample once every
> LOAD_FREQ (5s). Anything that happens inside this doesn't exist.
> 
> So how is the above scenario different from the cpu going idle right
> before we take the sample?
> 
>> This bug can be found in our work load. The average running processes number 
>> is about 15, but the load only shows about 4.
> 
> Right,. still not really sure what the bug is though.
> 
>> The patch provides a solution, by taking calculated load cpus' idle away from
>>  real effective idle. 
> 
> -ENOPARSE
> 
>> First adds a cpumask to record those cpus that alread 
>> calculated their load, and then adds a calc_unmask_cpu_load_idle to record 
>> thoses not marked cpus' go-idle load. Calc_unmask_cpu_load_idle takes place 
>> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ
> 
> 5 seconds
> 
>>  when cpu 
>> calculate its load. Go-idle load on those cpus which load alread has been calculated 
>> will only be added into calc_load_tasks_idle, no in calc_unmask_cpu_load_idle.
> 
> I don't like the solution, keeping that mask is expensive. Also, I still
> don't understand what the problem is.
> 

It's a problem. And I plan to use per-cpu var to take place of cpumask.

> A few nits on the patch below..
> 
>> ---
>>  include/linux/sched.h     |    1 +
>>  kernel/sched/core.c       |   83 ++++++++++++++++++++++++++++++++++++++++++++-
>>  kernel/time/timekeeping.c |    1 +
>>  3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6029d8c..a2b8df2 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>>  
>>
>>  extern void calc_global_load(unsigned long ticks);
>> +extern void prepare_idle_mask(unsigned long ticks);
>>  extern void update_cpu_load_nohz(void);
>>  
>>  extern unsigned long get_parent_ip(unsigned long addr);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c46958e..bdfe3c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
>>  /* Variables and functions for calc_load */
>>  static atomic_long_t calc_load_tasks;
>>  static unsigned long calc_load_update;
>> +static unsigned long idle_mask_update;
>>  unsigned long avenrun[3];
>>  EXPORT_SYMBOL(avenrun);
>>  
>> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
>>   */
>>  static atomic_long_t calc_load_tasks_idle;
>>  
>> +/* 
>> + * Those cpus whose load alread has been calculated in this LOAD_FREQ 
>> + * period will be masked.
>> + */
>> +struct cpumask  cpu_load_update_mask;
> 
> This should be a cpumask_var_t and allocated somewhere..
> 
>> +
>> +/* 
>> + * Fold unmask cpus' idle load
>> + */
>> +static atomic_long_t calc_unmask_cpu_load_idle;
>> +
>> +
>>  void calc_load_account_idle(struct rq *this_rq)
>>  {
>>  	long delta;
>> +	int cpu = smp_processor_id();
>> +
>>  
>>  	delta = calc_load_fold_active(this_rq);
>>  	if (delta)
>> +	{
> 
> wrong bracket placement
> 
>>  		atomic_long_add(delta, &calc_load_tasks_idle);
>> +		/*
>> +		 * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
>> +		 * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
>> +		 * those unmask cpus.
>> +		 */
>> +		if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
>> +		{
>> +			atomic_long_add(delta, &calc_unmask_cpu_load_idle);
> 
> lines are too long, brackets are placed wrong and brackets are
> superfluous.
> 
> 
>> +		}
>> +       }
>>  }
>>  
> 
> and many more similar nits.
> 

Thanks for pointing these out!

>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 6e46cac..afbc06a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
>>  	jiffies_64 += ticks;
>>  	update_wall_time();
>>  	calc_global_load(ticks);
>> +	prepare_idle_mask(ticks);
> 
> Why not add this to calc_global_load() ? 
> 

If some CPUs go into nohz and get up after calc_global_load,
then calc_load_account_active will be called after calc_global_load.
This will cause identifying of first calc_load_account_active cpu being
much more complicated.

>>  }
>>  
>>  /**
> 


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/